Re: [galaxy-dev] Deploying LOC files for tool built-in data during a tool installation

15 Oct 2013

Hi all,

I think what we have are two similar, but somewhat separate problems:
1.) We need a way via the UI for an admin to be able to add additional configuration entries to data tables / .loc files.
2.) We need a way to bootstrap/initialize a Galaxy installation with data table / .loc file entries ('built-in data') during installation, for:
	a.) a 'production' Galaxy instance - this would include local dev/testing/etc instances
	b.) the automated testing framework - tests should run fast, but meaningfully test a tool, e.g., the horse mitochondrial genome could be a fine built-in genome for running automated tool tests, but it should not be automatically installed into a production Galaxy instance

For 1.), we now have Data Managers. A Data Manager does all the heavy lifting of adding data table entries; for bwa, for example, it can build the mapping indexes and add the properly delimited line to the .loc file. Data Managers are accessed through the admin interface, under Manage local data. They are installed from a ToolShed, or can be installed manually. In addition to direct interactive usage, Data Manager tools can be included in workflows or accessed via the tools API. Using a Data Manager not only removes the technical burden of adding new entries to a data table / .loc file, it also provides the same reproducibility and provenance tracking that is afforded to regular Galaxy tools. The documentation for Data Managers is currently limited to the tutorial-style doc at http://wiki.galaxyproject.org/Admin/Tools/DataManagers/HowTo/Define (a more formal, config-syntax type of page will also be made available, although the tutorial is a fairly complete description of the steps needed to define a Data Manager).
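
To make "the properly delimited line" concrete, here is a minimal sketch of what appending an entry to a tab-delimited .loc file by hand amounts to. This is not the Data Manager code; the function names are hypothetical, and the four-column layout in the usage note (build id, dbkey, display name, path) is only an assumed example of a bwa-style table.

```python
def format_loc_line(fields):
    """Join column values into one tab-delimited .loc line.

    Hypothetical helper for illustration; a real Data Manager does this
    for you and also records provenance.
    """
    for value in fields:
        # A tab or newline inside a value would corrupt the column layout.
        if "\t" in value or "\n" in value:
            raise ValueError("field contains a tab or newline: %r" % value)
    return "\t".join(fields) + "\n"


def append_loc_entry(loc_path, fields):
    """Append one entry to the .loc file at loc_path and return the line."""
    line = format_loc_line(fields)
    with open(loc_path, "a", encoding="utf-8") as handle:
        handle.write(line)
    return line
```

For example, append_loc_entry("bwa_index.loc", ["hg19", "hg19", "Human - hg19", "/data/hg19/bwa/hg19.fa"]) would add a single line with those four values separated by tabs (the path here is made up).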

For 2.): bootstrapping data during an installation process is something that still needs to be more completely spec'd out and implemented. This bootstrapping process should be able to make use of the Data Managers or to download/move/utilize pre-built configurations. (The underlying action of a Data Manager can itself be a download, e.g. the fetch genomes data manager.)

Let's start by considering the users' point of view. We have two types of users, GalaxyAdmin and ToolDev, and will use a BWA tool as an example.
GalaxyAdmin: 
	Clicks buttons to install tool suite that includes the BWA tool and a BWA indexer Data Manager. (so far there is no change from how it works now)
	The Galaxy installer methodology recognizes that it is possible to add built-in data:
		Some preassembled mapping indexes are available (pre-built built-in)
		Mapping indexes can be created for any entry in the all_fasta data table.
	The User clicks checkboxes/multiple selects for preassembled data to download and also selects the fasta entries to be indexed with the Data Manager tool.

ToolDev:
	In the ToolShed repository, needs to provide a description of 2a and 2b above; for simplicity we can assume b is a subset of a, but with a different attribute/flag (e.g. test_only, 'real', both) or perhaps a different filename; abstractly, they are the same thing just run at different times, with the testing ones not requiring user interaction/selection.

So, the real question becomes: what does this description look like? It is probably an XML file; for now let's call it '__data_table_bootstrap__.xml' (alternatively, we can roll it directly into the existing data_manager_conf.xml files in the toolshed, although for a list of static downloads we don't need an actual data manager tool). It could look something like this (quick and dirty pass, elements and values are made up):

		<data_table name="bwa_index" production="True" testing="False"> <!-- both would default to True -->
			<data_manager id="bwa_indexer">
				<param name="list_of_fasta_files_from_all_fasta"/> <!-- corresponds to the all_fasta parameter value in the data manager, cycles over each of the fasta values to provide selections -->
			</data_manager>
			<download code="script_for_prebuilt.py"> <!-- could be static list or some dynamically determined listing -->
				<available method="get_list" /> <!-- returns sets of parameter values for available data to download -->
				<fetch method="get_indexes" /> <!-- takes the values selected from above available, returns list of URI source and relative target -->
			</download>
			<download> <!-- could be static list or some dynamically determined listing -->
				<entry>
					<field name="dbkey" value="hg19"/>
					<field name="description" value="Human - hg19"/>
					<files>
						<file source="http://file1" target="relative/path/to/file1"/>
						<file source="http://file2" target="relative/path/to/file2"/>
					</files>
				</entry>
			</download>
		</data_table>
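
As a rough sketch of how an installer might read the static <download> part of such a file, the snippet below parses a trimmed copy with Python's standard xml.etree.ElementTree. It assumes a well-formed variant of the description in which each file is a <file> element with source and target attributes; the function names and the returned dictionary layout are invented for illustration, just like the XML itself.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the made-up description (assumed layout).
BOOTSTRAP_XML = """
<data_table name="bwa_index" production="True" testing="False">
    <download>
        <entry>
            <field name="dbkey" value="hg19"/>
            <field name="description" value="Human - hg19"/>
            <files>
                <file source="http://file1" target="relative/path/to/file1"/>
                <file source="http://file2" target="relative/path/to/file2"/>
            </files>
        </entry>
    </download>
</data_table>
"""


def as_bool(value, default=True):
    """Interpret 'True'/'False' attribute text; a missing attribute means True."""
    if value is None:
        return default
    return value.strip().lower() == "true"


def static_entries(xml_text, mode):
    """Return the static download entries that apply to mode.

    mode is 'production' or 'testing', matching the attribute names on
    <data_table>. Hypothetical helper, not part of Galaxy.
    """
    root = ET.fromstring(xml_text)
    if not as_bool(root.get(mode)):
        return []
    entries = []
    for entry in root.findall("./download/entry"):
        fields = {f.get("name"): f.get("value") for f in entry.findall("field")}
        files = [(f.get("source"), f.get("target"))
                 for f in entry.findall("./files/file")]
        entries.append({"table": root.get("name"), "fields": fields, "files": files})
    return entries
```

With this, static_entries(BOOTSTRAP_XML, "production") yields the hg19 entry with its two source/target pairs, while static_entries(BOOTSTRAP_XML, "testing") yields nothing because testing="False". The same per-table flags are what would let the test framework skip any user interaction.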

Any thoughts?

Thanks,

Dan

On Oct 8, 2013, at 4:26 PM, Guest, Simon wrote:
...
...
I look forward to some more details from Dan on *.loc
file setup.
Hi Peter, Dan and all,
What a timely discussion!  I am just in the process of setting up loc files for some new indexes I have created (bowtie2, etc), and would really like to see this automated.
I see there is a Galaxy script scripts/loc_files/create_all_fasta_loc.py, which is quite sophisticated, and does this job nicely for all_fasta.loc.  I'm feeling an urge to somehow extend this script to cope with other datatypes besides fasta, but am wondering if this will be wasted effort if there will soon be a better way to handle this.
Can Dan or anyone else comment on this?
cheers,
Simon
On Oct 8, 2013, at 11:25 AM, Peter Cock wrote:
...
On Tue, Oct 8, 2013 at 4:13 PM, Greg Von Kuster <greg@bx.psu.edu> wrote:
...
...
...
I don't agree with this - the sample files should be used as guidance for
the admin to create functionally correct .loc files.  This is the same
approach used for all Galaxy .sample files ( e.g., universe_wsgi.ini.sample
<-> universe_wsgi.ini, etc )
Why then does the tool_conf.xml.sample file get used by the
test framework? This is a clear example of *.xml.sample
being used in the test framework over the 'real' file *.xml.
I really don't understand this design choice - I would use
tool_conf.xml (it lists the tools actually installed on our Galaxy,
and therefore the things worth testing) while by default
tool_conf.xml.sample includes a whole load of things where
the binaries etc are missing and so the tests will fail (hiding
potential real failures in the noise).
I'm not quite sure of the reason for this as I didn't make this
design choice - I'm sure "ancient Galaxy history" plays a role
in this decision.
Probably ;)
...
...
Perhaps rather than overloading *.loc.sample with two roles
(sample configuration/documentation and unit tests), we
need to introduce *.loc.test for functional testing purposes?
I'm hoping we don't have to go this route as we have so many
priorities.  If you would like this implemented though, please
add a new Trello card and we'll consider it.
Filed: https://trello.com/c/P90b5Pa0/1165-functional-tests-need-separate-loc-files-...
...
...
That still leaves open the question of how best to install
the test databases or files that the *.loc.test file would
point at for running functional tests.
Yes!
I look forward to some more details from Dan on *.loc
file setup.
Thank you,
Peter

Daniel Blankenberg