Deploying LOC files for tool built-in data during a tool installation
Hi list,
The tool I am currently wrapping has built-in data, which may be used by the tool users (through a relevant <from_data_table> + .LOC file configuration). They are .fasta databases which are rather small and are thus bundled in the tool distribution package.
Thanks to the tool_dependencies.xml file, said distribution package is downloaded at install time, code is compiled, and since they are here, the data files are copied to $INSTALL_DIR too, ready to be used.
After that, the user still has to edit tool-data/my_fancy_data_files.loc; but the thing is, during the install I know where these data files are (since I copied those there), so I would like to save the user the trouble and set up this file automagically.
I would have two questions:
1/ Is it okay to have tool built-in data files in $INSTALL_DIR, or would it be considered bad practice?
2/ Is there a way to set up tool-data/my_fancy_data_files.loc during the install? Here are the options I thought of:
* shipping a “real” my_fancy_data_files.loc.sample with the good paths already set up, which is going to be copied as the .loc file (a rather ugly hack)
* using more <action type="shell_command"> steps during install to create my_fancy_data_files.loc (but deploying this file is not part of the tool dependency install per se)
* a variant of the previous: shipping my_fancy_data_files.loc as part of the tool distribution package, and copying it through shell_command (same concern as above).
Any thoughts?
Cheers,
-- Jean-Frédéric Bonsai Bioinformatics group
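(For illustration, a tool_dependencies.xml along the lines described above might look roughly as follows. The package name, download URL and directory names are hypothetical; only the <action> types are standard Tool Shed actions, and the exact layout would of course depend on the tool.)

<?xml version="1.0"?>
<tool_dependency>
    <package name="my_fancy_tool" version="1.0">
        <install version="1.0">
            <actions>
                <!-- hypothetical URL: fetch and unpack the distribution that bundles the small .fasta databases -->
                <action type="download_by_url">http://example.org/downloads/my_fancy_tool-1.0.tar.gz</action>
                <!-- compile the tool -->
                <action type="shell_command">make</action>
                <!-- copy the bundled data files into $INSTALL_DIR so they are ready to be used -->
                <action type="move_directory_files">
                    <source_directory>data</source_directory>
                    <destination_directory>$INSTALL_DIR/data</destination_directory>
                </action>
                <!-- expose the data location to the tool wrapper via an environment variable -->
                <action type="set_environment">
                    <environment_variable name="MY_FANCY_TOOL_DATA_DIR" action="set_to">$INSTALL_DIR/data</environment_variable>
                </action>
            </actions>
        </install>
        <readme>Bundled .fasta databases are installed under $INSTALL_DIR/data.</readme>
    </package>
</tool_dependency>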
Hello Jean-Frédéric, Sorry for the delay in this response. Please see my inline comments. On Feb 8, 2013, at 10:33 AM, Jean-Frédéric Berthelot wrote:
Hi list,
The tool I am currently wrapping has built-in data, which may be used by the tool users (through a relevant <from_data_table> + .LOC file configuration). They are .fasta databases which are rather small and are thus bundled in the tool distribution package.
Thanks to the tool_dependencies.xml file, said distribution package is downloaded at install time, code is compiled, and since they are here, the data files are copied to $INSTALL_DIR too, ready to be used.
After that, the user still has to edit tool-data/my_fancy_data_files.loc ; but the thing is, during the install I know where these data files are (since I copied those there), so I would like to save the user the trouble and set up this file automagically.
I would have two questions:
1/ Is it okay to have tool built-in data files in $INSTALL_DIR, or would it be considered bad practice?
This is difficult to answer. Generally, data files should be located in a shared location so that other tools can access them as well. However, there are potentially exceptions to this that are acceptable. The fact that the fasta data files are small and you are using a tool_dependencies.xml file to define a relationship to them for your tools is a good approach because it allows the data files to be used by other tools in separate repositories via a complex repository dependency definition in the remote repository.
If these fasta data files are available for download via a clone or a url, then in the near future the new Galaxy Data Manager (which uses a new, special category of Galaxy tools which are of type "data_manager") may be useful in this scenario. Data Manager tools can be associated with tools in a repository like yours using repository dependency definitions, so they will be installed along with the selected repository. These data manager tools allow for specified data to be installed into the Galaxy environment for use by tools. This new component is not yet released, but it is close. In the meantime, your approach is the only way to make this work.
If your files are not downloadable, then we might plan to allow simplified bootstrapping of .loc files in the tool-data directory with files included in the repository. This would take some planning, and its availability would not be in the short term.
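(For reference, the kind of complex repository dependency Greg mentions is declared in the depending repository's tool_dependencies.xml, roughly as below; the repository name, owner and changeset revision are placeholders.)

<tool_dependency>
    <package name="my_fancy_fasta_data" version="1.0">
        <!-- placeholder name/owner/revision: points at the separate Tool Shed repository that ships the data -->
        <repository toolshed="http://toolshed.g2.bx.psu.edu" name="package_my_fancy_fasta_data_1_0" owner="some_owner" changeset_revision="0123456789ab" />
    </package>
</tool_dependency>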
2/ Is there a way to set up the tool-data/my_fancy_data_files.loc during the install? Here are the options I thought of: * shipping a “real” my_fancy_data_files.loc.sample with the good paths already set up, which is going to be copied as the .loc file (a rather ugly hack)
Assuming you use a file name that is not already in the Galaxy tool-data subdirectory, the above approach is probably the only way you can do this in a fully automated way right now. Again, when the new Data Manager is released, it will handle this kind of automated configuration. But in the meantime, manual intervention is generally required to add the information to the appropriate .loc files in the tool-data directory.
* using more <action type="shell_command"> steps during install to create my_fancy_data_files.loc (but deploying this file is not part of the tool dependency install per se)
I advise against the above approach. The "best practice" use of tool dependency definitions is to restrict movement of files to locations within the defined $INSTALL_DIR (the installation directory of the tool dependency package) or $REPOSITORY_INSTALL_DIR (the installation directory of the repository), which is set at installation time. Hard-coding file paths in <action> tags is fragile, and not recommended.
* a variant of the previous: shipping my_fancy_data_files.loc as part of the tool distribution package, and copying it through shell_command (same concern as above).
The above approach is not recommended either - same issue as above.
Any thoughts?
Cheers,
-- Jean-Frédéric Bonsai Bioinformatics group
Thanks very much Jean-Frédéric, Greg Von Kuster
Hi Greg, Jean-Frédéric, I'm returning to this old thread rather than starting a new one, since it is nicely aligned with something I wanted to raise. On Tue, Feb 19, 2013 at 2:23 PM, Greg Von Kuster <greg@bx.psu.edu> wrote:
Hello Jean-Frédéric,
Sorry for the delay in this response. Please see my inline comments.
On Feb 8, 2013, at 10:33 AM, Jean-Frédéric Berthelot wrote:
Hi list,
The tool I am currently wrapping has built-in data, which may be used by the tool users (through a relevant <from_data_table> + .LOC file configuration). They are .fasta databases which are rather small and are thus bundled in the tool distribution package.
Thanks to the tool_dependencies.xml file, said distribution package is downloaded at install time, code is compiled, and since they are here, the data files are copied to $INSTALL_DIR too, ready to be used.
After that, the user still has to edit tool-data/my_fancy_data_files.loc ; but the thing is, during the install I know where these data files are (since I copied those there), so I would like to save the user the trouble and set up this file automagically.
I would have two questions:
1/ Is it okay to have tool built-in data files in $INSTALL_DIR, or would it be considered bad practice?
This is difficult to answer. Generally, data files should be located in a shared location so that other tools can access them as well. However, there are potentially exceptions to this that are acceptable. The fact that the fasta data files are small and you are using a tool_dependencies.xml file to define a relationship to them for your tools is a good approach because it allows the data files to be used by other tools in separate repositories via a complex repository dependency definition in the remote repository.
If these fasta data files are available for download via a clone or a url, then in the near future the new Galaxy Data Manager (which uses a new, special category of Galaxy tools which are of type "data_manager") may be useful in this scenario. Data Manager tools can be associated with tools in a repository like yours using repository dependency definitions, so they will be installed along with the selected repository. These data manager tools allow for specified data to be installed into the Galaxy environment for use by tools. This new component is not yet released, but it is close. In the meantime, your approach is the only way to make this work.
If your files are not downloadable, then we might plan to allow simplified bootstrapping of .loc files in the tool-data directory with files included in the repository. This would take some planning, and its availability would not be in the short term.
Any news Greg? I see there is an empty page on the wiki here: http://wiki.galaxyproject.org/Admin/Tools/DataManagers And some actual content here: http://wiki.galaxyproject.org/Admin/Tools/DataManagers/HowTo/Define
2/ Is there a way to set up the tool-data/my_fancy_data_files.loc during the install? Here are the options I thought of: * shipping a “real” my_fancy_data_files.loc.sample with the good paths already set up, which is going to be copied as the .loc file (a rather ugly hack)
Assuming you use a file name that is not already in the Galaxy tool-data subdirectory, the above approach is probably the only way you can do this in a fully automated way right now. Again, when the new Data Manager is released, it will handle this kind of automated configuration. But in the meantime, manual intervention is generally required to add the information to the appropriate .loc files in the tool-data directory.
Is that still the case today?
* using more <action type="shell_command"> steps during install to create my_fancy_data_files.loc (but deploying this file is not part of the tool dependency install per se)
I advise against the above approach. The "best practice" use of tool dependency definitions is to restrict movement of files to locations within the defined $INSTALL_DIR (the installation directory of the tool dependency package) or $REPOSITORY_INSTALL_DIR (the installation directory of the repository), which is set at installation time. Hard-coding file paths in <action> tags is fragile, and not recommended.
* a variant of the previous: shipping my_fancy_data_files.loc as part of the tool distribution package, and copying it through shell_command (same concern as above).
The above approach is not recommended either - same issue as above.
I may not be following your recommendation - in a couple of tools I provide a functional working *.loc.sample file which is installed as the default *.loc file. I do this in both the Blast2GO and EffectiveT3 wrappers, but in both cases I've avoided the need for absolute paths (and the worry about where to put the files) and used relative paths (and put the files in $INSTALL_DIR). This works quite well: http://toolshed.g2.bx.psu.edu/view/peterjc/blast2go http://toolshed.g2.bx.psu.edu/view/peterjc/effectiveT3
However, for something like NCBI BLAST setting up some test databases via the <action> tags would be a bit more fiddly - although it could let me increase the tools' test coverage.
As an aside, I've asked before about why the functional tests look at *.loc rather than *.loc.sample and not had a clear answer. As soon as the local administrator edits the provided default *.loc files, this could break functional tests that rely on the *.loc.sample values. The simple fix is for the test framework to preferentially load the *.loc.sample file if present: http://lists.bx.psu.edu/pipermail/galaxy-dev/2013-April/014370.html http://lists.bx.psu.edu/pipermail/galaxy-dev/2013-August/016159.html
Regards, Peter
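(To illustrate the approach Peter describes, a *.loc.sample shipped with working entries and installed as the initial *.loc might contain something like the lines below. The entries are made up; the point is that the path column is kept relative, so no site-specific editing is required, and the wrapper resolves the path itself, e.g. against a location set up under $INSTALL_DIR at install time.)

# my_fancy_data_files.loc.sample - doubles as a working my_fancy_data_files.loc
# <value><TAB><display name><TAB><relative path resolved by the wrapper>
fancy_db_2013	Fancy database (2013 release)	data/fancy_db_2013.fasta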
Hi Peter and others, On Oct 8, 2013, at 10:22 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Hi Greg, Jean-Frédéric,
I'm returning to this old thread rather than starting a new one, since it is nicely aligned with something I wanted to raise.
On Tue, Feb 19, 2013 at 2:23 PM, Greg Von Kuster <greg@bx.psu.edu> wrote:
Hello Jean-Frédéric,
Sorry for the delay in this response. Please see my inline comments.
On Feb 8, 2013, at 10:33 AM, Jean-Frédéric Berthelot wrote:
Hi list,
The tool I am currently wrapping has built-in data, which may be used by the tool users (through a relevant <from_data_table> + .LOC file configuration). They are .fasta databases which are rather small and are thus bundled in the tool distribution package.
Thanks to the tool_dependencies.xml file, said distribution package is downloaded at install time, code is compiled, and since they are here, the data files are copied to $INSTALL_DIR too, ready to be used.
After that, the user still has to edit tool-data/my_fancy_data_files.loc ; but the thing is, during the install I know where these data files are (since I copied those there), so I would like to save the user the trouble and set up this file automagically.
I would have two questions:
1/ Is it okay to have tool built-in data files in $INSTALL_DIR, or would it be considered bad practice?
This is difficult to answer. Generally, data files should be located in a shared location so that other tools can access them as well. However, there are potentially exceptions to this that are acceptable. The fact that the fasta data files are small and you are using a tool_dependencies.xml file to define a relationship to them for your tools is a good approach because it allows the data files to be used by other tools in separate repositories via a complex repository dependency definition in the remote repository.
If these fasta data files are available for download via a clone or a url, then in the near future the new Galaxy Data Manager (which uses a new, special category of Galaxy tools which are of type "data_manager") may be useful in this scenario. Data Manager tools can be associated with tools in a repository like yours using repository dependency definitions, so they will be installed along with the selected repository. These data manager tools allow for specified data to be installed into the Galaxy environment for use by tools. This new component is not yet released, but it is close. In the meantime, your approach is the only way to make this work.
If your files are not downloadable, then we might plan to allow simplified bootstrapping of .loc files in the tol-data directory with files included in the repository. This would take some planning, and it's availability would not be in the short term
Any news Greg? I see there is an empty page on the wiki here: http://wiki.galaxyproject.org/Admin/Tools/DataManagers
And some actual content here: http://wiki.galaxyproject.org/Admin/Tools/DataManagers/HowTo/Define
Dan Blankenberg has completed the initial implementation of the Data Manager tools and will be creating the documentation at some point.
2/ Is there a way to set up the tool-data/my_fancy_data_files.loc during the install? Here are the options I thought of: * shipping a “real” my_fancy_data_files.loc.sample with the good paths already set up, which is going to be copied as the .loc file (a rather ugly hack)
Assuming you use a file name that is not already in the Galaxy tool-data subdirectory, the above approach is probably the only way you can do this in a fully automated way right now. Again, when the new Data Manager is released, it will handle this kind of automated configuration. But in the meantime, manual intervention is generally required to add the information to the appropriate .loc files in the tool-data directory.
Is that still the case today?
Dan will be able to provide the ideal answer to this question.
* using more <action type="shell_command"> steps during install to create my_fancy_data_files.loc (but deploying this file is not part of the tool dependency install per se)
I advise against the above approach. The "best practice" use of tool dependency definitions is to restrict movement of files to locations within the defined $INSTALL_DIR (the installation directory of the tool dependency package) or $REPOSITORY_INSTALL_DIR (the installation directory of the repository), which is set at installation time. Hard-coding file paths in <action> tags is fragile, and not recommended.
* a variant of the previous: shipping my_fancy_data_files.loc as part of the tool distribution package, and copying it through shell_command (same concern as above).
The above approach is not recommended either - same issue as above.
I may not be following your recommendation - in a couple of tools I provide a functional working *.loc.sample file which is installed as the default *.loc file.
I do this in both the Blast2GO and EffectiveT3 wrappers, but in both cases I've avoided the need for absolute paths (and the worry about where to put the files) and used relative paths (and put the files in $INSTALL_DIR). This works quite well:
http://toolshed.g2.bx.psu.edu/view/peterjc/blast2go http://toolshed.g2.bx.psu.edu/view/peterjc/effectiveT3
I believe your approach is correct and follows the "best practice" described in this tool shed wiki section: http://wiki.galaxyproject.org/InstallingRepositoriesToGalaxy#Installing_Gala...
Specifically, the following paragraphs:
==================
Tool shed repositories that contain tools that include dynamically generated select list parameters that refer to an entry in the tool_data_table_conf.xml file must contain a tool_data_table_conf.xml.sample file that contains the required entry for each dynamic parameter. Similarly, any index files (i.e., ~/tool-data/xxx.loc files) to which the tool_data_table_conf.xml file entries refer must be defined in xxx.loc.sample files included in the tool shed repository along with the tools.
If any of these tool_data_table_conf.xml entries or any of the required xxx.loc.sample files are missing from the tool shed repository, the tools will not properly load and metadata will not be generated for the repository. This means that the tools cannot be automatically installed into a Galaxy instance.
For those tools that include dynamically generated select list parameters that require a missing entry in the tool_data_table_conf.xml file, this file will be modified in real time by adding the entry from a tool_data_table_conf.xml.sample file contained in the tool shed repository.
==================
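(To make the quoted requirement concrete: the repository would ship a tool_data_table_conf.xml.sample entry together with the matching xxx.loc.sample, and the tool XML refers to the table by name. The table and column names below are illustrative, not taken from any particular repository.)

A tool_data_table_conf.xml.sample entry:

<tables>
    <table name="my_fancy_data" comment_char="#">
        <columns>value, name, path</columns>
        <file path="tool-data/my_fancy_data_files.loc" />
    </table>
</tables>

and the dynamically generated select list in the tool XML:

<param name="database" type="select" label="Built-in database">
    <options from_data_table="my_fancy_data" />
</param>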
However, for something like NCBI BLAST setting up some test databases via the <action> tags would be a bit more fiddly - although it could let me increase the tools' test coverage.
This may be fine, although I'm not quite clear on what you would be doing here.
As an aside, I've asked before about why the functional tests look at *.loc rather than *.loc.sample and not had a clear answer.
The functional tests look at .loc files because they will have uncommented, functionally correct entries. The .loc.sample files usually have commented "sample" entries that provide an idea to the Galaxy admin as to what should actually go into the associated .loc file. For example, twobit.loc.sample has:
#droPer1 /depot/data2/galaxy/droPer1/droPer1.2bit
#apiMel2 /depot/data2/galaxy/apiMel2/apiMel2.2bit
#droAna1 /depot/data2/galaxy/droAna1/droAna1.2bit
#droAna2 /depot/data2/galaxy/droAna2/droAna2.2bit
while twobit.loc has:
droPer1 /depot/data2/galaxy/droPer1/droPer1.2bit
apiMel2 /depot/data2/galaxy/apiMel2/apiMel2.2bit
droAna1 /depot/data2/galaxy/droAna1/droAna1.2bit
droAna2 /depot/data2/galaxy/droAna2/droAna2.2bit
As soon as the local administrator edits the provided default *.loc files, this could break functional tests using the *.loc.sample values.
The intent is that the local administrator manually edits the .loc file to include the functionally correct entries based on entries in the .loc.sample file.
The simple fix is for the test framework to preferentially load the *.loc.sample file if present:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2013-April/014370.html http://lists.bx.psu.edu/pipermail/galaxy-dev/2013-August/016159.html
I don't agree with this - the sample files should be used as guidance for the admin to create functionally correct .loc files. This is the same approach used for all Galaxy .sample files (e.g., universe_wsgi.ini.sample <-> universe_wsgi.ini, etc.).
Regards,
Peter
On Tue, Oct 8, 2013 at 3:47 PM, Greg Von Kuster <greg@bx.psu.edu> wrote:
Hi Peter and others,
Peter wrote:
As an aside, I've asked before about why the functional tests look at *.loc rather than *.loc.sample and not had a clear answer.
The functional tests look at .loc files because they will have uncommented, functionally correct entries. The .loc.sample files usually have commented "sample" entries that provide an idea to the Galaxy admin as to what should actually go into the associated .loc file. For example, twobit.loc.sample has:
#droPer1 /depot/data2/galaxy/droPer1/droPer1.2bit
#apiMel2 /depot/data2/galaxy/apiMel2/apiMel2.2bit
#droAna1 /depot/data2/galaxy/droAna1/droAna1.2bit
#droAna2 /depot/data2/galaxy/droAna2/droAna2.2bit
while twobit.loc has:
droPer1 /depot/data2/galaxy/droPer1/droPer1.2bit
apiMel2 /depot/data2/galaxy/apiMel2/apiMel2.2bit
droAna1 /depot/data2/galaxy/droAna1/droAna1.2bit
droAna2 /depot/data2/galaxy/droAna2/droAna2.2bit
It depends on the tool - some example .loc.sample files already contain real working entries. In this case, if it would be useful for the twobit unit tests, why not provide twobit.loc with the uncommented lines? (Either way the Galaxy Admin will have to edit twobit.loc to suit the local setup anyway.)
As soon as the local administrator edits the provided default *.loc files, this could break functional tests using the *.loc.sample values.
The intent is that the local administrator manually edits the .loc file to include the functionally correct entries based on entries in the .loc.sample file.
The simple fix is for the test framework to preferentially load the *.loc.sample file if present:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2013-April/014370.html http://lists.bx.psu.edu/pipermail/galaxy-dev/2013-August/016159.html
I don't agree with this - the sample files should be used as guidance for the admin to create functionally correct .loc files. This is the same approach used for all Galaxy .sample files (e.g., universe_wsgi.ini.sample <-> universe_wsgi.ini, etc.).
Why does the tool_conf.xml.sample file get used by the test framework, then? This is a clear example of *.xml.sample being used in the test framework over the 'real' file *.xml.
I really don't understand this design choice - I would use tool_conf.xml (it lists the tools actually installed on our Galaxy, and therefore the things worth testing) while by default tool_conf.xml.sample includes a whole load of things where the binaries etc. are missing and so the tests will fail (hiding potential real failures in the noise). The quick fix is to edit tool_conf.xml.sample, but that can cause trouble with hg and system updates. (I appreciate that as more and more tools leave the core framework and migrate to the tool shed this is less important.)
--
Perhaps rather than overloading *.loc.sample with two roles (sample configuration/documentation and unit tests), we need to introduce *.loc.test for functional testing purposes? That still leaves open the question of how best to install the test databases or files that the *.loc.test file would point at for running functional tests.
Thanks, Peter
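(As a sketch of what is being proposed, a hypothetical my_fancy_data_files.loc.test could use the same tab-separated columns as the real .loc file but point at small bundled test data; everything below is illustrative.)

# my_fancy_data_files.loc.test - test-only entries read by the functional test framework
test_db	Tiny test database	test-data/tiny_test_db.fasta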
Hi Peter, On Oct 8, 2013, at 11:01 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Tue, Oct 8, 2013 at 3:47 PM, Greg Von Kuster <greg@bx.psu.edu> wrote:
Hi Peter and others,
Peter wrote:
As an aside, I've asked before about why the functional tests look at *.loc rather than *.loc.sample and not had a clear answer.
The functional tests look at .loc files because they will have uncommented, functionally correct entries. The .loc.sample files usually have commented "sample" entries that provide an idea to the Galaxy admin as to what should actually go into the associated .loc file. For example, twobit.loc.sample has:
#droPer1 /depot/data2/galaxy/droPer1/droPer1.2bit
#apiMel2 /depot/data2/galaxy/apiMel2/apiMel2.2bit
#droAna1 /depot/data2/galaxy/droAna1/droAna1.2bit
#droAna2 /depot/data2/galaxy/droAna2/droAna2.2bit
while twobit.loc has:
droPer1 /depot/data2/galaxy/droPer1/droPer1.2bit
apiMel2 /depot/data2/galaxy/apiMel2/apiMel2.2bit
droAna1 /depot/data2/galaxy/droAna1/droAna1.2bit
droAna2 /depot/data2/galaxy/droAna2/droAna2.2bit
It depends on the tool - some example .loc.sample files already contain real working entries. In this case, if it would be useful for the twobit unit tests, why not provide twobit.loc with the uncommented lines?
The .loc files are looked at because the .loc.sample files are not required to have uncommented functional entries (although some obviously may have them).
(Either way the Galaxy Admin will have to edit twobit.loc to suit the local setup anyway.)
Yes
As soon as the local administrator edits the provided default *.loc files, this could break functional tests using the *.loc.sample values.
The intent is that the local administrator manually edits the .loc file to include the functionally correct entries based on entries in the .loc.sample file.
The simple fix is for the test framework to preferentially load the *.loc.sample file if present:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2013-April/014370.html http://lists.bx.psu.edu/pipermail/galaxy-dev/2013-August/016159.html
I don't agree with this - the sample files should be used as guidance for the admin to create functionally correct .loc files. This is the same approach used for all Galaxy .sample files (e.g., universe_wsgi.ini.sample <-> universe_wsgi.ini, etc.).
Why does the tool_conf.xml.sample file get used by the test framework, then? This is a clear example of *.xml.sample being used in the test framework over the 'real' file *.xml.
I really don't understand this design choice - I would use tool_conf.xml (it lists the tools actually installed on our Galaxy, and therefore the things worth testing) while by default tool_conf.xml.sample includes a whole load of things where the binaries etc are missing and so the tests will fail (hiding potential real failures in the noise).
I'm not quite sure of the reason for this as I didn't make this design choice - I'm sure "ancient Galaxy history" plays a role in this decision.
The quick fix is to edit tool_conf.xml.sample but that can cause trouble with hg and system updates.
(I appreciate that as more and more tools leave the core framework and migrate to the tool shed this is less important.)
Yes, this is true.
--
Perhaps rather than overloading *.loc.sample with two roles (sample configuration/documentation and unit tests), we need to introduce *.loc.test for functional testing purposes?
I'm hoping we don't have to go this route as we have so many priorities. If you would like this implemented though, please add a new Trello card and we'll consider it.
That still leaves open the question of how best to install the test databases or files that the *.loc.test file would point at for running functional tests.
Yes!
Thanks,
Peter
On Tue, Oct 8, 2013 at 4:13 PM, Greg Von Kuster <greg@bx.psu.edu> wrote:
I don't agree with this - the sample files should be used as guidance for the admin to create functionally correct .loc files. This is the same approach used for all Galaxy .sample files (e.g., universe_wsgi.ini.sample <-> universe_wsgi.ini, etc.).
Why does the tool_conf.xml.sample file get used by the test framework, then? This is a clear example of *.xml.sample being used in the test framework over the 'real' file *.xml.
I really don't understand this design choice - I would use tool_conf.xml (it lists the tools actually installed on our Galaxy, and therefore the things worth testing) while by default tool_conf.xml.sample includes a whole load of things where the binaries etc are missing and so the tests will fail (hiding potential real failures in the noise).
I'm not quite sure of the reason for this as I didn't make this design choice - I'm sure "ancient Galaxy history" plays a role in this decision.
Probably ;)
Perhaps rather than overloading *.loc.sample with two roles (sample configuration/documentation and unit tests), we need to introduce *.loc.test for functional testing purposes?
I'm hoping we don't have to go this route as we have so many priorities. If you would like this implemented though, please add a new Trello card and we'll consider it.
Filed: https://trello.com/c/P90b5Pa0/1165-functional-tests-need-separate-loc-files-...
That still leaves open the question of how best to install the test databases or files that the *.loc.test file would point at for running functional tests.
Yes!
I look forward to some more details from Dan on *.loc file setup. Thank you, Peter
I look forward to some more details from Dan on *.loc file setup.
Hi Peter, Dan and all, What a timely discussion! I am just in the process of setting up loc files for some new indexes I have created (bowtie2, etc), and would really like to see this automated. I see there is a Galaxy script scripts/loc_files/create_all_fasta_loc.py, which is quite sophisticated, and does this job nicely for all_fasta.loc. I'm feeling an urge to somehow extend this script to cope with other datatypes besides fasta, but am wondering if this will be wasted effort if there will soon be a better way to handle this. Can Dan or anyone else comment on this? cheers, Simon
Hi all,
I think what we have are two similar, but somewhat separate problems:
1.) We need a way via the UI for an admin to be able to add additional configuration entries to data tables / .loc files.
2.) We need a way to bootstrap/initialize a Galaxy installation with data table / .loc file entries ('built-in data') during installation for:
a.) a 'production' Galaxy instance - this would include local dev/testing/etc. instances
b.) the automated testing framework - tests should run fast, but meaningfully test a tool; e.g., the horse mitochondrial genome could be a fine built-in genome for running automated tool tests, but not desired to be automatically installed into a production Galaxy instance
For 1.), we now have Data Managers. A Data Manager will do all the heavy lifting of adding additional data table entries, e.g. for bwa, it can build the mapping indexes and add the properly delimited line to the .loc file. These are accessed through the admin interface, under Manage local data. Data Managers are installed from a ToolShed, or can be installed manually. In addition to direct interactive usage, Data Manager tools can be included in workflows or accessed via the tools API. Not only does the use of a Data Manager remove the technical burdens/concerns of adding new entries to a data table / .loc file, it also provides for the same reproducibility and provenance tracking that is afforded to regular Galaxy tools.
The documentation for Data Managers is currently limited to the tutorial-style doc here: http://wiki.galaxyproject.org/Admin/Tools/DataManagers/HowTo/Define; a more formal / config-syntax type of page will also be made available, although the tutorial is a pretty inclusive description of the steps needed to define a Data Manager.
For 2.): bootstrapping data during an installation process is something that still needs to be more completely spec'd out and implemented. This bootstrapping process should be able to make use of the Data Managers or download/move/utilize pre-built configurations. (A Data Manager itself can have a downloading process as its underlying action, e.g. the fetch genomes data manager.)
Let's start by considering the users' point of view. We have two types of users, GalaxyAdmin and ToolDev, and use a BWA tool as an example.
GalaxyAdmin: Clicks buttons to install a tool suite that includes the BWA tool and a BWA indexer Data Manager (so far there is no change from how it works now). The Galaxy installer methodology recognizes that it is possible to add built-in data: some preassembled mapping indexes are available (pre-built built-in), and mapping indexes can be created for any entry in the all_fasta data table. The user clicks checkboxes/multiple selects for preassembled data to download and also selects the fasta entries to be indexed with the Data Manager tool.
ToolDev: In the ToolShed repository, needs to provide a description of a and b; for simplicity we can assume b is a subset of a, but with a different attribute/flag (e.g. test_only, 'real', both) or perhaps a different filename; abstractly, they are the same thing, just run at different times, with the testing ones not requiring user interaction/selection.
So, the real question becomes: what does this description look like? It is probably an XML file; for now let's call it '__data_table_bootstrap__.xml' (alternatively, we can roll it directly into the existing data_manager_conf.xml files in the toolshed, although for a list of static downloads, we don't need an actual data manager tool).
It could look something like this (quick and dirty pass, elements and values are made up):

<data_table name="bwa_index" production="True" testing="False"> <!-- both would default to True -->
    <data_manager id="bwa_indexer">
        <param name="list_of_fasta_files_from_all_fasta"/> <!-- corresponds to the all_fasta parameter value in the data manager, cycles over each of the fasta values to provide selections -->
    </data_manager>
    <download code="script_for_prebuilt.py"> <!-- could be a static list or some dynamically determined listing -->
        <available method="get_list" /> <!-- returns sets of parameter values for available data to download -->
        <fetch method="get_indexes" /> <!-- takes the values selected from the above available, returns list of URI source and relative target -->
    </download>
    <download> <!-- could be a static list or some dynamically determined listing -->
        <entry>
            <field name="dbkey" value="hg19"/>
            <field name="description" value="Human - hg19"/>
            <files>
                <file source="http://file1" target="relative/path/to/file1"/>
                <file source="http://file2" target="relative/path/to/file2"/>
            </files>
        </entry>
    </download>
</data_table>

Any thoughts?
Thanks, Dan
On Oct 8, 2013, at 4:26 PM, Guest, Simon wrote:
I look forward to some more details from Dan on *.loc file setup.
Hi Peter, Dan and all,
What a timely discussion! I am just in the process of setting up loc files for some new indexes I have created (bowtie2, etc), and would really like to see this automated.
I see there is a Galaxy script scripts/loc_files/create_all_fasta_loc.py, which is quite sophisticated, and does this job nicely for all_fasta.loc. I'm feeling an urge to somehow extend this script to cope with other datatypes besides fasta, but am wondering if this will be wasted effort if there will soon be a better way to handle this.
Can Dan or anyone else comment on this?
cheers, Simon
On Oct 8, 2013, at 11:25 AM, Peter Cock wrote:
On Tue, Oct 8, 2013 at 4:13 PM, Greg Von Kuster <greg@bx.psu.edu> wrote:
I don't agree with this - the sample files should be used as guidance for the admin to create functionally correct .loc files. This is the same approach used for all Galaxy .sample files (e.g., universe_wsgi.ini.sample <-> universe_wsgi.ini, etc.).
Why does the tool_conf.xml.sample file get used by the test framework, then? This is a clear example of *.xml.sample being used in the test framework over the 'real' file *.xml.
I really don't understand this design choice - I would use tool_conf.xml (it lists the tools actually installed on our Galaxy, and therefore the things worth testing) while by default tool_conf.xml.sample includes a whole load of things where the binaries etc are missing and so the tests will fail (hiding potential real failures in the noise).
I'm not quite sure of the reason for this as I didn't make this design choice - I'm sure "ancient Galaxy history" plays a role in this decision.
Probably ;)
Perhaps rather than overloading *.loc.sample with two roles (sample configuration/documentation and unit tests), we need to introduce *.loc.test for functional testing purposes?
I'm hoping we don't have to go this route as we have so many priorities. If you would like this implemented though, please add a new Trello card and we'll consider it.
Filed: https://trello.com/c/P90b5Pa0/1165-functional-tests-need-separate-loc-files-...
That still leaves open the question of how best to install the test databases or files that the *.loc.test file would point at for running functional tests.
Yes!
I look forward to some more details from Dan on *.loc file setup.
Thank you,
Peter
Hi Dan, On Tue, Oct 15, 2013 at 7:40 PM, Daniel Blankenberg <dan@bx.psu.edu> wrote:
Hi all,
I think what we have are two similar, but somewhat separate problems: 1.) We need a way via the UI for an admin to be able to add additional configuration entries to data tables / .loc files.
For 1.), we now have Data Managers. A Data Manager will do all the heavy lifting of adding additional data table entries. e.g. for bwa, it can build the mapping indexes and add the properly delimited line to the .loc file. These are accessed through the admin interface, under Manage local data. Data Managers are installed from a ToolShed, or can be installed manually. In addition to direct interactive usage, Data Manager tools can be included in workflows or accessed via the tools API. Not only does the use of a Data Manager remove the technical burdens/ concerns of adding new entries to a data table / .loc file, it also provides for the same reproducibility and provenance tracking that is afforded to regular Galaxy tools.
You said there that Data Managers can be used within a workflow. I don't quite follow - aren't the Data Managers restricted to administrators only?
If you don't mind me picking two specific examples of direct personal interest - which leads me to ask if there is a default Data Manager which just offers a web GUI for editing any *.loc file as a table?
--
Blast2GO - http://toolshed.g2.bx.psu.edu/view/peterjc/blast2go
This tool wrapper uses blast2go.loc which should list one or more Blast2GO *.properties files. These can in principle be used for advanced things like changing evidence weighting codes etc. However, the primary point is to point to different Blast2GO databases. There have been a series of (date stamped) public (free) Blast2GO databases, and my tool installation script already sets up the *.properties files for the most recent databases (which it uses for a unit test), which was your point 2 (below). The local Galaxy administrator may need to add extra entries to the blast2go.loc file, for instance when there is a new public database release, or if they set up a local database (recommended). This seems to be an easy case (since there is little that we can automate). A simple interface for adding lines to the *.loc files would be enough, assuming it includes a file select browser.
--
BLAST+ - http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus/
This uses blastdb.loc (nucleotides), blastdb_p.loc (proteins) etc. A simple interface for adding lines to the *.loc files would be useful, although the oddities of BLAST database naming might need a little code on top of a plain file select browser (the database name is the file path stem without the *.nal, *.pal, etc. extension). There is potential for offering to automatically create databases from the all_fasta data table you mention below?
The documentation for Data Managers is currently limited to the tutorial-style doc here: http://wiki.galaxyproject.org/Admin/Tools/DataManagers/HowTo/Define; a more formal / config syntax type of page will also be made available, although the tutorial is a pretty inclusive description of the steps needed to define a Data Manager.
Could I suggest you add that information (paraphrase what you just said in this email) to the main page: http://wiki.galaxyproject.org/Admin/Tools/DataManagers I think that would help.
2.) We need a way to bootstrap/initialize a Galaxy installation with data table/ .loc file entries ('built-in data') during installation for a.) a 'production' Galaxy instance - this would include local dev/testing/etc instances b.) automated testing framework - tests should run fast, but meaningfully test a tool, e.g., the horse mitochondrial genome could be a fine built-in genome for running automated tool tests, but not desired to be automatically installed into a production Galaxy instance
For 2.): bootstrapping data during an installation process is something that still needs to be more completely spec'd out and implemented. ...
OK, so the Data Manager work does not yet cover bootstrapping (installing data as part of tool installation from the tool shed etc). Regarding 2(b), Greg and I talked about this earlier in the thread and I filed Trello Card 1165 on a related issue: https://trello.com/c/P90b5Pa0/1165-functional-tests-need-separate-loc-files-... Thanks, Peter
Hi Peter, Please see replies inline, below. Thanks, Dan On Oct 17, 2013, at 5:36 AM, Peter Cock wrote:
Hi Dan,
On Tue, Oct 15, 2013 at 7:40 PM, Daniel Blankenberg <dan@bx.psu.edu> wrote:
Hi all,
I think what we have are two similar, but somewhat separate problems: 1.) We need a way via the UI for an admin to be able to add additional configuration entries to data tables / .loc files.
For 1.), we now have Data Managers. A Data Manager will do all the heavy lifting of adding additional data table entries. e.g. for bwa, it can build the mapping indexes and add the properly delimited line to the .loc file. These are accessed through the admin interface, under Manage local data. Data Managers are installed from a ToolShed, or can be installed manually. In addition to direct interactive usage, Data Manager tools can be included in workflows or accessed via the tools API. Not only does the use of a Data Manager remove the technical burdens/ concerns of adding new entries to a data table / .loc file, it also provides for the same reproducibility and provenance tracking that is afforded to regular Galaxy tools.
You said there that Data Managers can be used within a workflow. I don't quite follow - aren't the Data Managers restricted to administrators only?
This is correct. Admins can run workflows containing Data Managers, while standard users cannot. Additionally, the selection list for any installed Data Managers will only appear within the workflow editor for an admin.
If you don't mind me picking two specific examples of direct personal interest - which leads me to ask if there is a default Data Manager which just offers a web GUI for editing any *.loc file as a table?
Something like this for adding entries could be done now, although currently existing entries cannot be modified or removed by using Data Managers. There is not currently a generic Data Manager written that will do this though. On my list of things to do is to write a Data Manager that would generically make use of our datacache rsync server, but there is not an ETA for this. Another one, or the same one, could also make use of S3, which would be particularly useful for Cloud instances.
--
Blast2GO - http://toolshed.g2.bx.psu.edu/view/peterjc/blast2go This tool wrapper uses blast2go.loc which should list one or more Blast2GO *.properties files. These can in principle be used for advanced things like changing evidence weighting codes etc. However, the primary point is to point to different Blast2GO databases.
There have been a series of (date stamped) public (free) Blast2GO databases, and my tool installation script already sets up the *.properties files for the most recent databases (which it uses for a unit test), which was your point 2 (below).
The local Galaxy administrator may need to add extra entries to the blast2go.loc file, for instance when there is a new public database release, or if they setup a local database (recommended).
This seems to be an easy case (since there is little that we can automate). A simple interface for adding lines to the *.loc files would be enough, assuming it includes a file select browser.
In this case, you could define a blast2go Data Manager that would allow the selection of the external public (free) Blast2GO database that the user wants. A code file could be used to populate this list dynamically from the external server's contents until a more generalized way of doing so is made available to tool parameters. The underlying Data Manager tool would then retrieve the database and return a JSON description of the fields to add to the data table .loc file.
This same Data Manager could be allowed to add a file locally from a server's filesystem. We don't have a filesystem select widget for tools yet, but you could use a textbox for manual entry or use a select list/drill down with dynamic code for this. A ServerFileToolParameter could be defined to list server contents directly, but we would want to make sure that ordinary tool devs are aware of it being a bit of a security risk, depending upon how it is used (we usually don't want ordinary users selecting random files off of the filesystem in normal tools).
It may be worthwhile to have a look at the Reference Genome / all_fasta data manager (http://testtoolshed.g2.bx.psu.edu/view/blankenberg/data_manager_fetch_genome...), which can grab reference genome FASTAs from UCSC, NCBI, a URL, a Galaxy History, or a Directory on the server (copy or symlink) and then populates the all_fasta table.
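(For anyone following along: the JSON that a Data Manager tool writes out to describe new entries looks roughly like the example below - here with a hypothetical blast2go data table and made-up field values; the exact fields depend on the columns defined for the data table.)

{
    "data_tables": {
        "blast2go": [
            {
                "value": "b2g_sep2013",
                "name": "Blast2GO public database (Sep 2013)",
                "path": "/path/written/by/the/data/manager/b2g_sep2013.properties"
            }
        ]
    }
}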
--
BLAST+ - http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus/ This uses blastdb.loc (nucleotides), blastdb_p.loc (proteins) etc. A simple interface for adding lines to the *.loc files would be useful, although the oddities of BLAST database naming might need a little code on top of a plain file select browser (the database name is the file path stem without the *.nal, *.pal, etc. extension).
There is potential for offering to automatically create databases from this all_fasta data table you mention below?
The BWA index data manager (http://testtoolshed.g2.bx.psu.edu/view/blankenberg/data_manager_bwa_index_bu...) uses the genomes available under all_fasta for building the mapping indexes.
The documentation for Data Managers is currently limited to the tutorial-style doc here: http://wiki.galaxyproject.org/Admin/Tools/DataManagers/HowTo/Define; a more formal / config syntax type of page will also be made available, although the tutorial is a pretty inclusive description of the steps needed to define a Data Manager.
Could I suggest you add that information (paraphrase what you just said in this email) to the main page:
http://wiki.galaxyproject.org/Admin/Tools/DataManagers
I think that would help.
Great suggestion, I'll add a bit of this and link to this discussion.
2.) We need a way to bootstrap/initialize a Galaxy installation with data table/ .loc file entries ('built-in data') during installation for a.) a 'production' Galaxy instance - this would include local dev/testing/etc instances b.) automated testing framework - tests should run fast, but meaningfully test a tool, e.g., the horse mitochondrial genome could be a fine built-in genome for running automated tool tests, but not desired to be automatically installed into a production Galaxy instance
For 2.): bootstrapping data during an installation process is something that still needs to be more completely spec'd out and implemented. ...
OK, so the Data Manager work does not yet cover bootstrapping (installing data as part of tool installation from the tool shed etc).
Regarding 2(b), Greg and I talked about this earlier in the thread and I filed Trello Card 1165 on a related issue: https://trello.com/c/P90b5Pa0/1165-functional-tests-need-separate-loc-files-...
This is a very important feature, especially for the automated testing framework. I'll add a comment to the card referencing this thread. If anyone wants to help working out the XML spec, I think that would be a great help -- IMHO, defining a well-thought out, solid, flexible XML description is probably harder than the actual implementation.
Thanks,
Peter
participants (5):
- Daniel Blankenberg
- Greg Von Kuster
- Guest, Simon
- Jean-Frédéric Berthelot
- Peter Cock