It is easiest to generate tools for galaxy when the
applications or scripts can take arbitrarily named input
files and generate output to given path names.
Input and output directories are very convenient on
the command line, but more of a challenge when crafting a
galaxy tool.
That said, many applications require a wrapper script to
work within galaxy.
Thank you for the consistent script_info[] help/usage syntax
in the qiime scripts, which enabled me to generate a
skeleton galaxy tool_config file for each qiime script.
I had some time last spring to work on integrating qiime
into galaxy.
Unfortunately, I haven't had any time since to work on this.
I put those partial results on the Galaxy Tool Shed:
http://toolshed.g2.bx.psu.edu/
There's a continuing effort at George Mason University to
incorporate qiime into galaxy tools, so you may want to ask
them what they need.
I started by generating galaxy tool_config files, e.g.
align_seqs.xml, by using python to get the script_info[]
from the qiime script:
$ cat generate_tool_config.bash
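#!/usr/bin/env bash
# Usage: generate_tool_config.bash <qiime_script.py>
# Save whatever the script prints when run without arguments.
python "$1" > "${1%.*}.help"
# python -i stays interactive after running the script with -h, so the
# filled-in template is read from stdin with script_info[] still defined.
sed "s/__TOOL_BINARY__/${1}/" tool_template.txt |
python -i "$1" -h > "${1%.*}.log"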
(I'll attach tool_template.txt )
This generated skeleton tool_config .xml files that I could
then edit as needed.
(
http://wiki.g2.bx.psu.edu/Admin/Tools/Tool%20Config%20Syntax
)
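As a rough illustration of the idea (not the attached tool_template.txt), here is a minimal Python 3 sketch that turns a script_info-style dictionary into a skeleton tool_config. The sample script_info contents, the GALAXY_TYPES mapping and the skeleton_tool_config function are all assumptions made up for this example; real qiime scripts may use custom option types that plain optparse does not accept.

```python
from optparse import NO_DEFAULT, make_option
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the script_info dictionary of a qiime script.
script_info = {
    "brief_description": "Align sequences",
    "required_options": [
        make_option("-i", "--input_fasta_fp", type="string",
                    help="path to the input fasta file"),
    ],
    "optional_options": [
        make_option("-e", "--min_length", type="int", default=150,
                    help="minimum sequence length"),
    ],
}

# Assumed mapping from optparse option types to galaxy param types.
GALAXY_TYPES = {"string": "text", "int": "integer", "float": "float"}


def skeleton_tool_config(tool_id, info):
    """Return a skeleton tool_config XML string built from info."""
    tool = ET.Element("tool", id=tool_id, name=tool_id)
    ET.SubElement(tool, "description").text = info.get("brief_description", "")
    inputs = ET.SubElement(tool, "inputs")
    for key, optional in (("required_options", False),
                          ("optional_options", True)):
        for opt in info.get(key, []):
            attrs = {
                "name": opt.dest,
                "type": GALAXY_TYPES.get(opt.type, "text"),
                "label": opt.help or opt.dest,
            }
            if optional:
                attrs["optional"] = "true"
            # Only emit a value when the option declares a default.
            if opt.default is not NO_DEFAULT:
                attrs["value"] = str(opt.default)
            ET.SubElement(inputs, "param", **attrs)
    # Outputs are left empty; they have to be filled in by hand.
    ET.SubElement(tool, "outputs")
    return ET.tostring(tool, encoding="unicode")
```

Calling skeleton_tool_config("align_seqs", script_info) gives an XML string that still needs hand editing, e.g. changing file inputs from type "text" to type "data".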
I originally was calling all qiime scripts from a tool
wrapper: qiime_wrapper.py
But, if a script can be called with any input filepaths and
write its results to any filepaths, and only writes to
STDERR when it fails, then you could call that script
directly.
When should you use a tool_wrapper or call the qiime script
directly?
Many of the qiime scripts could probably be called
directly, especially those that accept arbitrary
input/output file pathnames.
A tool wrapper is warranted when input/output needs to be
manipulated, moved, or renamed before the qiime script can
use it.
You'll also need a tool wrapper if the names or number of
the output files cannot be determined from the parameter
settings.
(
http://wiki.g2.bx.psu.edu/Admin/Tools/Multiple%20Output%20Files
)
If your tool relies on a file extension to determine a format,
you'll have to rename the input.
( Galaxy dataset pathnames will look something like:
/<your_galaxy_file_path>/072/dataset_72931.dat )
The format/type of a dataset is stored in its metadata, so
the tool_config can use that information, especially if a
script can take multiple alternative input formats.
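For the renaming case, one option is to symlink the dataset into the working directory under a name with the extension the tool expects, rather than copying it. This is a minimal sketch, assuming Python 3 on a filesystem that supports symlinks; link_with_extension is a hypothetical helper, not part of qiime_wrapper.py.

```python
import os


def link_with_extension(dataset_path, ext, work_dir):
    """Symlink dataset_path into work_dir under a name ending in .ext."""
    # dataset_72931.dat -> dataset_72931.<ext>
    base = os.path.splitext(os.path.basename(dataset_path))[0]
    link_path = os.path.join(work_dir, "%s.%s" % (base, ext))
    os.symlink(os.path.abspath(dataset_path), link_path)
    return link_path
```

The wrapper would then pass the returned path to the qiime script in place of the original dataset path.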
A tool_wrapper can also be used to manage the stdout or
stderr from a tool. Galaxy currently interprets any output
on stderr as a failure.
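A wrapper can work around the stderr rule by capturing both streams and writing to stderr only when the exit code is non-zero. This is a minimal sketch, assuming Python 3.5 or later; run_quietly is a hypothetical name, not the function used in qiime_wrapper.py.

```python
import subprocess
import sys


def run_quietly(cmd):
    """Run cmd and forward its stderr only if it exits non-zero."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, universal_newlines=True)
    sys.stdout.write(proc.stdout)
    if proc.returncode != 0:
        # A real failure: let galaxy see the error text on stderr.
        sys.stderr.write(proc.stderr)
    else:
        # Warnings from a successful run go to stdout so that galaxy
        # does not mark the job as failed.
        sys.stdout.write(proc.stderr)
    return proc.returncode
```

The wrapper's main would end with sys.exit(run_quietly(cmd)) so the exit code of the wrapped script is preserved.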
A couple of changes in galaxy should make some things easier
than when I first attempted this:
- galaxy now accepts dataset requests with sub
directories. (
https://bitbucket.org/galaxy/galaxy-central/issue/494/support-sub-dirs-in-extra_files_path-patch
)
That means that output HTML files with links into sub
directories can be left intact, with the html copied to the
output dataset and the linked files to its
"extra_files_path".
- if you know the pathname of an output relative to the
working directory, galaxy can copy it automatically to the
output dataset using the from_work_dir attribute.
( see example in:
https://bitbucket.org/galaxy/galaxy-central/src/21b645303c02/tools/ngs_rna/tophat_wrapper.xml
)
Datatypes
You may want to create new datatypes to make it easier for
the user to correctly select inputs to a tool from previous
outputs.
For example, the qiime mapping file is a tabular file with
specific requirements. I put a 'qiimemapping' datatype in
lib/galaxy/datatypes/metagenomics.py and datatypes_conf.xml
so an input could generate a select list containing only
qiimemapping datasets rather than all tabular ones.
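To give a feel for what such a datatype might check, here is a standalone Python sketch of a content test for mapping files. It assumes that a qiime mapping file is tab separated and that its header line starts with "#SampleID"; looks_like_qiime_mapping is a hypothetical function, and this is not the code in metagenomics.py. In galaxy, a check like this would normally live in the datatype's sniff method.

```python
def looks_like_qiime_mapping(text):
    """Return True if text resembles a qiime mapping file."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Assumption: the header is the first line and starts with #SampleID.
    if not lines or not lines[0].startswith("#SampleID"):
        return False
    n_columns = len(lines[0].split("\t"))
    for ln in lines[1:]:
        if ln.startswith("#"):
            # Skip comment lines after the header.
            continue
        # Every data row must have as many fields as the header.
        if len(ln.split("\t")) != n_columns:
            return False
    return True
```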
Generating a configfile
You can generate configfiles in the galaxy tool_config
.xml file. The configfile is generated by the Cheetah
interpreter just as the commandline is.
see: alpha_rarefaction.xml
The qiime_wrapper.py was patterned after the
mothur_wrapper.py with some of the same wrapper params to
handle run time determined output (perhaps not needed):
--galaxy_datasets
a comma separated list of regex:output_dataset pairs; the
wrapper searches the working_dir and copies the file that
matches each regex to the output dataset.
If the exact pathname is known, use the
"from_work_dir" attribute instead.
--galaxy_datasetid
an output dataset id used to
dynamically create additional new datasets at job
termination
(
http://wiki.g2.bx.psu.edu/Admin/Tools/Multiple%20Output%20Files
"Number of Output datasets cannot be determined until tool
run")
--galaxy_new_datasets
a comma separated list of regex:datatype used to
dynamically create additional new datasets at job
termination
--galaxy_new_files_path
the galaxy dir for dynamically generated output
datasets
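The --galaxy_datasets handling described above can be sketched roughly as follows. This is an illustration in Python 3, not the actual qiime_wrapper.py code: parse_pairs and copy_matching_outputs are hypothetical names, and it assumes that neither the regexes nor the dataset paths contain commas and that dataset paths contain no colons.

```python
import os
import re
import shutil


def parse_pairs(spec):
    """Split 'regex:value,regex:value' into (compiled_regex, value) pairs."""
    pairs = []
    for item in spec.split(","):
        # Split on the last colon so the regex itself may contain colons.
        pattern, value = item.rsplit(":", 1)
        pairs.append((re.compile(pattern), value))
    return pairs


def copy_matching_outputs(spec, work_dir):
    """Copy the first file matching each regex to its output dataset path."""
    copied = {}
    names = sorted(os.listdir(work_dir))
    for regex, dataset_path in parse_pairs(spec):
        for name in names:
            # re.match anchors the pattern at the start of the file name.
            if regex.match(name):
                shutil.copyfile(os.path.join(work_dir, name), dataset_path)
                copied[dataset_path] = name
                break
    return copied
```

The returned dictionary shows which working-directory file ended up in which dataset, which is handy for logging; a regex with no match is simply absent from it.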