Hi All,
We would like to use the new GATK modules in our DNA pipeline, so I have
tried to run the tools from the "Analyse data" menu After setting up the
appropriate tables in gatk_sorted_picard_index.loc, I made it run as
expected. However when I tried to run it from a workflow, things didn't
turn out so well - in fact it couldn't run at all.
I know that the GATK tools are still in beta, so I looked into the tool
xml-wrappers and fixed the error after some poking around, but in the
process of debugging the wrapper, I realized one general thing that I
found quite odd.
It's about meta data in Galaxy's input data file representations.
When a tool needs a reference genome (i.e. some mapping tool, or one of
the bam analysis tools), there are always only two options when it comes
to the source of the reference genome: 1: Get the reference data from
history or 2: use built in (sometime referred to as cached). In any case
the user has to select the reference genome before running the tool.
This is fine, but what happens if two or more of these tools are called
in the same workflow?
Well - Then the workflow designer can 1: choose a design that allows the
workflow to work only on one genome by selecting the "built-in" option
and select the proper genome, and then define one workflow pr. genome -
i.e by cloning the workflow and change the parameters for each tool, or
2: Set the state of all genome selection fields to "set at runtime".
This implies of course, that the user running the workflow must go
through all genome selection fields on all tools, and select the proper
genome before the workflow is executed, OR 3:Set all genome selections
to history, and specify a common workflow input for the genome reference
reference input for all tools in the workflow - this approach could be
problematic though, since not all tools use the same reference file format.
Clearly none of these methods are ideal, when working with data form
several genomes. So there is a fourth option that I miss in the current
Galaxy tool implementations (actually I thought that this was what
"cached" meant, until I looked at the xml-files).
Namely the ability to get the genome reference file at runtime from the
input data file meta data. This implies that the tools should have an
extra reference source selection option: "From input meta data". This
would allow workflow designers to forget all about reference data, since
the tools automatically will pick the appropriate reference genome from
the input file's meta data.
In fact this is not so difficult to implement with the operations that
are currently available in the wrapper xml / command language.
In the <inputs> section of the tool XML-file, the genome reference tags
could look like this (the example is from the fixed GATK "Count
Covariates on BAM files" tool XML file "count_covariates.xml"):
.
.
.
<param name="input_bam" type="data" format="bam" label="BAM file">
<validator type="unspecified_build" />
<validator type="dataset_metadata_in_data_table"
table_name="gatk_picard_indexes"
metadata_name="dbkey"
metadata_column="2"
message="Sequences are not currently available for the
specified build." />
</param>
<conditional name="reference_source">
<param name="reference_source_selector" type="select" label="Choose the
source for the reference list">
<option value="meta_data">From input file meta data</option>
<option value="internal">Internal reference</option>
<option value="history">History</option>
</param>
<when value="internal">
<param name="ref_file" type="select" label="Select a reference genome">
<options from_data_table="gatk_picard_indexes">
<filter type="sort_by" column="2" />
<validator type="no_options" message="No indexes are available" />
</options>
</param>
</when>
<when value="history">
<param name="ref_file" type="data" format="fasta" label="Using reference
file" />
</when>
</conditional>
.
.
.
Note that there is no genome selection box on the GUI, if the 'From
input meta data' option is selected, since it wouldn't make much sense.
Then in the command section it could read something like this:
.
.
.
#if $reference_source.reference_source_selector == "internal":
-R "${reference_source.ref_file.fields.path}"
#end if
#if $reference_source.reference_source_selector == "meta_data":
-R "${ filter( lambda x: str( x[1] ) == str(
$input_bam.metadata.dbkey ),
$__app__.tool_data_tables['gatk_picard_indexes'].get_fields() )[0][3] }"
#end if
#if str( $reference_source.reference_source_selector ) == "history":
-d "-R" "${reference_source.ref_file}"
"${reference_source.ref_file.ext}" "gatk_input"
#end if
.
.
.
When "From input meta data" is selected as the reference source, the
second if-statement above performs a lookup at run-time, to check the
state of the meta-data and retrieve the file path to the corresponding
genome reference data. This will always work, since another tool can get
the filepath from it's own axillary reference table, in this case the
table is called 'gatk_picard_indexes', but other tools might use other
reference files.
This works fine in the GATK tools, and could be standard in other tools
as well. The meta data is already there, I just don't see it put to any
use anywhere, except for informational purposes - which, in my humble
opinion, is a pity.
The only problem I see, is that it could be tricky to display a
meaningful message, when the input data doesn't contain the necessary
meta data. Then it is left to the underlying tool to shout out about the
error. An elegant solution to this error-problem could be an extension
to the command scripting language, or perhaps someone has an other idea?
Hope this information was useful to some of you Galaxy tool nerds out
there :-)
Kind regards, and thanks for a great framework
- Frank
--
Frank Sørensen, B.Sc., Programmer
Molecular Diagnostic Laboratory (MDL)
Molekylær Medicinsk Afdeling (MOMA)
Århus Universitetshospital Skejby, Brendstrupgårdsvej, 8200 Århus N
Tlf. +45 7845 5363