Hi All,
We would like to use the new GATK modules in our DNA pipeline, so I
have tried to run the tools from the "Analyse data" menu After
setting up the appropriate tables in gatk_sorted_picard_index.loc, I
made it run as expected. However when I tried to run it from a
workflow, things didn't turn out so well - in fact it couldn't run
at all.
I know that the GATK tools are still in beta, so I looked into the
tool xml-wrappers and fixed the error after some poking around, but
in the process of debugging the wrapper, I realized one general
thing that
I found quite odd.
It's about meta data in Galaxy's input data file representations.
When a tool needs a reference genome (i.e. some mapping tool, or one
of the bam analysis tools), there are always only two options when
it comes to the source of the reference genome: 1: Get the reference
data from history or 2: use built in (sometime referred to as
cached). In any case the user has to select the reference genome
before running the tool. This is fine, but what happens if two or
more of these tools are called in the same workflow?
Well - Then the workflow designer can 1: choose a design that allows
the workflow to work only on one genome by selecting the "built-in"
option and select the proper genome, and then define one workflow
pr. genome - i.e by cloning the workflow and change the parameters
for each tool, or 2: Set the state of all genome selection fields to
"set at runtime". This implies of course, that the user running the
workflow must go through all genome selection fields on all tools,
and select the proper genome before the workflow is executed, OR
3:Set all genome selections to history, and specify a common
workflow input for the genome reference reference input for all
tools in the workflow - this approach could be problematic though,
since not all tools use the same reference file format.
Clearly none of these methods are ideal, when working with data form
several genomes. So there is a fourth option that I miss in the
current Galaxy tool implementations (actually I thought that this
was what "cached" meant, until I looked at the xml-files).
Namely the ability to get the genome reference file at runtime from
the input data file meta data. This implies that the tools should
have an extra reference source selection option: "From input meta
data". This would allow workflow designers to forget all about
reference data, since the tools automatically will pick the
appropriate reference genome from the input file's meta data.
In fact this is not so difficult to implement with the operations
that are currently available in the wrapper xml / command language.
In the <inputs> section of the tool XML-file, the genome
reference tags could look like this (the example is from the fixed
GATK "Count Covariates on BAM files" tool XML file
"count_covariates.xml"):
.
.
.
<param name="input_bam" type="data" format="bam" label="BAM
file">
<validator type="unspecified_build" />
<validator type="dataset_metadata_in_data_table"
table_name="gatk_picard_indexes"
metadata_name="dbkey"
metadata_column="2"
message="Sequences are not currently available
for the specified build." />
</param>
<conditional name="reference_source">
<param name="reference_source_selector" type="select"
label="Choose the source for the reference list">
<option value="meta_data">From input file meta
data</option>
<option value="internal">Internal
reference</option>
<option value="history">History</option>
</param>
<when value="internal">
<param name="ref_file" type="select" label="Select a
reference genome">
<options from_data_table="gatk_picard_indexes">
<filter type="sort_by" column="2" />
<validator type="no_options" message="No indexes
are available" />
</options>
</param>
</when>
<when value="history">
<param name="ref_file" type="data" format="fasta"
label="Using reference file" />
</when>
</conditional>
.
.
.
Note that there is no genome selection box on the GUI, if the 'From
input meta data' option is selected, since it wouldn't make much
sense.
Then in the command section it could read something like this:
.
.
.
#if
$reference_source.reference_source_selector == "internal":
-R "${reference_source.ref_file.fields.path}"
#end if
#if $reference_source.reference_source_selector ==
"meta_data":
-R "${ filter( lambda x: str( x[1] ) == str(
$input_bam.metadata.dbkey ),
$__app__.tool_data_tables['gatk_picard_indexes'].get_fields()
)[0][3] }"
#end if
#if str( $reference_source.reference_source_selector ) ==
"history":
-d "-R" "${reference_source.ref_file}"
"${reference_source.ref_file.ext}" "gatk_input"
#end if
.
.
.
When "From input meta data" is selected as the reference source, the
second if-statement above performs a lookup at run-time, to check
the state of the meta-data and retrieve the file path to the
corresponding genome reference data. This will always work, since
another tool can get the filepath from it's own axillary reference
table, in this case the table is called 'gatk_picard_indexes', but
other tools might use other reference files.
This works fine in the GATK tools, and could be standard in other
tools as well. The meta data is already there, I just don't see it
put to any use anywhere, except for informational purposes - which,
in my humble opinion, is a pity.
The only problem I see, is that it could be tricky to display a
meaningful message, when the input data doesn't contain the
necessary meta data. Then it is left to the underlying tool to shout
out about the error. An elegant solution to this error-problem could
be an extension to the command scripting language, or perhaps
someone has an other idea?
Hope this information was useful to some of you Galaxy tool nerds
out there :-)
Kind regards, and thanks for a great framework
- Frank
--
Frank Sørensen, B.Sc., Programmer
Molecular Diagnostic Laboratory (MDL)
Molekylær Medicinsk Afdeling (MOMA)
Århus Universitetshospital Skejby, Brendstrupgårdsvej, 8200 Århus
N
Tlf. +45 7845 5363