Getting meta data from input files

15 Mar 2012

Hi All,

We would like to use the new GATK modules in our DNA pipeline, so I have 
tried to run the tools from the "Analyse data" menu. After setting up the 
appropriate tables in gatk_sorted_picard_index.loc, I got them to run as 
expected. However, when I tried to run them from a workflow, things didn't 
turn out so well - in fact they couldn't run at all.

I know that the GATK tools are still in beta, so I looked into the tool 
xml-wrappers and fixed the error after some poking around, but in the 
process of debugging the wrapper, I realized one general thing that I 
found quite odd.

It's about meta data in Galaxy's input data file representations.

When a tool needs a reference genome (e.g. a mapping tool, or one of 
the bam analysis tools), there are always only two options when it comes 
to the source of the reference genome: 1: get the reference data from 
history, or 2: use a built-in genome (sometimes referred to as cached). 
In either case the user has to select the reference genome before running 
the tool. This is fine, but what happens if two or more of these tools 
are called in the same workflow?

Well - then the workflow designer can do one of three things:

1: Choose a design that allows the workflow to work on only one genome, 
   by selecting the "built-in" option and the proper genome, and then 
   define one workflow per genome, i.e. by cloning the workflow and 
   changing the parameters for each tool.

2: Set the state of all genome selection fields to "set at runtime". 
   This implies, of course, that the user running the workflow must go 
   through the genome selection fields on all tools and select the 
   proper genome before the workflow is executed.

3: Set all genome selections to history, and specify a common workflow 
   input for the genome reference for all tools in the workflow. This 
   approach could be problematic though, since not all tools use the 
   same reference file format.

Clearly none of these methods is ideal when working with data from 
several genomes. So there is a fourth option that I miss in the current 
Galaxy tool implementations (actually I thought that this was what 
"cached" meant, until I looked at the xml-files).

Namely, the ability to get the genome reference file at runtime from the 
input data file's meta data. This implies that the tools should have an 
extra reference source selection option: "From input meta data". This 
would allow workflow designers to forget all about reference data, since 
the tools will automatically pick the appropriate reference genome based 
on the input file's meta data.

In fact this is not so difficult to implement with the operations that 
are currently available in the wrapper xml / command language.

In the <inputs> section of the tool XML-file, the genome reference tags 
could look like this (the example is from the fixed GATK "Count 
Covariates on BAM files" tool XML file "count_covariates.xml"):

     .
     .
     .
<param name="input_bam" type="data" format="bam" label="BAM file">
  <validator type="unspecified_build" />
  <validator type="dataset_metadata_in_data_table"
             table_name="gatk_picard_indexes"
             metadata_name="dbkey"
             metadata_column="2"
             message="Sequences are not currently available for the specified build." />
</param>

<conditional name="reference_source">
  <param name="reference_source_selector" type="select" label="Choose the source for the reference list">
    <option value="meta_data">From input file meta data</option>
    <option value="internal">Internal reference</option>
    <option value="history">History</option>
  </param>
  <when value="internal">
    <param name="ref_file" type="select" label="Select a reference genome">
      <options from_data_table="gatk_picard_indexes">
        <filter type="sort_by" column="2" />
        <validator type="no_options" message="No indexes are available" />
      </options>
    </param>
  </when>
  <when value="history">
    <param name="ref_file" type="data" format="fasta" label="Using reference file" />
  </when>
</conditional>
     .
     .
     .

Note that no genome selection box is shown in the GUI when the 'From 
input meta data' option is selected, since it wouldn't make much sense.

Then in the command section it could read something like this:

     .
     .
     .
     #if $reference_source.reference_source_selector == "internal":
         -R "${reference_source.ref_file.fields.path}"
     #end if

     #if $reference_source.reference_source_selector == "meta_data":
         -R "${ filter( lambda x: str( x[1] ) == str( $input_bam.metadata.dbkey ), $__app__.tool_data_tables['gatk_picard_indexes'].get_fields() )[0][3] }"
     #end if

     #if str( $reference_source.reference_source_selector ) == "history":
         -d "-R" "${reference_source.ref_file}" "${reference_source.ref_file.ext}" "gatk_input"
     #end if
     .
     .
     .

When "From input meta data" is selected as the reference source, the 
second if-statement above performs a lookup at run-time: it reads the 
dbkey from the input file's meta data and retrieves the file path to the 
corresponding genome reference data. The same approach should carry over 
to other tools, since each tool can get the file path from its own 
auxiliary reference table. In this case the table is called 
'gatk_picard_indexes', but other tools might use other reference files.
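
For readers who do not want to untangle the one-line filter/lambda 
expression, here is a minimal Python sketch of the same lookup. The rows, 
the file paths and the function name reference_path_for are made up for 
illustration; only the column positions (dbkey at index 1, path at index 
3) are taken from the expression above, and the meaning of the other 
columns is an assumption.

```python
# Hypothetical rows standing in for the 'gatk_picard_indexes' data table.
# Assumed column order: value, dbkey, display name, path.
GATK_PICARD_INDEXES = [
    ["hg19", "hg19", "Human (hg19)", "/data/genomes/hg19/hg19.fa"],
    ["mm9", "mm9", "Mouse (mm9)", "/data/genomes/mm9/mm9.fa"],
]

DBKEY_COLUMN = 1  # x[1] in the template expression
PATH_COLUMN = 3   # [3] in the template expression


def reference_path_for(dbkey, rows):
    """Return the path column of the first row whose dbkey column matches."""
    matches = [row for row in rows if str(row[DBKEY_COLUMN]) == str(dbkey)]
    # Like the template expression, this raises IndexError if nothing matches.
    return matches[0][PATH_COLUMN]


print(reference_path_for("mm9", GATK_PICARD_INDEXES))
```

In the wrapper, the rows would come from the tool data table and the 
dbkey from the input dataset's meta data, exactly as in the second 
if-statement.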

This works fine in the GATK tools, and could be standard in other tools 
as well. The meta data is already there, I just don't see it put to any 
use anywhere, except for informational purposes - which, in my humble 
opinion, is a pity.

The only problem I see is that it could be tricky to display a 
meaningful message when the input data doesn't contain the necessary 
meta data. Then it is left to the underlying tool to shout out about the 
error. An elegant solution to this error problem could be an extension 
to the command scripting language, or perhaps someone has another idea?
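
Just to illustrate what such a check could report, here is a rough Python 
sketch of a lookup that fails with a readable message instead of a bare 
index error. Nothing here is an existing Galaxy API: the exception class, 
the function name resolve_reference and the sample rows are hypothetical, 
and treating "?" as the marker for an unspecified build is an assumption 
on my part.

```python
class ReferenceLookupError(Exception):
    """Raised when no reference genome can be derived from the meta data."""


def resolve_reference(dbkey, rows, table_name="gatk_picard_indexes"):
    """Return the reference path for dbkey, or raise a descriptive error.

    rows is a list of table rows with the dbkey at index 1 and the path
    at index 3, as in the lookup expression in the command section.
    """
    key = str(dbkey).strip()
    # Assumption: an unset build shows up as an empty value, "None" or "?".
    if key in ("", "None", "?"):
        raise ReferenceLookupError(
            "The input dataset has no genome build (dbkey) set, "
            "so no reference genome can be selected from its meta data."
        )
    for row in rows:
        if str(row[1]) == key:
            return row[3]
    raise ReferenceLookupError(
        "No reference for build '%s' was found in table '%s'." % (key, table_name)
    )


# Hypothetical sample row: value, dbkey, display name, path.
rows = [["hg19", "hg19", "Human (hg19)", "/data/genomes/hg19/hg19.fa"]]

try:
    resolve_reference("?", rows)
except ReferenceLookupError as err:
    print("Error: %s" % err)
```

How a message like this would actually be surfaced to the user from 
inside the command section is exactly the open question.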

Hope this information was useful to some of you Galaxy tool nerds out 
there :-)

Kind regards, and thanks for a great framework

- Frank

-- 
Frank Sørensen, B.Sc., Programmer
Molecular Diagnostic Laboratory (MDL)
Molekylær Medicinsk Afdeling (MOMA)
Århus Universitetshospital Skejby, Brendstrupgårdsvej, 8200 Århus N
Tlf. +45 7845 5363
