Getting meta data from input files
Hi All, We would like to use the new GATK modules in our DNA pipeline, so I have tried to run the tools from the "Analyse data" menu After setting up the appropriate tables in gatk_sorted_picard_index.loc, I made it run as expected. However when I tried to run it from a workflow, things didn't turn out so well - in fact it couldn't run at all. I know that the GATK tools are still in beta, so I looked into the tool xml-wrappers and fixed the error after some poking around, but in the process of debugging the wrapper, I realized one general thing that I found quite odd. It's about meta data in Galaxy's input data file representations. When a tool needs a reference genome (i.e. some mapping tool, or one of the bam analysis tools), there are always only two options when it comes to the source of the reference genome: 1: Get the reference data from history or 2: use built in (sometime referred to as cached). In any case the user has to select the reference genome before running the tool. This is fine, but what happens if two or more of these tools are called in the same workflow? Well - Then the workflow designer can 1: choose a design that allows the workflow to work only on one genome by selecting the "built-in" option and select the proper genome, and then define one workflow pr. genome - i.e by cloning the workflow and change the parameters for each tool, or 2: Set the state of all genome selection fields to "set at runtime". This implies of course, that the user running the workflow must go through all genome selection fields on all tools, and select the proper genome before the workflow is executed, OR 3:Set all genome selections to history, and specify a common workflow input for the genome reference reference input for all tools in the workflow - this approach could be problematic though, since not all tools use the same reference file format. Clearly none of these methods are ideal, when working with data form several genomes. So there is a fourth option that I miss in the current Galaxy tool implementations (actually I thought that this was what "cached" meant, until I looked at the xml-files). Namely the ability to get the genome reference file at runtime from the input data file meta data. This implies that the tools should have an extra reference source selection option: "From input meta data". This would allow workflow designers to forget all about reference data, since the tools automatically will pick the appropriate reference genome from the input file's meta data. In fact this is not so difficult to implement with the operations that are currently available in the wrapper xml / command language. In the <inputs> section of the tool XML-file, the genome reference tags could look like this (the example is from the fixed GATK "Count Covariates on BAM files" tool XML file "count_covariates.xml"): . . . <param name="input_bam" type="data" format="bam" label="BAM file"> <validator type="unspecified_build" /> <validator type="dataset_metadata_in_data_table" table_name="gatk_picard_indexes" metadata_name="dbkey" metadata_column="2" message="Sequences are not currently available for the specified build." /> </param> <conditional name="reference_source"> <param name="reference_source_selector" type="select" label="Choose the source for the reference list"> <option value="meta_data">From input file meta data</option> <option value="internal">Internal reference</option> <option value="history">History</option> </param> <when value="internal"> <param name="ref_file" type="select" label="Select a reference genome"> <options from_data_table="gatk_picard_indexes"> <filter type="sort_by" column="2" /> <validator type="no_options" message="No indexes are available" /> </options> </param> </when> <when value="history"> <param name="ref_file" type="data" format="fasta" label="Using reference file" /> </when> </conditional> . . . Note that there is no genome selection box on the GUI, if the 'From input meta data' option is selected, since it wouldn't make much sense. Then in the command section it could read something like this: . . . #if $reference_source.reference_source_selector == "internal": -R "${reference_source.ref_file.fields.path}" #end if #if $reference_source.reference_source_selector == "meta_data": -R "${ filter( lambda x: str( x[1] ) == str( $input_bam.metadata.dbkey ), $__app__.tool_data_tables['gatk_picard_indexes'].get_fields() )[0][3] }" #end if #if str( $reference_source.reference_source_selector ) == "history": -d "-R" "${reference_source.ref_file}" "${reference_source.ref_file.ext}" "gatk_input" #end if . . . When "From input meta data" is selected as the reference source, the second if-statement above performs a lookup at run-time, to check the state of the meta-data and retrieve the file path to the corresponding genome reference data. This will always work, since another tool can get the filepath from it's own axillary reference table, in this case the table is called 'gatk_picard_indexes', but other tools might use other reference files. This works fine in the GATK tools, and could be standard in other tools as well. The meta data is already there, I just don't see it put to any use anywhere, except for informational purposes - which, in my humble opinion, is a pity. The only problem I see, is that it could be tricky to display a meaningful message, when the input data doesn't contain the necessary meta data. Then it is left to the underlying tool to shout out about the error. An elegant solution to this error-problem could be an extension to the command scripting language, or perhaps someone has an other idea? Hope this information was useful to some of you Galaxy tool nerds out there :-) Kind regards, and thanks for a great framework - Frank -- Frank Sørensen, B.Sc., Programmer Molecular Diagnostic Laboratory (MDL) Molekylær Medicinsk Afdeling (MOMA) Århus Universitetshospital Skejby, Brendstrupgårdsvej, 8200 Århus N Tlf. +45 7845 5363
participants (1)
-
Frank Sørensen