bam split and gatk calling

6 May 2015

      Hello,

I have been working in the Galaxy parallelization module and I would like
to ask you some questions that I have about how to face one problem.
I have done one pull request about splitting bams:
https://github.com/galaxyproject/galaxy/pull/184

Regarding this, I think it is useful but it could be more while accessing
somehow the interval. I better explain it with an example:
If I define a simple tool like this, with the parallelism tag "actived":

<tool id="gatk" name="call with gatk">
  <description>gatk</description>
*  <parallelism method="multi" split_mode="by_interval"
split_size="100000000" merge_outputs="output" split_inputs="input"
...
</parallelism>*
<command>
## by_rname
ln -s $input input.bam;
samtools index input.bam;
UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/hg19_ucsc.fa -I input.bam
-o $output -L *REGION* ;

  </command>
  <inputs>
    <param format="bam" name="input" type="data" label="bam"/>
  </inputs>
  <outputs>
      <data format="vcf" name="output" />
  </outputs>

  <help>
  bwa
  </help>
</tool>

The region is based on the field split_size, it is better explained in the
PR.
How does the code from the PR work? It goes through the bam file and does
something like "samtools view *REGION *-o bam_splitted.bam", so then GATK
does the calling for this small bam, but what is the problem? As you know,
in the software GATK if you don't pass the region as an argument in the
command line it goes through all the genome, so it is very slow. So, what
would you recommend to me to be able to pass this information to GATK? I
was thinking to create, at the same time the bam is splitted, a file
region.bed and use it in the tool definition xml, so the command would be
like this:
  <command>
...
UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/hg19_ucsc.fa -I input.bam
-o $output -L *region.bed*;
</command>

This solution does not convince me too much because it is a bit intrusive
in the tool definition and also because you have to trust that the
*region.bed* file exists.
Do you have any opinion, suggestion...?

Thanks a lot!

Best regards

-- 
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3
(junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: ralonso@cipf.es

Roberto Alonso CIPF

Peter Cock

Roberto Alonso CIPF

Peter Cock

Roberto Alonso CIPF

Joshua Udall

Roberto Alonso CIPF

tags

participants (3)