Hi Roberto,
Given the way BAM indexing works, I see no reason to actually
split the BAM file at all - it seems like wasted disk IO.
Instead, can you split a BED file into sub-regions? This way
each child GATK job would look at the full BAM file but only for
a small region described in the split BED region file?
Peter
> ___________________________________________________________
On Wed, May 6, 2015 at 11:19 AM, Roberto Alonso CIPF <ralonso@cipf.es> wrote:
> Hello,
>
> I have been working in the Galaxy parallelization module and I would like to
> ask you some questions that I have about how to face one problem.
> I have done one pull request about splitting bams:
> https://github.com/galaxyproject/galaxy/pull/184
>
> Regarding this, I think it is useful but it could be more while accessing
> somehow the interval. I better explain it with an example:
> If I define a simple tool like this, with the parallelism tag "actived":
>
> <tool id="gatk" name="call with gatk">
> <description>gatk</description>
> <parallelism method="multi" split_mode="by_interval"
> split_size="100000000" merge_outputs="output" split_inputs="input"
>></parallelism>
>
> <command>
> ## by_rname
> ln -s $input input.bam;
> samtools index input.bam;
> UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/hg19_ucsc.fa -I input.bam
> -o $output -L REGION ;
>
> </command>
> <inputs>
> <param format="bam" name="input" type="data" label="bam"/>
> </inputs>
> <outputs>
> <data format="vcf" name="output" />
> </outputs>
>
> <help>
> bwa
> </help>
> </tool>
>
> The region is based on the field split_size, it is better explained in the
> PR.
> How does the code from the PR work? It goes through the bam file and does
> something like "samtools view REGION -o bam_splitted.bam", so then GATK does
> the calling for this small bam, but what is the problem? As you know, in the
> software GATK if you don't pass the region as an argument in the command
> line it goes through all the genome, so it is very slow. So, what would you
> recommend to me to be able to pass this information to GATK? I was thinking
> to create, at the same time the bam is splitted, a file region.bed and use
> it in the tool definition xml, so the command would be like this:
> <command>
> ...
> UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/hg19_ucsc.fa -I input.bam
> -o $output -L region.bed;
> </command>
>
> This solution does not convince me too much because it is a bit intrusive in
> the tool definition and also because you have to trust that the region.bed
> file exists.
> Do you have any opinion, suggestion...?
>
> Thanks a lot!
>
> Best regards
>
>
> --
> Roberto Alonso
> Functional Genomics Unit
> Bioinformatics and Genomics Department
> Prince Felipe Research Center (CIPF)
> C./Eduardo Primo Yúfera (Científic), nº 3
> (junto Oceanografico)
> 46012 Valencia, Spain
> Tel: +34 963289680 Ext. 1021
> Fax: +34 963289574
> E-Mail: ralonso@cipf.es
>
> Please keep all replies on the list by using "reply all"
> in your mail client. To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
> https://lists.galaxyproject.org/
>
> To search Galaxy mailing lists use the unified search at:
> http://galaxyproject.org/search/mailinglists/