Hello,

I agree, what you say fits perfectly for GATK, but as I wanted to create a more generic code I did it this way (also because I am a newbie in the galaxy code and I didn't know so well how to implement this ). What about a tool that doesn't accept a region, just a bam? Maybe we can put another parameter in the parallelism tag that force to split the bam. 
Mostly, just to create a bed file would be better, right? 
What do you think?

Regards

On 6 May 2015 at 12:23, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Hi Roberto,

Given the way BAM indexing works, I see no reason to actually
split the BAM file at all - it seems like wasted disk IO.

Instead, can you split a BED file into sub-regions? This way
each child GATK job would look at the full BAM file but only for
a small region described in the split BED region file?

Peter


On Wed, May 6, 2015 at 11:19 AM, Roberto Alonso CIPF <ralonso@cipf.es> wrote:
> Hello,
>
> I have been working in the Galaxy parallelization module and I would like to
> ask you some questions that I have about how to face one problem.
> I have done one pull request about splitting bams:
> https://github.com/galaxyproject/galaxy/pull/184
>
> Regarding this, I think it is useful but it could be more while accessing
> somehow the interval. I better explain it with an example:
> If I define a simple tool like this, with the parallelism tag "actived":
>
> <tool id="gatk" name="call with gatk">
>   <description>gatk</description>
>   <parallelism method="multi" split_mode="by_interval"
> split_size="100000000" merge_outputs="output" split_inputs="input"
>></parallelism>
>
>   <command>
> ## by_rname
> ln -s $input input.bam;
> samtools index input.bam;
> UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/hg19_ucsc.fa -I input.bam
> -o $output -L REGION ;
>
>   </command>
>   <inputs>
>     <param format="bam" name="input" type="data" label="bam"/>
>   </inputs>
>   <outputs>
>       <data format="vcf" name="output" />
>   </outputs>
>
>   <help>
>   bwa
>   </help>
> </tool>
>
> The region is based on the field split_size, it is better explained in the
> PR.
> How does the code from the PR work? It goes through the bam file and does
> something like "samtools view REGION -o bam_splitted.bam", so then GATK does
> the calling for this small bam, but what is the problem? As you know, in the
> software GATK if you don't pass the region as an argument in the command
> line it goes through all the genome, so it is very slow. So, what would you
> recommend to me to be able to pass this information to GATK? I was thinking
> to create, at the same time the bam is splitted, a file region.bed and use
> it in the tool definition xml, so the command would be like this:
>   <command>
> ...
> UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/hg19_ucsc.fa -I input.bam
> -o $output -L region.bed;
> </command>
>
> This solution does not convince me too much because it is a bit intrusive in
> the tool definition and also because you have to trust that the region.bed
> file exists.
> Do you have any opinion, suggestion...?
>
> Thanks a lot!
>
> Best regards
>
>
> --
> Roberto Alonso
> Functional Genomics Unit
> Bioinformatics and Genomics Department
> Prince Felipe Research Center (CIPF)
> C./Eduardo Primo Yúfera (Científic), nº 3
> (junto Oceanografico)
> 46012 Valencia, Spain
> Tel: +34 963289680 Ext. 1021
> Fax: +34 963289574
> E-Mail: ralonso@cipf.es
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   https://lists.galaxyproject.org/
>
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/



--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3
(junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: ralonso@cipf.es