GATK best practices with local installation of Galaxy
Hello guys,

I'm trying to run a pipeline of the best practices for SNP and indel discovery as described by the people at Broad, and I'm running into trouble with the GATK tools in a local installation of Galaxy.

The main problem is that merging BAM files with the samtools merge tool doesn't keep the read group for each sample, which causes "Count Covariates" to crash. The pipeline works fine with a single BAM file, but I need to realign at least two files at a time.

Is there a way to set the read group of a merged BAM inside Galaxy? Are there plans to include the "merge" tool from Picard in Galaxy? Is there an easy way for me to do this locally? (I would like to run this in the cloud later on, when the workflow is ready.)

Thanks!
Camille

--
Camille Stephan-Otto Attolini, PhD
Senior Research Officer, Bioinformatics and Biostatistics unit
IRB Barcelona
Tel (+34) 93 402 0553
Camille,

thanks for reporting this. I think you have found a bug: we definitely need to be able to preserve metadata when we merge BAMs. Thanks for your suggestion of using MergeSamFiles. Yes, I think it might be a good fix for this problem, but it will take a little while, and it won't reach the Main site for a few weeks once it's done. It is possible to write your own wrapper locally if you need it fast.

Sorry for the inconvenience and thanks again.

In reply to the message from Camille Stephan <camille.stephan@irbbarcelona.org> of Wed, Aug 3, 2011 at 6:15 PM.
___________________________________________________________ The Galaxy User list should be used for the discussion of Galaxy analysis and other features on the public server at usegalaxy.org. Please keep all replies on the list by using "reply all" in your mail client. For discussion of local Galaxy instances and the Galaxy source code, please use the Galaxy Development list:
http://lists.bx.psu.edu/listinfo/galaxy-dev
To manage your subscriptions to this and other Galaxy lists, please use the interface at:
-- Ross Lazarus MBBS MPH; Associate Professor, Harvard Medical School; Director of Bioinformatics, Channing Lab; Tel: +1 617 505 4850; Head, Medical Bioinformatics, BakerIDI; Tel: +61 385321444;
Hi Ross,

thanks for your answer. I found a dirty fix for merging pairs of BAM files, though I had to change a couple of things in my local installation:

- Add read groups to each BAM file separately using Picard's "Add or Replace Groups" tool (with ID=s1 and ID=s2 for each file).
- Create the "rg.txt" file containing something like this:

    @RG ID:s1 SM:s1 LB:s1 PL:Illumina
    @RG ID:s2 SM:s2 LB:s2 PL:Illumina

- Modify sam_merge.py to call:

    "samtools merge -rh path/to/rg.txt %s %s..."

It works. The problem is that all (pairs of) files will end up with the same IDs and labels unless the rg.txt file is changed every time. Would it be very difficult to add to the Galaxy wrapper the option of creating rg.txt on the fly and adding the -h option to the samtools call?

I'm not familiar with creating wrappers for Galaxy. Any suggestion as to where to start?

Thanks again,
Camille

In reply to the message from Ross <ross.lazarus@gmail.com> of Wed, Aug 3, 2011 at 2:34 PM.
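Creating rg.txt on the fly, as asked above, could look roughly like the sketch below. It is not the Galaxy wrapper's code: the names `build_rg_header` and `build_merge_command` are hypothetical, the tab separators follow the SAM header convention, and the `-rh` flags are simply copied from the call quoted in the message.

```python
def build_rg_header(sample_ids, platform="Illumina"):
    """Return one @RG header line per sample as a single string.

    Mirrors the rg.txt example from the message: ID, SM and LB all take
    the sample id. SAM header fields are tab-separated.
    """
    lines = []
    for sample_id in sample_ids:
        fields = [
            "@RG",
            "ID:%s" % sample_id,
            "SM:%s" % sample_id,
            "LB:%s" % sample_id,
            "PL:%s" % platform,
        ]
        lines.append("\t".join(fields))
    return "".join(line + "\n" for line in lines)


def build_merge_command(rg_path, output_bam, input_bams):
    """Build the argument list for the samtools call quoted in the message."""
    return ["samtools", "merge", "-rh", rg_path, output_bam] + list(input_bams)


if __name__ == "__main__":
    # Hypothetical sample ids and file names, for illustration only.
    print(build_rg_header(["s1", "s2"]), end="")
    print(" ".join(build_merge_command("rg.txt", "merged.bam", ["s1.bam", "s2.bam"])))
```

Deriving the sample ids from each input dataset, rather than hard-coding s1 and s2, would avoid every merged pair ending up with the same labels.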
Hi, Camille,

I can see this really needs a 'proper' fix - preferably taking advantage of the automated header merge. Preserving the metadata from each BAM automatically is safer and less error-prone, but you could use the existing "Replace sam/bam header" tool to do the surgery once you have a correct header in SAM format in your history?

I'm currently testing changes which replace the current samtools merge code with a call to Picard MergeSamFiles. I'll add a switch to control whether all input headers are merged in case there are situations where it's not wanted.

I'll let you know when you can try it out on our test instance and which revision of the galaxy-central repository contains the changes so you can get it working on your local installation.

In reply to the message from Camille Stephan <camille.stephan@irbbarcelona.org> of Wed, Aug 3, 2011 at 11:49 PM.
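One way to check that a header is "correct" before using it for the replacement is to confirm that it declares every read group you expect. The sketch below is illustrative only: `read_group_ids` and `missing_read_groups` are hypothetical helpers, and the header is assumed to be plain SAM text with tab-separated fields.

```python
def read_group_ids(header_text):
    """Collect the ID value of every @RG line in a SAM header."""
    ids = []
    for line in header_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "@RG":
            continue
        for field in fields[1:]:
            if field.startswith("ID:"):
                ids.append(field[3:])
    return ids


def missing_read_groups(header_text, expected_ids):
    """Return the expected read group ids that the header does not declare."""
    present = set(read_group_ids(header_text))
    return [rg_id for rg_id in expected_ids if rg_id not in present]


if __name__ == "__main__":
    # Made-up header for illustration; only s1 is declared.
    header = (
        "@HD\tVN:1.0\tSO:coordinate\n"
        "@SQ\tSN:chr1\tLN:1000\n"
        "@RG\tID:s1\tSM:s1\tLB:s1\tPL:Illumina\n"
    )
    print(missing_read_groups(header, ["s1", "s2"]))
```

An empty result suggests the header carries all the expected read groups; it says nothing about whether the reads themselves are tagged consistently.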
Hi, Camille.

If you can find some time to upload some of your BAM files, could you please test the revised BAM merge tool on http://test.g2.bx.psu.edu/ and let me know how you go. This won't be on the main site until the next scheduled update in a few weeks.

If you need this locally, the changes are in galaxy-central, from where anyone can grab them. The key file you need to update is tools/samtools/sam_merge.xml, and you'll also need MergeSamFiles.jar from a recent Picard release to be available in your tool-data/shared/jars directory.

Hope this helps - thanks for pointing out the bug.

In reply to the message from Ross <ross.lazarus@gmail.com> of Thu, Aug 4, 2011 at 12:02 PM.
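For a local installation, the Picard call behind the revised tool might look roughly like the following. This is a sketch, not the contents of sam_merge.xml: `build_picard_merge_command` is a hypothetical helper, the jar path simply joins the directory and jar name mentioned above, and the `INPUT=`/`OUTPUT=` arguments assume the KEY=VALUE syntax of Picard command-line tools of that period.

```python
import os


def build_picard_merge_command(input_bams, output_bam,
                               jar_dir="tool-data/shared/jars"):
    """Build an argument list for Picard MergeSamFiles.

    jar_dir is an assumption taken from the directory named in the
    message; adjust it to wherever MergeSamFiles.jar actually lives.
    """
    if not input_bams:
        raise ValueError("at least one input BAM is required")
    jar = os.path.join(jar_dir, "MergeSamFiles.jar")
    command = ["java", "-jar", jar]
    # Picard takes one INPUT= argument per file to merge.
    command += ["INPUT=%s" % bam for bam in input_bams]
    command.append("OUTPUT=%s" % output_bam)
    return command


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_picard_merge_command(["s1.bam", "s2.bam"], "merged.bam")))
```

The resulting list can be passed to `subprocess.check_call` once the jar and a Java runtime are in place. As far as I know, MergeSamFiles combines the @RG lines of its inputs in the output header, which is the behaviour this thread is after.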
participants (2): Camille Stephan, Ross