Optimizing disk space by combining mapping with SAM-to-BAM
Hi all - I find it redundant to hold on to the SAM output from NGS mapping tools when I end up converting the SAM files to BAM files anyway. The cleanup scripts require the history items to be deleted, but I don't want to delete them yet, as I want the entire workflow to be kept until we are done analyzing our data.
So, I was thinking of a way to remove the intermediate SAM files and thought about how I would do this on the command line: simply pipe the output of BWA to samtools to create a BAM file and never have a SAM file to deal with.
The BWA tool runner can be modified to pipe BWA output directly to samtools so a SAM file is never physically stored on disk. Has anyone done this? Does this seem like a good idea?
Ryan
-- CONFIDENTIALITY NOTICE: This email communication may contain private, confidential, or legally privileged information intended for the sole use of the designated and/or duly authorized recipient(s). If you are not the intended recipient or have received this email in error, please notify the sender immediately by email and permanently delete all copies of this email including all attachments without reading them. If you are the intended recipient, secure the contents in a manner that conforms to all applicable state and/or federal requirements related to privacy and confidentiality of such information.
Hello Ryan,
I'm in the exact same situation with my bowtie/tophat tools, going back and forth between outputting a SAM, sorted SAM, BAM or sorted BAM, and I'm still not sure what the best method is.
Storage-wise, you're correct: just saving the sorted BAM is best (even more so given that processing SAM files as text is so horrendous that I think almost no tool uses them directly, always requiring intervals or sorted BAM).
But one annoyance (for me) is that samtools (the program) is very inefficient: it uses only a single thread (and the sort part isn't doing a great job at that).
So if I give the "mapping" tool as a whole 20 threads or more, and part of the running time (the samtools sort part) uses only a single thread, I'm wasting the other threads, as they sit idle waiting for the sort to finish.
I also tried sorting the SAM file directly using GNU sort (version 8.10 can use multiple threads, and the memory management actually works, as opposed to "samtools sort -m"), but I'm not sure it's worth the effort.
I didn't find an optimal solution that I like, and I'm interested to hear what others think.
-gordon
(In reply to Ryan Golhar's message of 04/05/2011 01:08 PM.)
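The pipe idea raised in this thread (mapper output streamed straight into samtools, so no SAM file is written) could be sketched roughly as below. This is only an illustration, not the Galaxy BWA tool runner: the helper names `pipe_commands` and `bwa_to_bam_commands` and all file names are hypothetical, and it assumes `bwa samse` writing SAM to stdout and `samtools view -bS -` reading SAM from stdin, both on the PATH.

```python
import subprocess


def pipe_commands(producer_cmd, consumer_cmd, output_path):
    """Run `producer | consumer > output_path` without a temporary file."""
    with open(output_path, "wb") as out:
        producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
        consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout, stdout=out)
        # Close our copy of the pipe so the producer sees a broken pipe
        # if the consumer exits early.
        producer.stdout.close()
        consumer_rc = consumer.wait()
        producer_rc = producer.wait()
    if producer_rc != 0 or consumer_rc != 0:
        raise RuntimeError(
            "pipeline failed: producer exit %d, consumer exit %d"
            % (producer_rc, consumer_rc)
        )
    return output_path


def bwa_to_bam_commands(reference, sai, fastq):
    """Build the two command lines (hypothetical helper, single-end reads)."""
    mapper = ["bwa", "samse", reference, sai, fastq]  # SAM on stdout
    converter = ["samtools", "view", "-bS", "-"]      # SAM on stdin, BAM on stdout
    return mapper, converter


if __name__ == "__main__":
    # Hypothetical input files; replace with real paths.
    mapper, converter = bwa_to_bam_commands("ref.fa", "reads.sai", "reads.fq")
    pipe_commands(mapper, converter, "reads.bam")
```

The resulting BAM is unsorted; sorting would still be a separate (single-threaded, per the discussion above) step. Checking both exit codes matters here, since a failed mapper would otherwise leave a truncated BAM that looks like a valid output.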
Hi Gordon - It looks like there is an active discussion on the samtools mailing list about speeding up sorting:
http://sourceforge.net/mailarchive/message.php?msg_id=27247076
http://sourceforge.net/mailarchive/message.php?msg_id=26990598
Interestingly enough, there is a recommendation to use Picard instead of samtools. Are there Galaxy tool scripts for Picard? This might be useful.
That still doesn't negate the fact that SAM files are being created and need to be converted to BAM files. Right now, I think I would rather sacrifice a little time on a single-threaded sort than lose disk space to SAM files unnecessarily.
Ryan
(In reply to Assaf Gordon's message of 4/5/11 1:18 PM.)
participants (2)
- Assaf Gordon
- Ryan Golhar