Re: [galaxy-dev] disk space and file formats

3 Sep 2011

      On Fri, Sep 2, 2011 at 9:27 PM, Fields, Christopher J
<cjfields@illinois.edu> wrote:
...
On Sep 2, 2011, at 3:02 PM, Edward Kirton wrote:
...
...
What, like a BAM file of unaligned reads? Uses gzip compression, and
tracks the pairing information explicitly :) Some tools will already take
this as an input format, but not all.
ah, yes, precisely.  i actually think illumina's pipeline produces
files in this format now.
Oh do they? - that's interesting. Do you have a reference/link?
...
...
wrappers which create a temporary fastq file would need to be created
but that's easy enough.
My argument against that is the cost of going from BAM -> temp
fastq may be prohibitive, e.g. the need to generate very large
temp fastq files on the fly as input for various applications may
lead one back to just keeping a permanent FASTQ around anyway.
True - if you can't update the tools you need to take BAM.
In some cases at least you can pipe the gzipped FASTQ
into alignment tools which accepts FASTQ on stdin, so
there is no temp file per se.
...
 One could probably get better performance out of a simpler
format that removes most of the 'AM' parts of BAM.
Yes, but that meaning inventing yet another file format. At least
gzipped FASTQ is quite straightforward.
...
Or is the idea that the file itself is modified, like a database?
That would be quite a dramatic change from the current
Galaxy workflow system - I doubt that would be acceptable
in general.
...
And how would indexing work (BAM uses binning on the
match to the reference seq), or does it matter?
BAM indexing as done in samtools/picard is only for the aligned
reads - so no help for a BAM file of unaligned reads. You could
use a different indexing system (e.g. by read name) and the
same BAM BGZF block offset system (I've tried this as an
experiment with Biopython's SQLite indexing of sequence files).

However, for tasks taking unaligned reads as input, you
generally just iterate over the reads in the order on disk.
...
I recall hdf5 was planned as an alternate format (PacBio uses
it, IIRC), and of course there is NCBI's .sra format.  Anyone
using the latter two?
Moving from the custom BGZF modified gzip format used in
BAM to HD5 has been proposed on the samtools mailing list
(as Chris knows), and there is a proof of principle implementation
too in BioHDF, http://www.hdfgroup.org/projects/biohdf/
The SAM/BAM group didn't seem overly enthusiastic though.

For the NCBI's .sra format, there is no open specification, just
their public domain source code:
http://seqanswers.com/forums/showthread.php?t=12054

Regards,

Peter

Re: [galaxy-dev] disk space and file formats

Peter Cock