Re: [galaxy-dev] disk space and file formats

2 Sep 2011

      On Sep 2, 2011, at 3:02 PM, Edward Kirton wrote:
...
...
What, like a BAM file of unaligned reads? Uses gzip compression, and
tracks the pairing information explicitly :) Some tools will already take
this as an input format, but not all.
ah, yes, precisely.  i actually think illumina's pipeline produces
files in this format now.
wrappers which create a temporary fastq file would need to be created
but that's easy enough.
My argument against that is that the cost of going from BAM -> temp FASTQ may be prohibitive: having to generate very large temp FASTQ files on the fly as input for various applications may lead one back to just keeping a permanent FASTQ around anyway.  One could probably get better performance out of a simpler format that removes most of the 'AM' parts of BAM.  Or is the idea that the file itself is modified, like a database?  And how would indexing work (BAM uses binning on the match to the reference sequence), or does it matter?
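
For reference, here is a minimal sketch of the binning scheme mentioned above, ported to Python from the reg2bin() routine given in the SAM/BAM specification. It assumes 0-based, half-open coordinates [beg, end) on the reference. The point it illustrates is that the bin is computed purely from where a read aligns, so reads with no alignment have nothing meaningful to bin on.

```python
# Sketch only: a Python port of the reg2bin() routine from the SAM/BAM
# specification (UCSC-style binning). Coordinates are assumed to be
# 0-based and half-open, i.e. the interval [beg, end) on the reference.

def reg2bin(beg, end):
    """Return the smallest bin that fully contains [beg, end)."""
    end -= 1  # switch to an inclusive end coordinate
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kbp bins
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kbp bins
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mbp bins
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mbp bins
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mbp bins
    return 0  # bin 0 spans the whole 512 Mbp range


# A 100 bp alignment at the start of a reference lands in the first
# 16 kbp bin; one that crosses a 16 kbp boundary moves up a level.
print(reg2bin(0, 100))             # 4681
print(reg2bin(0, (1 << 14) + 1))   # 585
```

If I read the spec correctly, reads with no position at all are given the placeholder bin reg2bin(-1, 0) = 4680, so in a file made up entirely of unaligned reads every record would share one bin and the index would not help with lookup.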

I recall HDF5 was planned as an alternate format (PacBio uses it, IIRC), and of course there is NCBI's .sra format.  Anyone using the latter two?

chris

Fields, Christopher J