On Sep 2, 2011, at 3:02 PM, Edward Kirton wrote:
What, like a BAM file of unaligned reads? Uses gzip compression, and tracks the pairing information explicitly :) Some tools will already take this as an input format, but not all.
Ah, yes, precisely. I actually think Illumina's pipeline produces files in this format now. Wrappers that create a temporary FASTQ file would need to be written, but that's easy enough.
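For illustration, here is a minimal sketch of such a wrapper step, using only the Python standard library. It assumes the BAM holds unaligned reads only (so no reverse-complementing is needed), that qualities are present, and that mates are marked with the usual 0x40/0x80 flags. The function name `bam_to_fastq` and the `/1`, `/2` naming are hypothetical choices for this example, not anything Illumina's pipeline or samtools prescribes.

```python
import gzip
import struct

# 4-bit base codes used by BAM's packed sequence encoding (SAM/BAM spec).
BASES = "=ACMGRSVTWYHKDBN"


def _read_exact(handle, n):
    """Read exactly n bytes or fail loudly on a truncated stream."""
    data = handle.read(n)
    if len(data) != n:
        raise EOFError("truncated BAM stream")
    return data


def bam_to_fastq(bam_path, fastq_path):
    """Dump every record of an unaligned BAM as FASTQ; return the record count.

    Assumes unaligned reads with stored qualities (no 0xFF placeholders).
    """
    count = 0
    # BGZF is a series of gzip members, so the gzip module can decompress it.
    with gzip.open(bam_path, "rb") as bam, open(fastq_path, "w") as out:
        if _read_exact(bam, 4) != b"BAM\x01":
            raise ValueError("not a BAM file")
        (l_text,) = struct.unpack("<i", _read_exact(bam, 4))
        _read_exact(bam, l_text)  # skip the SAM header text
        (n_ref,) = struct.unpack("<i", _read_exact(bam, 4))
        for _ in range(n_ref):  # skip the reference dictionary
            (l_name,) = struct.unpack("<i", _read_exact(bam, 4))
            _read_exact(bam, l_name + 4)  # name plus the 4-byte length field
        while True:
            size_bytes = bam.read(4)
            if not size_bytes:
                break  # clean end of file
            (block_size,) = struct.unpack("<i", size_bytes)
            block = _read_exact(bam, block_size)
            # Fixed 32-byte core of each alignment record.
            (_ref_id, _pos, l_read_name, _mapq, _bin, n_cigar, flag, l_seq,
             _next_ref, _next_pos, _tlen) = struct.unpack("<iiBBHHHiiii", block[:32])
            off = 32
            name = block[off:off + l_read_name - 1].decode("ascii")  # drop NUL
            off += l_read_name + 4 * n_cigar  # skip name and CIGAR
            n_packed = (l_seq + 1) // 2
            packed = block[off:off + n_packed]
            off += n_packed
            # Two bases per byte, high nibble first; trim the padding nibble.
            seq = "".join(BASES[b >> 4] + BASES[b & 0xF] for b in packed)[:l_seq]
            # BAM stores raw Phred scores; FASTQ wants them offset by 33.
            qual = "".join(chr(q + 33) for q in block[off:off + l_seq])
            # Flags 0x40 / 0x80 mark the first / second read of a pair.
            suffix = "/1" if flag & 0x40 else "/2" if flag & 0x80 else ""
            out.write(f"@{name}{suffix}\n{seq}\n+\n{qual}\n")
            count += 1
    return count
```

Even in this stripped-down form the whole file is decompressed and rewritten as text before the downstream tool sees a single read, which is the cost being weighed in this thread.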
My argument against that is that the cost of going from BAM -> temp FASTQ may be prohibitive: having to generate very large temporary FASTQ files on the fly as input for various applications may lead one back to just keeping a permanent FASTQ around anyway. One could probably get better performance out of a simpler format that removes most of the 'AM' parts of BAM.

Or is the idea that the file itself is modified, like a database? And how would indexing work (BAM uses binning on the match to the reference seq), or does it matter?

I recall HDF5 was planned as an alternate format (PacBio uses it, IIRC), and of course there is NCBI's .sra format. Is anyone using the latter two?

chris
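To make the indexing question concrete, here is a small sketch of the bin calculation that BAM indexing relies on, transcribed into Python from the `reg2bin` function given in the SAM/BAM specification. The example coordinates are made up, and the remark about unplaced reads follows my reading of the spec (position stored as -1), so treat it as an assumption to check rather than a statement about any particular tool.

```python
def reg2bin(beg, end):
    """Bin number for the zero-based, half-open interval [beg, end).

    Follows the hierarchical binning scheme in the SAM/BAM specification:
    the smallest bin that fully contains the interval is chosen.
    """
    end -= 1
    if beg >> 14 == end >> 14:  # fits in one 16 kbp bin
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:  # 128 kbp
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:  # 1 Mbp
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:  # 8 Mbp
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:  # 64 Mbp
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0  # the single 512 Mbp root bin


# A read aligned at 100000..100100 lands in a 16 kbp leaf bin.
print(reg2bin(100000, 100100))  # 4687

# A read with no coordinate is stored with pos = -1, so every such read
# gets the same bin and the index cannot tell them apart.
print(reg2bin(-1, 0))  # 4680
```

Since the bin depends only on reference coordinates, a file of unaligned reads gains nothing from this kind of index; random access would need a different key, such as read name or record offset.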