On Sep 2, 2011, at 8:02 PM, Peter Cock wrote:
On Fri, Sep 2, 2011 at 9:27 PM, Fields, Christopher J <cjfields@illinois.edu> wrote:
On Sep 2, 2011, at 3:02 PM, Edward Kirton wrote:
What, like a BAM file of unaligned reads? Uses gzip compression, and tracks the pairing information explicitly :) Some tools will already take this as an input format, but not all.
ah, yes, precisely. i actually think illumina's pipeline produces files in this format now.
Oh do they? - that's interesting. Do you have a reference/link?
Yeah, here at The Genome Institute at Wash-U, we get Illumina data directly in BAM format, and tries to avoid fastq conversion. The latest BWA supports a BAM of reads as input, as well as making BAM output. Hopefully most tools will go that way. You can always engineer something with named pipes in the mean time to avoid the read/write to real disk, but that requires some care.
wrappers which create a temporary fastq file would need to be created but that's easy enough.
My argument against that is the cost of going from BAM -> temp fastq may be prohibitive, e.g. the need to generate very large temp fastq files on the fly as input for various applications may lead one back to just keeping a permanent FASTQ around anyway.
True - if you can't update the tools you need to take BAM. In some cases at least you can pipe the gzipped FASTQ into alignment tools which accepts FASTQ on stdin, so there is no temp file per se.
One could probably get better performance out of a simpler format that removes most of the 'AM' parts of BAM.
Yes, but that meaning inventing yet another file format. At least gzipped FASTQ is quite straightforward.
Or is the idea that the file itself is modified, like a database?
That would be quite a dramatic change from the current Galaxy workflow system - I doubt that would be acceptable in general.
And mutable data structure like that are harder to manage in a high-throughput environment.
And how would indexing work (BAM uses binning on the match to the reference seq), or does it matter?
BAM indexing as done in samtools/picard is only for the aligned reads - so no help for a BAM file of unaligned reads. You could use a different indexing system (e.g. by read name) and the same BAM BGZF block offset system (I've tried this as an experiment with Biopython's SQLite indexing of sequence files).
However, for tasks taking unaligned reads as input, you generally just iterate over the reads in the order on disk.
I recall hdf5 was planned as an alternate format (PacBio uses it, IIRC), and of course there is NCBI's .sra format. Anyone using the latter two?
Moving from the custom BGZF modified gzip format used in BAM to HD5 has been proposed on the samtools mailing list (as Chris knows), and there is a proof of principle implementation too in BioHDF, http://www.hdfgroup.org/projects/biohdf/ The SAM/BAM group didn't seem overly enthusiastic though.
HDF5 sounds really great, though I don't think PacBio has the data volume to tax it the way Illumina does. There was some speculation that HDF5 would be underneath a new BAM standard, but I don't know the status of that. We did a few experiments in house with BioHDF in its infancy to see how it compared to BAM and it didn't capture all of the data (it was missing the somewhat critical cigar strings at this time) ...and haven't revisited it since then. I'm sure it would be effective in storing reads, but starting your own standard when Illumina makes BAMs will probably not ultimately be as useful as going with BAM format.
For the NCBI's .sra format, there is no open specification, just their public domain source code: http://seqanswers.com/forums/showthread.php?t=12054
This standard is ...complex, with the associated down-sides. We only convert things into that format if explicitly required to do so.
Regards,
Peter
Best of luck, Scott -- Scott Smith Manager, Application Programming and Development Analysis Pipeline The Genome Institute Washington University School of Medicine