On Thu, Sep 1, 2011 at 11:02 PM, Edward Kirton <eskirton@lbl.gov> wrote:
Read QC intermediate files account for most of the storage used on our Galaxy site, and it's a real problem that I must solve soon. My first attempt at taming the beast was to create a single read QC tool that handles the basics: quality-encoding conversion, quality-end trimming, and so on. Such a tool could simply be a wrapper around your favorite existing tools, but it wouldn't keep the intermediate files. The added benefit is that it runs faster, because it only has to queue onto the cluster once. Sure, one might argue that it's nice to have all the intermediate files in case you wish to review them, but in practice I have found this happens relatively infrequently and is too expensive. If you're a small lab, maybe that's fine, but if you generate a lot of sequence, a more production-line approach is reasonable.
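As a rough illustration of the single-pass idea (a sketch only, not the actual tool described above; it assumes Illumina Phred+64 input and an arbitrary quality cutoff of 20), the encoding conversion and quality-end trimming can be done in one streaming pass from stdin to stdout, so nothing intermediate ever hits disk and the job queues on the cluster once:

    #!/usr/bin/env python
    # Sketch: single-pass read QC -- convert Phred+64 qualities to Sanger
    # (Phred+33) and trim the low-quality 3' tail, streaming FASTQ from
    # stdin to stdout so no intermediate files are written.
    import sys

    QUAL_CUTOFF = 20  # assumed trimming threshold

    def records(handle):
        # Yield (title, sequence, quality) tuples from a FASTQ stream.
        while True:
            title = handle.readline().rstrip()
            if not title:
                return
            seq = handle.readline().rstrip()
            handle.readline()  # the '+' separator line
            qual = handle.readline().rstrip()
            yield title[1:], seq, qual

    for title, seq, qual in records(sys.stdin):
        # Re-encode Phred+64 qualities as Sanger Phred+33.
        qual = "".join(chr(ord(c) - 31) for c in qual)
        # Trim from the 3' end until a base at or above the cutoff is found.
        keep = len(seq)
        while keep and ord(qual[keep - 1]) - 33 < QUAL_CUTOFF:
            keep -= 1
        if keep:
            sys.stdout.write("@%s\n%s\n+\n%s\n" % (title, seq[:keep], qual[:keep]))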
Sounds very sensible if you have some frequently repeated multistep analyses.
I've been toying with the idea of replacing all the fastq datatypes with a single fastq datatype that is Sanger-encoded and gzipped. I think gzipped read files are about 1/4 the size of the unpacked version. Of course, many tools will require a wrapper if they don't accept gzipped input, but that's trivial (and many already support compressed reads). However, the import tool automatically uncompresses uploaded files, so I'd need to do some hacking there to prevent this.
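The wrapper really can be trivial; a sketch (the command "some_qc_tool --stdin" is a placeholder for any tool that reads plain FASTQ on stdin) that decompresses on the fly, so no uncompressed copy ever lands on disk:

    #!/usr/bin/env python
    # Sketch: feed gzipped FASTQ to a tool that only reads plain FASTQ on
    # stdin, decompressing on the fly so no uncompressed copy hits disk.
    import gzip
    import shutil
    import subprocess
    import sys

    infile, outfile = sys.argv[1], sys.argv[2]

    with gzip.open(infile, "rb") as reads, open(outfile, "wb") as out:
        proc = subprocess.Popen(["some_qc_tool", "--stdin"],  # placeholder command
                                stdin=subprocess.PIPE, stdout=out)
        shutil.copyfileobj(reads, proc.stdin)
        proc.stdin.close()
        proc.wait()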
Hmm. Probably there are some tasks where a gzip'd FASTQ isn't ideal, but for the fairly typical case of iterating over the records it should be fine.
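For the record-by-record case, the only cost is sequential decompression; for instance with Biopython (just a sketch, the filename and the counting are illustrative):

    # Sketch: iterate over a gzipped Sanger FASTQ record by record.
    import gzip
    from Bio import SeqIO

    with gzip.open("reads.fastq.gz", "rt") as handle:  # example filename
        count = bases = 0
        for record in SeqIO.parse(handle, "fastq"):
            count += 1
            bases += len(record)
        print("%i reads, %i bases" % (count, bases))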
Heck, what we really need is a nice compact binary format for reads, perhaps one which doesn't even store IDs (although pairing would need to be recorded). Thoughts?
What, like a BAM file of unaligned reads? Uses gzip compression, and tracks the pairing information explicitly :) Some tools will already take this as an input format, but not all.

Peter
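For what it's worth, a sketch of writing paired, unaligned reads to a BAM with pysam (not from the thread; the toy reads and filename are placeholders). The flag field records the pairing, and BAM's BGZF blocks give gzip-level compression:

    # Sketch: store paired, unaligned reads in a BAM ('uBAM') with pysam.
    import pysam

    # Toy read pairs (placeholders); in practice these would come from FASTQ.
    read_pairs = [("pair1", ("ACGT", "IIII"), ("TTGA", "IIII"))]

    # Header with no reference sequences, since nothing is aligned.
    header = {"HD": {"VN": "1.4", "SO": "unsorted"}}

    with pysam.AlignmentFile("unaligned.bam", "wb", header=header) as bam:
        for name, read1, read2 in read_pairs:
            # 77 = paired + read unmapped + mate unmapped + first in pair;
            # 141 is the same with 'second in pair' instead of 'first'.
            for (seq, qual), flag in ((read1, 77), (read2, 141)):
                a = pysam.AlignedSegment()
                a.query_name = name
                a.flag = flag
                a.query_sequence = seq
                a.query_qualities = pysam.qualitystring_to_array(qual)
                a.reference_id = -1    # unaligned, so no reference
                a.reference_start = -1
                bam.write(a)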