On Thu, Sep 1, 2011 at 11:02 PM, Edward Kirton <eskirton@lbl.gov> wrote:
Read QC intermediate files account for most of the storage used on our Galaxy site, and it's a real problem that I must solve soon. My first attempt at taming the beast was to create a single read QC tool that handles the basics: quality-encoding conversion, quality-end trimming, and so on. Such a tool could simply be a wrapper around your favorite existing tools, but it wouldn't keep the intermediate files. The added benefit is that it runs faster, because it only has to queue onto the cluster once. Sure, one might argue that it's nice to have all the intermediate files in case you wish to review them, but in practice I have found this happens relatively infrequently and is too expensive. If you're a small lab, maybe that's fine, but if you generate a lot of sequence, a more production-line approach is reasonable.
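As a rough illustration of the single-pass idea (a sketch only, not the actual tool described above; it assumes Illumina Phred+64 input and an arbitrary quality cutoff of 20), the encoding conversion and quality-end trimming can be done in one streaming pass from stdin to stdout, so nothing intermediate ever hits disk and the job queues on the cluster once:

    #!/usr/bin/env python
    # Sketch: single-pass read QC -- convert Phred+64 qualities to Sanger
    # (Phred+33) and trim the low-quality 3' tail, streaming FASTQ from
    # stdin to stdout so no intermediate files are written.
    import sys

    QUAL_CUTOFF = 20  # assumed trimming threshold

    def records(handle):
        # Yield (title, sequence, quality) tuples from a FASTQ stream.
        while True:
            title = handle.readline().rstrip()
            if not title:
                return
            seq = handle.readline().rstrip()
            handle.readline()  # the '+' separator line
            qual = handle.readline().rstrip()
            yield title[1:], seq, qual

    for title, seq, qual in records(sys.stdin):
        # Re-encode Phred+64 qualities as Sanger Phred+33.
        qual = "".join(chr(ord(c) - 31) for c in qual)
        # Trim from the 3' end until a base at or above the cutoff is found.
        keep = len(seq)
        while keep and ord(qual[keep - 1]) - 33 < QUAL_CUTOFF:
            keep -= 1
        if keep:
            sys.stdout.write("@%s\n%s\n+\n%s\n" % (title, seq[:keep], qual[:keep]))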
Sounds very sensible if you have some frequently repeated multistep analyses.
I've been toying with the idea of replacing all the fastq datatypes with a single fastq datatype that is Sanger-encoded and gzipped. I think gzipped read files are about 1/4 the size of the unpacked version. Of course, many tools will require a wrapper if they don't accept gzipped input, but that's trivial (and many already support compressed reads). However, the import tool automatically uncompresses uploaded files, so I'd need to do some hacking there to prevent this.
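The wrapper really can be trivial; a sketch (the command "some_qc_tool --stdin" is a placeholder for any tool that reads plain FASTQ on stdin) that decompresses on the fly, so no uncompressed copy ever lands on disk:

    #!/usr/bin/env python
    # Sketch: feed gzipped FASTQ to a tool that only reads plain FASTQ on
    # stdin, decompressing on the fly so no uncompressed copy hits disk.
    import gzip
    import shutil
    import subprocess
    import sys

    infile, outfile = sys.argv[1], sys.argv[2]

    with gzip.open(infile, "rb") as reads, open(outfile, "wb") as out:
        proc = subprocess.Popen(["some_qc_tool", "--stdin"],  # placeholder command
                                stdin=subprocess.PIPE, stdout=out)
        shutil.copyfileobj(reads, proc.stdin)
        proc.stdin.close()
        proc.wait()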
Hmm. Probably there are some tasks where a gzip'd FASTQ isn't ideal, but for the fairly typical case of iterating over the records it should be fine.
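For the record-by-record case, the only cost is sequential decompression; for instance with Biopython (just a sketch, the filename and the counting are illustrative):

    # Sketch: iterate over a gzipped Sanger FASTQ record by record.
    import gzip
    from Bio import SeqIO

    with gzip.open("reads.fastq.gz", "rt") as handle:  # example filename
        count = bases = 0
        for record in SeqIO.parse(handle, "fastq"):
            count += 1
            bases += len(record)
        print("%i reads, %i bases" % (count, bases))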
Heck, what we really need is a nice compact binary format for reads, perhaps one which doesn't even store IDs (although pairing would need to be recorded). Thoughts?
What, like a BAM file of unaligned reads? Uses gzip compression, and tracks the pairing information explicitly :) Some tools will already take this as an input format, but not all.

Peter
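For what it's worth, a sketch of writing paired, unaligned reads to a BAM with pysam (not from the thread; the toy reads and filename are placeholders). The flag field records the pairing, and BAM's BGZF blocks give gzip-level compression:

    # Sketch: store paired, unaligned reads in a BAM ('uBAM') with pysam.
    import pysam

    # Toy read pairs (placeholders); in practice these would come from FASTQ.
    read_pairs = [("pair1", ("ACGT", "IIII"), ("TTGA", "IIII"))]

    # Header with no reference sequences, since nothing is aligned.
    header = {"HD": {"VN": "1.4", "SO": "unsorted"}}

    with pysam.AlignmentFile("unaligned.bam", "wb", header=header) as bam:
        for name, read1, read2 in read_pairs:
            # 77 = paired + read unmapped + mate unmapped + first in pair;
            # 141 is the same with 'second in pair' instead of 'first'.
            for (seq, qual), flag in ((read1, 77), (read2, 141)):
                a = pysam.AlignedSegment()
                a.query_name = name
                a.flag = flag
                a.query_sequence = seq
                a.query_qualities = pysam.qualitystring_to_array(qual)
                a.reference_id = -1    # unaligned, so no reference
                a.reference_start = -1
                bam.write(a)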