On Thu, Oct 6, 2011 at 4:48 AM, Duddy, John firstname.lastname@example.org wrote:
One of the things we’re facing is the sheer size of a whole human genome at 30x coverage. An effective way to deal with that is by compressing the FASTQ files. That works for BWA and our ELAND, which can directly read a compressed FASTQ, but other tools crash when reading compressed FASTQ files. One way to address that would be to introduce a new type, for example “CompressedFastQ”, with a conversion to FASTQ defined. BWA could take both types as input. This would allow the best of both worlds – efficient storage and use by all existing tools.
We'd discussed this and a more general approach where any file could be gzipped, but the code to do that doesn't exist yet: http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-September/006745.html
Issue filed: https://bitbucket.org/galaxy/galaxy-central/issue/666/
That seems a better long-term solution than the pragmatic short-term solution of fastqsanger-gzip (or whatever it gets called). Note that it sounded like Edward Kirton might already be using this, so whatever datatype name you pick should be consistent with his.
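To make the "any file could be gzipped" idea concrete, here is a rough Python sketch of how a tool wrapper might open a FASTQ file without caring whether it is compressed. This is not existing Galaxy code: the function names open_maybe_gzipped and count_fastq_records are made up for illustration, and the record count assumes the common four-lines-per-record FASTQ layout (no wrapped sequence lines).

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzipped(path):
    """Open path for text reading, decompressing if it looks gzipped.

    Hypothetical helper: detection is by magic bytes, not file extension,
    so it works regardless of how the dataset file is named on disk.
    """
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "r")


def count_fastq_records(path):
    """Count records, assuming exactly four lines per FASTQ record."""
    with open_maybe_gzipped(path) as handle:
        return sum(1 for _ in handle) // 4
```

Sniffing the content rather than trusting a datatype label means the same wrapper would work for both the short-term (separate gzipped datatype) and long-term (any file may be gzipped) approaches.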
The other strong idea from that thread was moving from FASTQ to unaligned BAM, which is gzip compressed and has explicit support for paired-end reads, read groups, etc.