On Saturday, September 3, 2011, Edward Kirton <eskirton@lbl.gov> wrote:
> of course there is a computational cost to compressing/uncompressing files but that's probably better than storing unnecessarily huge files. it's a trade-off.
It may still be faster due to less IO, probably depends on your hardware.
> since i'm rapidly running out of storage, i think the best immediate solution for me is to deprecate all the fastq datatypes in favor of a new fastqsangergz and to bundle the read qc tools to eliminate intermediate files. sure, users won't be able to play around with their data as much, but my disk is 88% full and my cluster has been 100% occupied for 2-months straight, so less choice is probably better.
In your position I agree that is a pragmatic choice. You might be able to modify the file upload code to gzip any FASTQ files... that would prevent uncompressed FASTQ getting into new histories.

I wonder if Galaxy would benefit from a new fastqsanger-gzip (etc) datatype? However, this seems generally useful (not just for FASTQ), so perhaps a more general mechanism would be better, where tool XML files can say which file types they accept and which of those can/must be compressed (possibly not just gzip format?).

Peter
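
To make the upload-time gzip idea concrete, here is a minimal sketch in plain Python. It is not Galaxy's upload code and uses no Galaxy API: the function names (is_gzipped, gzip_fastq, open_fastq) are hypothetical, and it assumes the upload step has a local file path it is free to replace. It relies only on the standard library gzip module and the two-byte gzip magic number.

```python
import gzip
import os
import shutil

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path):
    """Return True if the file at path starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def gzip_fastq(path):
    """Compress path to path + '.gz' and delete the original.

    Files that are already gzipped are left untouched.
    Returns the path of the compressed file.
    """
    if is_gzipped(path):
        return path
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large files are fine
    os.remove(path)
    return gz_path


def open_fastq(path):
    """Open a FASTQ file for text reading, gzipped or not."""
    if is_gzipped(path):
        return gzip.open(path, "rt")
    return open(path, "r")
```

A tool wrapper could then call open_fastq() without caring which form is on disk, which is roughly what a "can/must be compressed" declaration in the tool XML would have to enable.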