Peter Cock wrote:
On Tue, Sep 6, 2011 at 3:24 PM, Nate Coraor <nate@bx.psu.edu> wrote:
Edward Kirton wrote:
Peter wrote:
I wonder if Galaxy would benefit from a new fastqsanger-gzip (etc.) datatype? However, this seems generally useful (not just for FASTQ), so perhaps a more general mechanism would be better, where tool XML files can say which file types they accept and which of those can/must be compressed (possibly not just gzip format?).
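(For illustration: whichever form the mechanism takes, something in the framework would have to recognize compressed files in the first place. Below is a minimal sketch of magic-byte sniffing using only the Python standard library; the helper name is made up and is not an existing Galaxy API.)

    # Hypothetical helper: identify the compression scheme of a file by
    # its leading magic bytes; returns None for uncompressed files.
    MAGIC_NUMBERS = {
        b"\x1f\x8b": "gzip",
        b"BZh": "bzip2",
        b"\xfd7zXZ\x00": "xz",
    }

    def sniff_compression(path):
        with open(path, "rb") as handle:
            head = handle.read(6)
        for magic, scheme in MAGIC_NUMBERS.items():
            if head.startswith(magic):
                return scheme
        return None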
Perhaps we can flesh out what more general solutions would look like...
Imagine the FASTQ datatypes were left alone and instead there's a mechanism by which files that haven't been used as input for x days get compressed by a cron job. The file server knows how to decompress such files on the fly when needed. For the most part, files are uncompressed during active analysis and compressed when they exist as an archive within Galaxy.
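(A rough sketch of what such a cron job could look like; the dataset directory and age threshold are placeholders, and updating Galaxy's metadata to record the new compressed path is deliberately omitted. That gap is exactly what a compression column on the dataset table, as suggested below, would close.)

    # Hypothetical maintenance script: gzip any dataset file that has not
    # been read for IDLE_DAYS days. Assumes atime is usable (no noatime).
    import gzip
    import os
    import shutil
    import time

    DATASET_DIR = "/galaxy/database/files"   # placeholder, not a real default
    IDLE_DAYS = 30

    def compress_idle_datasets():
        cutoff = time.time() - IDLE_DAYS * 86400
        for root, _dirs, names in os.walk(DATASET_DIR):
            for name in names:
                path = os.path.join(root, name)
                if name.endswith(".gz") or os.path.getatime(path) > cutoff:
                    continue
                with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
                    shutil.copyfileobj(src, dst)
                os.remove(path)

    if __name__ == "__main__":
        compress_idle_datasets()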
Ideally, there'd just be a column on the dataset table indicating whether the dataset is compressed or not, and then tools get a new way to indicate whether they can directly read compressed inputs, or whether the input needs to be decompressed first.
--nate
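(A sketch of how the job-preparation side might use such a column; the attribute names and the per-tool flag are invented here, not part of Galaxy's actual model.)

    # Hypothetical: the dataset row stores its compression scheme ("gzip"
    # or None), and the tool declares whether it can read gzip directly.
    import gzip
    import os
    import shutil
    import tempfile

    def path_for_tool(dataset_path, compression, tool_reads_gzip):
        """Return a path the tool can read, decompressing only if needed."""
        if compression is None or (compression == "gzip" and tool_reads_gzip):
            return dataset_path                  # hand the file over as-is
        if compression != "gzip":
            raise ValueError("unsupported compression scheme: %s" % compression)
        # The tool cannot read gzip: stage a decompressed temporary copy.
        fd, tmp_path = tempfile.mkstemp(suffix=".dat")
        with gzip.open(dataset_path, "rb") as src, os.fdopen(fd, "wb") as dst:
            shutil.copyfileobj(src, dst)
        return tmp_path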
Yes, that's what I was envisioning, Nate.
Are there any schemes other than gzip that would make sense? Perhaps rather than a boolean column (compressed or not), it should specify the kind of compression, if any (e.g. gzip).
Makes sense.
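(If the column stores a scheme name rather than a boolean, dispatching on it is straightforward; a sketch assuming Python 3's standard-library codecs, with the column values themselves being assumptions.)

    # Hypothetical dispatch table: compression scheme recorded in the
    # dataset table mapped to a function that opens the file for reading.
    import bz2
    import gzip
    import lzma

    OPENERS = {
        None: open,            # uncompressed
        "gzip": gzip.open,
        "bzip2": bz2.open,
        "xz": lzma.open,
    }

    def open_dataset(path, compression=None):
        try:
            opener = OPENERS[compression]
        except KeyError:
            raise ValueError("unknown compression scheme: %r" % compression)
        return opener(path, "rb")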
We need something that balances compression efficiency (size) with decompression speed, while also being widely supported in libraries for maximum tool uptake.
Yes, and there's a side effect of allowing this: you may decrease overall efficiency if the tools used downstream all require decompressed input, and you waste a bunch of time decompressing the same dataset multiple times.
--nate
Peter
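(For what it's worth, the size-versus-speed tradeoff is easy to measure on a sample file; a rough, standalone sketch follows, with the sample path a placeholder and nothing here tied to Galaxy itself.)

    # Rough benchmark: compressed size and decompression time per scheme.
    import bz2
    import gzip
    import lzma
    import time

    SAMPLE = "sample.fastq"   # placeholder path

    def measure(name, compress, decompress, data):
        blob = compress(data)
        start = time.time()
        decompress(blob)
        print("%-6s size=%d bytes, decompress=%.3fs"
              % (name, len(blob), time.time() - start))

    if __name__ == "__main__":
        with open(SAMPLE, "rb") as handle:
            data = handle.read()
        measure("gzip", gzip.compress, gzip.decompress, data)
        measure("bzip2", bz2.compress, bz2.decompress, data)
        measure("xz", lzma.compress, lzma.decompress, data)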