Peter Cock wrote:
> On Tue, Sep 6, 2011 at 5:00 PM, Nate Coraor <nate@bx.psu.edu> wrote:
>> Peter Cock wrote:
>>> On Tue, Sep 6, 2011 at 3:24 PM, Nate Coraor <nate@bx.psu.edu> wrote:
>>>> Ideally, there'd just be a column on the dataset table indicating
>>>> whether the dataset is compressed or not, and then tools get a new
>>>> way to indicate whether they can directly read compressed inputs, or
>>>> whether the input needs to be decompressed first.
>>>>
>>>> --nate
>>> Yes, that's what I was envisioning, Nate.
>>>
>>> Are there any schemes other than gzip which would make sense? Perhaps
>>> rather than a boolean column (compressed or not), it should specify
>>> the kind of compression, if any (e.g. gzip).
>>
>> Makes sense.
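
For concreteness, the column-plus-tool-flag idea might look roughly like
this. It is a hypothetical sketch with invented names (compression_type,
TOOL_ACCEPTED_COMPRESSION), not actual Galaxy model or tool code:

    from sqlalchemy import Column, Integer, MetaData, String, Table

    metadata = MetaData()

    # Rather than a boolean "compressed" flag, record the compression
    # scheme itself; NULL/None means the dataset is uncompressed.
    dataset = Table(
        "dataset",
        metadata,
        Column("id", Integer, primary_key=True),
        Column("compression_type", String(32), nullable=True),
    )

    # Each tool would declare which schemes it can read directly; any
    # other input would be decompressed before the tool runs.
    TOOL_ACCEPTED_COMPRESSION = {"gzip"}  # hypothetical per-tool metadata

    def needs_decompression(compression_type):
        """True if this input must be decompressed before the tool runs."""
        return (compression_type is not None
                and compression_type not in TOOL_ACCEPTED_COMPRESSION)
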
>>> We need something which balances compression efficiency (size) with
>>> decompression speed, while also being widely supported in libraries
>>> for maximum tool uptake.
>>
>> Yes, and there's a side effect of allowing this: you may decrease
>> efficiency if the tools used downstream all require decompression, and
>> you waste a bunch of time decompressing the dataset multiple times.
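
On the "widely supported" point: gzip and bzip2 readers both ship in the
Python standard library, so a tool could open either transparently by
sniffing magic bytes. A minimal sketch (hypothetical helper, not Galaxy
code):

    import bz2
    import gzip

    # Magic bytes for the compression schemes under discussion.
    _OPENERS = {
        b"\x1f\x8b": gzip.open,  # gzip
        b"BZh": bz2.BZ2File,     # bzip2
    }

    def open_maybe_compressed(path):
        """Open a file for reading, transparently decompressing if needed."""
        with open(path, "rb") as handle:
            head = handle.read(3)
        for magic, opener in _OPENERS.items():
            if head.startswith(magic):
                return opener(path, "rb")
        return open(path, "rb")
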
> While decompression wastes CPU time and makes things slower, reading a
> compressed dataset means less data IO from disk (which may be network
> mounted), which makes things faster. So overall, depending on the setup
> and the task at hand, keeping datasets compressed could be faster.
>
> Is it time to file an issue on bitbucket to track this potential
> enhancement?
Sure.
> Peter
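
Whether the reduced IO outweighs the decompression CPU cost will vary by
site, but it is straightforward to measure. A minimal timing sketch (the
file names are placeholders):

    import gzip
    import time

    def time_full_read(open_func, path):
        """Time a full sequential read, the common access pattern for tools."""
        start = time.time()
        with open_func(path, "rb") as handle:
            while handle.read(1024 * 1024):
                pass
        return time.time() - start

    # Compare, for example:
    #   time_full_read(open, "reads.fastq")          # uncompressed
    #   time_full_read(gzip.open, "reads.fastq.gz")  # gzipped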