Peter Cock wrote:
> On Tue, Sep 6, 2011 at 5:00 PM, Nate Coraor <nate@bx.psu.edu> wrote:
>> Peter Cock wrote:
>>> On Tue, Sep 6, 2011 at 3:24 PM, Nate Coraor <nate@bx.psu.edu> wrote:
>>>> Ideally, there'd just be a column on the dataset table indicating
>>>> whether the dataset is compressed or not, and then tools get a new
>>>> way to indicate whether they can directly read compressed inputs, or
>>>> whether the input needs to be decompressed first.
>>>>
>>>> --nate
>>> Yes, that's what I was envisioning, Nate.
>>>
>>> Are there any schemes other than gzip which would make sense? Perhaps
>>> rather than a boolean column (compressed or not), it should specify
>>> the kind of compression, if any (e.g. gzip).
>>
>> Makes sense.
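
For concreteness, the column-plus-tool-flag idea might look roughly like
this. It is a hypothetical sketch with invented names (compression_type,
TOOL_ACCEPTED_COMPRESSION), not actual Galaxy model or tool code:

    from sqlalchemy import Column, Integer, MetaData, String, Table

    metadata = MetaData()

    # Rather than a boolean "compressed" flag, record the compression
    # scheme itself; NULL/None means the dataset is uncompressed.
    dataset = Table(
        "dataset",
        metadata,
        Column("id", Integer, primary_key=True),
        Column("compression_type", String(32), nullable=True),
    )

    # Each tool would declare which schemes it can read directly; any
    # other input would be decompressed before the tool runs.
    TOOL_ACCEPTED_COMPRESSION = {"gzip"}  # hypothetical per-tool metadata

    def needs_decompression(compression_type):
        """True if this input must be decompressed before the tool runs."""
        return (compression_type is not None
                and compression_type not in TOOL_ACCEPTED_COMPRESSION)
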
>>> We need something which balances compression efficiency (size) with
>>> decompression speed, while also being widely supported in libraries
>>> for maximum tool uptake.
>>
>> Yes, and there's a side effect of allowing this: you may decrease
>> efficiency if the tools used downstream all require decompression, and
>> you waste a bunch of time decompressing the dataset multiple times.
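
On the "widely supported" point: gzip and bzip2 readers both ship in the
Python standard library, so a tool could open either transparently by
sniffing magic bytes. A minimal sketch (hypothetical helper, not Galaxy
code):

    import bz2
    import gzip

    # Magic bytes for the compression schemes under discussion.
    _OPENERS = {
        b"\x1f\x8b": gzip.open,  # gzip
        b"BZh": bz2.BZ2File,     # bzip2
    }

    def open_maybe_compressed(path):
        """Open a file for reading, transparently decompressing if needed."""
        with open(path, "rb") as handle:
            head = handle.read(3)
        for magic, opener in _OPENERS.items():
            if head.startswith(magic):
                return opener(path, "rb")
        return open(path, "rb")
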
> While decompression wastes CPU time and makes things slower, reading a
> compressed dataset means less data IO from disk (which may be network
> mounted), which makes things faster. So overall, depending on the setup
> and the task at hand, keeping datasets compressed could be faster.
>
> Is it time to file an issue on bitbucket to track this potential
> enhancement?
Sure.
> Peter
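
Whether the reduced IO outweighs the decompression CPU cost will vary by
site, but it is straightforward to measure. A minimal timing sketch (the
file names are placeholders):

    import gzip
    import time

    def time_full_read(open_func, path):
        """Time a full sequential read, the common access pattern for tools."""
        start = time.time()
        with open_func(path, "rb") as handle:
            while handle.read(1024 * 1024):
                pass
        return time.time() - start

    # Compare, for example:
    #   time_full_read(open, "reads.fastq")          # uncompressed
    #   time_full_read(gzip.open, "reads.fastq.gz")  # gzipped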