Copied from another thread:
What we are thinking of lately is switching to unaligned BAM for everything. One of the benefits here is the ability to add read groups from day 1, simplifying multisample analyses down the road.
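As a rough sketch of what that could look like in practice (assuming pysam; the file name, sample name, and header fields below are illustrative, not a fixed convention):

    import pysam

    # Header for an unaligned BAM: no @SQ lines, one @RG per sample.
    header = {
        "HD": {"VN": "1.6"},
        "RG": [{"ID": "sampleA", "SM": "sampleA", "PL": "ILLUMINA"}],
    }

    with pysam.AlignmentFile("reads.unaligned.bam", "wb", header=header) as out:
        rec = pysam.AlignedSegment()
        rec.query_name = "read_001"
        rec.query_sequence = "ACGTACGT"
        rec.flag = 4                                         # unmapped
        rec.query_qualities = pysam.qualitystring_to_array("IIIIIIII")
        rec.set_tag("RG", "sampleA")   # tag every read with its read group
        out.write(rec)

With the read group baked in from the start, downstream multisample merges can rely on the RG tag instead of reconstructing sample identity later.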
On Tue, Sep 6, 2011 at 5:00 PM, Nate Coraor <nate@bx.psu.edu> wrote:
> Peter Cock wrote:
>> On Tue, Sep 6, 2011 at 3:24 PM, Nate Coraor <nate@bx.psu.edu> wrote:
>> > Ideally, there'd just be a column on the dataset table indicating
>> > whether the dataset is compressed or not, and then tools get a new
>> > way to indicate whether they can directly read compressed inputs, or
>> > whether the input needs to be decompressed first.
>> >
>> > --nate
>>
>> Yes, that's what I was envisioning, Nate.
>>
>> Are there any schemes other than gzip which would make sense?
>> Perhaps rather than a boolean column (compressed or not), it
>> should specify the kind of compression if any (e.g. gzip).
>
> Makes sense.
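For illustration, that "kind of compression" column could be populated by sniffing magic bytes when a dataset lands on disk. A minimal sketch with a made-up helper name (the magic numbers themselves are the standard gzip and bzip2 signatures):

    MAGIC_NUMBERS = {
        b"\x1f\x8b": "gzip",
        b"BZh": "bzip2",
    }

    def sniff_compression(path):
        # Read just the first few bytes and match known signatures.
        with open(path, "rb") as handle:
            head = handle.read(4)
        for magic, name in MAGIC_NUMBERS.items():
            if head.startswith(magic):
                return name
        return None   # uncompressed, or an unrecognized scheme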
>
>> We need something which balances compression efficiency (size)
>> with decompression speed, while also being widely supported in
>> libraries for maximum tool uptake.
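One crude way to eyeball that trade-off locally is to time zlib at a few compression levels on a representative dataset; a throwaway sketch (the input filename is a placeholder):

    import time
    import zlib

    data = open("dataset.tabular", "rb").read()
    for level in (1, 6, 9):
        t0 = time.perf_counter()
        blob = zlib.compress(data, level)
        compress_s = time.perf_counter() - t0
        t0 = time.perf_counter()
        zlib.decompress(blob)
        decompress_s = time.perf_counter() - t0
        print(level, len(blob), round(compress_s, 3), round(decompress_s, 3))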
>
> Yes, and there's a side effect of allowing this: you may decrease
> efficiency if the tools used downstream all require decompression,
> and you waste a bunch of time decompressing the dataset multiple
> times.
While decompression wastes CPU time and makes things slower,
there is less data I/O from disk (which may be network mounted),
which makes things faster. So overall, depending on the setup
and the task at hand, it could be faster.
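To avoid paying that cost repeatedly, the framework could decompress once and cache the plain copy for any later tools that need it. A minimal sketch, assuming gzip-only datasets and a hypothetical helper name:

    import gzip
    import os
    import shutil

    def path_for_tool(dataset_path, compression, tool_reads_compressed):
        # If the data is plain, or the tool can read the compressed
        # form directly, hand over the stored file untouched.
        if compression is None or tool_reads_compressed:
            return dataset_path
        # Otherwise decompress once and reuse the cached copy, so
        # several downstream tools don't each pay the cost.
        plain_path = dataset_path + ".plain"
        if not os.path.exists(plain_path):
            with gzip.open(dataset_path, "rb") as src, \
                 open(plain_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
        return plain_path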
Is it time to file an issue on Bitbucket to track this potential
enhancement?
Peter