The use of (unaligned) BAM for readgroups seems like a good idea. At the very least it prevents inconsistently hacking this information into the FASTQ descriptor (a common problem with any simple format).

chris

On Sep 8, 2011, at 1:35 PM, Edward Kirton wrote:
copied from another thread:
On Thu, Sep 8, 2011 at 7:30 AM, Anton Nekrutenko <anton@bx.psu.edu> wrote:

What we are thinking of lately is switching to unaligned BAM for everything. One of the benefits here is the ability to add readgroups from day 1, simplifying multisample analyses down the road.
This seems to be the simplest solution; I like it a lot. Really, only the reads need to be compressed; most other output files are tiny by comparison, so a more general solution may be overkill. And if compression of everything is desired, ZFS works well -- another of our sites (LANL) uses this and recommended it to me too. I just haven't been able to convince my own IT people to go this route, for technical reasons beyond my attention span.
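(For reference, producing an unaligned BAM with the read group attached from day 1 is straightforward with pysam -- Picard's FastqToSam does the same job. A minimal sketch; the read group ID, sample name, and file names are placeholders for illustration, not anything Galaxy-specific:

    import pysam

    # Read group declared up front in the header of an *unaligned* BAM.
    header = {
        "HD": {"VN": "1.6", "SO": "unsorted"},
        "RG": [{"ID": "lane1", "SM": "sampleA", "PL": "ILLUMINA", "LB": "libA"}],
    }

    with pysam.AlignmentFile("unaligned.bam", "wb", header=header) as bam:
        read = pysam.AlignedSegment()
        read.query_name = "read_0001"
        read.query_sequence = "ACGTACGTACGT"
        read.query_qualities = pysam.qualitystring_to_array("IIIIIIIIIIII")
        read.flag = 4                # unmapped
        read.reference_id = -1       # no reference sequences in an unaligned BAM
        read.reference_start = -1
        read.set_tag("RG", "lane1")  # ties the read to its read group
        bam.write(read)

Downstream multisample tools can then pick up the RG/SM fields directly instead of relying on ad hoc FASTQ descriptors.)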
On Tue, Sep 6, 2011 at 9:05 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

On Tue, Sep 6, 2011 at 5:00 PM, Nate Coraor <nate@bx.psu.edu> wrote:
Peter Cock wrote:
On Tue, Sep 6, 2011 at 3:24 PM, Nate Coraor <nate@bx.psu.edu> wrote:
Ideally, there'd just be a column on the dataset table indicating whether the dataset is compressed or not, and then tools get a new way to indicate whether they can directly read compressed inputs, or whether the input needs to be decompressed first.
--nate
Yes, that's what I was envisioning Nate.
Are there any schemes other than gzip which would make sense? Perhaps rather than a boolean column (compressed or not), it should specify the kind of compression if any (e.g. gzip).
Makes sense.
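(Something like the following is one way that column might look, assuming a SQLAlchemy-style table definition; the table layout and column names here are illustrative, not the actual Galaxy schema:

    from sqlalchemy import Column, Integer, MetaData, String, Table

    metadata = MetaData()

    # Store the compression scheme itself rather than a yes/no flag, so new
    # schemes (gzip, bz2, ...) can be added later without another schema change.
    dataset_table = Table(
        "dataset", metadata,
        Column("id", Integer, primary_key=True),
        Column("external_filename", String(255)),
        # NULL means uncompressed; otherwise e.g. "gzip"
        Column("compression_type", String(32), nullable=True),
    )

)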
We need something which balances compression efficiency (size) with decompression speed, while also being widely supported in libraries for maximum tool uptake.
Yes, and there's a side effect of allowing this: you may decrease efficiency if the tools used downstream all require decompression, and you waste a bunch of time decompressing the dataset multiple times.
While decompression costs CPU time and slows things down, a compressed dataset means less data I/O from disk (which may be network mounted), which speeds things up. So overall, depending on the setup and the task at hand, it could come out faster.
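(A rough illustration of the tool-side half of that trade-off: a wrapper that sniffs the gzip magic bytes and streams the dataset in place, so no temporary decompressed copy is ever written. The helper name is made up and only gzip is handled here:

    import gzip

    GZIP_MAGIC = b"\x1f\x8b"

    def open_maybe_gzipped(path):
        """Open a text dataset, transparently decompressing gzip on the fly."""
        with open(path, "rb") as handle:
            magic = handle.read(2)
        if magic == GZIP_MAGIC:
            return gzip.open(path, "rt")
        return open(path, "r")

    # e.g. count records in a FASTQ that may or may not be gzipped
    with open_maybe_gzipped("reads.fastq.gz") as handle:
        n_reads = sum(1 for _ in handle) // 4

Whether streaming decompression beats reading an uncompressed copy then depends on CPU versus disk/network bandwidth, as discussed above.)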
Is it time to file an issue on bitbucket to track this potential enhancement?
Peter