Re: [galaxy-dev] Compressed data files in Galaxy? (e.g. GZIP or BGZF)

26 Nov 2013

      ...
-----Original Message-----
From: galaxy-dev-bounces@lists.bx.psu.edu [mailto:galaxy-dev-bounces@lists.bx.psu.edu] On Behalf Of Peter Cock
Sent: Friday, 22 November 2013 3:51 a.m.
To: Nate Coraor
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] Compressed data files in Galaxy? (e.g. GZIP or BGZF)
On Thu, Nov 21, 2013 at 2:27 PM, Peter Cock <p.j.a.cock@googlemail.com>
wrote:
...
On Wed, May 16, 2012 at 9:39 PM, Nate Coraor <nate@bx.psu.edu> wrote:
...
On May 16, 2012, at 9:47 AM, Peter Cock wrote:
...
Hello all,
What is the current status in Galaxy for supporting compressed files?
Hi Peter,
Unfortunately, there's been nothing done on this so far.  I'd love to
see it happen, but it hasn't been on the top of our priorities.
--nate
Hi Nate,
Where does this sit on the Galaxy team's priorities now, 18 months on?
I think I asked about this at GCC2013, and it was seen as
important but not yet at the top of the priorities list.
Nate's reply on Twitter explained that the public Galaxy Instance (formerly
hosted at Penn State, now in Texas) uses transparent compression at the file
system level with ZFS - so Galaxy doesn't need to compress individual files.
Neat:
https://twitter.com/natefoo/status/403531922514522112
Peter
Our approach has been to integrate compress / un-compress with the
job splitter / cluster launch layer that we've been working on (the current
incarnation is https://bitbucket.org/agr-bifo/tardis ).

The tardis approach is to sniff the input(s) and, if necessary, insert an
uncompressor into the input stream. The original data is left compressed
in place, and only the splits of the data are uncompressed. Each uncompressed
data chunk is launched for processing on the cluster as soon as it becomes
available from the uncompressed input stream. This can be quite a big
performance advantage compared with uncompressing the entire input file
before splitting it, which can be quite slow and has a bigger disk footprint.
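
As a rough illustration of the idea (not the actual tardis code), the
sniff-and-stream step could be sketched in Python along the following lines.
The function names and the lines_per_chunk parameter are hypothetical, and
the sketch assumes gzip (or BGZF, which is gzip-compatible) input split on
line boundaries:

```python
import gzip
import itertools

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip (and BGZF) file


def open_maybe_compressed(path):
    """Sniff the file and insert a gzip decompressor into the stream if needed.

    Hypothetical helper for illustration; not part of tardis or Galaxy.
    """
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def iter_chunks(path, lines_per_chunk):
    """Lazily yield lists of lines read from the (possibly compressed) input.

    The file on disk stays compressed; only the current chunk is held
    uncompressed, so a caller can dispatch it as soon as it is yielded.
    """
    with open_maybe_compressed(path) as stream:
        while True:
            chunk = list(itertools.islice(stream, lines_per_chunk))
            if not chunk:
                return
            yield chunk
```

A caller would loop over iter_chunks(...) and submit each chunk to the
cluster as it arrives, rather than waiting for the whole file to be
uncompressed. For record-based formats the chunk size would need to respect
record boundaries (e.g. a multiple of 4 lines for FASTQ).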

We find that job splitting, compression handling (and potentially other
low-level data file transforms, e.g. handling list files or random sampling
of input) are interdependent. Potentially they are best encapsulated in a
lower layer, which would avoid cluttering Galaxy's high-level, bio-oriented
data type ontologies?

Currently (as per my earlier post) we are integrating this approach into
Galaxy by modifying selected tool config files. Longer term, we think it
could be possible to slot this low-level data transform and task splitting
layer into the core Galaxy stack.

John Chilton kindly set up a Trello card on the general topic
of task splitting - this is at

https://trello.com/c/H87LotF7

(Apologies if this repeats some of my earlier post - the point here is that,
in practice, we find e.g. compress/un-compress and task splitting to be
interdependent.)

(Having said the above, encapsulating data transforms such as
compress/un-compress even lower down the stack, in the file system itself,
as on the public Galaxy instance (ZFS), could be pretty hard to beat!)


McCulloch, Alan