From: galaxy-dev-bounces(a)lists.bx.psu.edu [mailto:galaxy-dev-
bounces(a)lists.bx.psu.edu] On Behalf Of Peter Cock
Sent: Friday, 22 November 2013 3:51 a.m.
To: Nate Coraor
Subject: Re: [galaxy-dev] Compressed data files in Galaxy? (e.g. GZIP or
On Thu, Nov 21, 2013 at 2:27 PM, Peter Cock <p.j.a.cock(a)googlemail.com> wrote:
> On Wed, May 16, 2012 at 9:39 PM, Nate Coraor <nate(a)bx.psu.edu> wrote:
>> On May 16, 2012, at 9:47 AM, Peter Cock wrote:
>>> Hello all,
>>> What is the current status in Galaxy for supporting compressed files?
>> Hi Peter,
>> Unfortunately, there's been nothing done on this so far. I'd love to
>> see it happen, but it hasn't been on the top of our priorities.
> Hi Nate,
> Where does this sit on the Galaxy team's priorities now, 18 months on?
> I think I asked about this at GCC2013, and it was seen as
> important but not yet at the top of the priorities list.
Nate's reply on Twitter explained that the public Galaxy Instance (formerly
hosted at Penn State, now in Texas) uses transparent compression at the file
system level with ZFS - so Galaxy doesn't need to compress individual files.
Our approach has been to integrate compress/un-compress handling with the
job splitter / cluster launch layer that we've been working on (the current
incarnation is https://bitbucket.org/agr-bifo/tardis).
The tardis approach is to sniff the input(s) and, if necessary, insert an
uncompressor into the input stream - the original data is left compressed
in place, and only the splits of the data are uncompressed. Each uncompressed
data chunk is dispatched for processing on the cluster as soon as it becomes
available from the decompressed input stream. This can be a significant
performance advantage compared with uncompressing the entire input file
before splitting it, which can be slow and has a larger disk footprint.
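To illustrate the general idea (this is a minimal sketch, not the actual tardis code - the function names `open_maybe_compressed` and `chunks` are hypothetical), one can sniff the gzip magic bytes and yield uncompressed chunks as they stream out, rather than decompressing the whole file first:

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"

def open_maybe_compressed(path):
    """Sniff the first two bytes and only wrap the stream in a
    decompressor if the file is actually gzipped; the original
    file is left compressed in place either way."""
    handle = open(path, "rb")
    magic = handle.read(2)
    if magic == GZIP_MAGIC:
        handle.close()
        return gzip.open(path, "rt")
    handle.seek(0)
    return io.TextIOWrapper(handle)

def chunks(path, lines_per_chunk=4000):
    """Yield uncompressed chunks of the input as soon as each one
    has been read, so every chunk can be dispatched to the cluster
    without waiting for the whole file to be decompressed."""
    chunk = []
    with open_maybe_compressed(path) as stream:
        for line in stream:
            chunk.append(line)
            if len(chunk) == lines_per_chunk:
                yield "".join(chunk)
                chunk = []
    if chunk:
        yield "".join(chunk)
```

A real splitter would of course respect record boundaries (e.g. 4-line FASTQ records) rather than raw line counts.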
We find that job splitting and compression handling (and potentially other
low-level data file transforms - e.g. handling list files, random sampling
of input) are interdependent. They are perhaps best encapsulated in a lower
layer, which would avoid cluttering Galaxy's high-level bio-oriented
data type ontologies?
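As an example of why such transforms belong in a single-pass streaming layer: random sampling can be done with reservoir sampling, which needs only one pass over a stream of unknown length and so composes naturally with on-the-fly decompression (this is a generic sketch, not tardis code; `reservoir_sample` is a hypothetical name):

```python
import random

def reservoir_sample(lines, k, seed=None):
    """Draw a uniform random sample of k items from a stream of
    unknown length in a single pass (Algorithm R, reservoir
    sampling), so it can sit directly on a decompressing stream."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            # Replace an existing element with decreasing probability
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir
```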
Currently (as per my earlier post) we are integrating this approach into Galaxy
by modifying selected tool config files; longer term, we think it could be
possible to slot this low-level data transform and task splitting layer into
the core Galaxy stack.
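For anyone unfamiliar with the tool config approach, the general pattern is to prefix the tool's <command> with the wrapper so it intercepts the inputs and outputs. The snippet below is purely illustrative - the tool id and the exact tardis invocation syntax are invented here, not taken from the actual agr-bifo configs:

```xml
<tool id="example_aligner" name="Example aligner">
  <!-- Hypothetical modification: the original command line is
       prefixed with the tardis wrapper, which sniffs/splits the
       inputs and merges the outputs before Galaxy sees the result.
       See https://bitbucket.org/agr-bifo/tardis for real usage. -->
  <command>tardis example_aligner $input_file $output_file</command>
</tool>
```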
John Chilton kindly set up a Trello card on the general topic
of task splitting - this is at
(Apologies if this repeats some of my earlier post - the point here is that,
in practice, we find that compress/un-compress and task splitting are
interdependent.)
(Having said the above - encapsulating data transforms such as
compress/un-compress even lower down the stack, in the file system itself,
as per the public Galaxy instance (ZFS), could be pretty hard to beat!)