-----Original Message-----
From: firstname.lastname@example.org [mailto:galaxy-dev-email@example.com] On Behalf Of Peter Cock
Sent: Friday, 22 November 2013 3:51 a.m.
To: Nate Coraor
Cc: firstname.lastname@example.org
Subject: Re: [galaxy-dev] Compressed data files in Galaxy? (e.g. GZIP or BGZF)
On Thu, Nov 21, 2013 at 2:27 PM, Peter Cock email@example.com wrote:
On Wed, May 16, 2012 at 9:39 PM, Nate Coraor firstname.lastname@example.org wrote:
On May 16, 2012, at 9:47 AM, Peter Cock wrote:
What is the current status in Galaxy for supporting compressed files?
Unfortunately, there's been nothing done on this so far. I'd love to see it happen, but it hasn't been on the top of our priorities.
Where does this sit on the Galaxy team's priorities now, 18 months on? I think I asked about this at GCC2013, and it was seen as important but not yet at the top of the priorities list.
Nate's reply on Twitter explained that the public Galaxy instance (formerly hosted at Penn State, now in Texas) uses transparent compression at the file-system level via ZFS, so Galaxy doesn't need to compress individual files. Neat: https://twitter.com/natefoo/status/403531922514522112
Our approach has been to integrate compression/decompression with the job-splitting / cluster-launch layer we've been working on (the current incarnation is https://bitbucket.org/agr-bifo/tardis ).
The tardis approach is to sniff the input(s) and, if necessary, insert a decompressor into the input stream: the original data is left compressed in place, and only the splits of the data are uncompressed. Each uncompressed chunk is launched for processing on the cluster as soon as it becomes available from the decompressing input stream. This can be a big performance advantage compared with uncompressing the entire input file before splitting it, which can be quite slow and has a larger disk footprint.
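To make the idea concrete, here is a minimal Python sketch (not the actual tardis code) of that pattern: sniff the file's magic bytes, insert a decompressor only when needed, and yield uncompressed chunks as they become available, so the original stays compressed on disk and no full uncompressed copy is ever written. The chunk size and helper names here are illustrative assumptions.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # gzip files (and BGZF, a gzip variant) start with these bytes


def open_maybe_compressed(path):
    """Sniff the magic bytes and insert a decompressor into the stream if needed."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")  # transparent streaming decompression
    return open(path, "r")


def stream_chunks(path, lines_per_chunk=4):
    """Yield uncompressed chunks of the input as they become available,
    leaving the original file compressed in place."""
    chunk = []
    with open_maybe_compressed(path) as stream:
        for line in stream:
            chunk.append(line)
            if len(chunk) == lines_per_chunk:
                yield "".join(chunk)  # e.g. hand this split off to the cluster now
                chunk = []
    if chunk:
        yield "".join(chunk)  # final partial chunk
```

Because each chunk is yielded while the rest of the file is still being decompressed, downstream jobs can start before the whole input has been read.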
We find that job splitting, compression handling, and potentially other low-level data-file transforms (e.g. handling list files, or random sampling of the input) are interdependent. They are potentially best encapsulated in a lower layer, which would avoid cluttering Galaxy's high-level, bio-oriented data-type ontologies.
Currently (as per my earlier post) we are integrating this approach into Galaxy by modifying selected tool config files; longer term, we think this low-level data-transform and task-splitting layer could be slotted into the core Galaxy stack.
John Chilton kindly set up a Trello card on the general topic of task splitting - this is at
(Apologies if this repeats some of my earlier post - the point here is that, in practice, we find compress/un-compress and task splitting to be interdependent.)
(Having said the above - encapsulating data transforms such as compress/un-compress even lower down the stack, in the file system itself, as per the public Galaxy instance (ZFS), could be pretty hard to beat!)