Compressed data files in Galaxy? (e.g. GZIP or BGZF)
Hello all,

What is the current status in Galaxy for supporting compressed files?

We've talked about this before; for example, in addition to FASTQ, many of us have expressed a wish to work with gzipped FASTQ. I understand that some have customized their local Galaxy installations to use gzipped FASTQ as a specific datatype - I'm more interested in a general, file-format-neutral solution.

Also, I'd like to be able to use BGZF (not just GZIP) because it is better for random access - see for example http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html - and it makes it much easier to break up large data files for sharing over a cluster (i.e. it could be exploited in the current Galaxy code for splitting large sequence files).

The 11 May 2012 Galaxy Development News Brief http://lists.bx.psu.edu/pipermail/galaxy-dev/2012-May/009757.html mentions tabix indexing, which uses bgzip - so is there something general in place yet to allow tool wrappers to say they accept not just given file formats, but compressed versions of those formats?

Ideally I'd like to be able to write an XML tool description saying a tool produces BGZF-compressed tabular data, or GZIP-compressed Sanger FASTQ, etc. Similarly, I'd like to specify that my tool accepts FASTA or gzipped FASTA (including BGZF FASTA). Meanwhile, for older tools that say they accept only uncompressed FASTA, Galaxy could automatically decompress any compressed FASTA entries in my history on demand.

Peter
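For reference, BGZF can be told apart from plain GZIP by its header alone: a BGZF block is ordinary GZIP with the FEXTRA flag set and a mandatory "BC" extra subfield (per the SAM/BAM specification). Here is a minimal Python sketch of such a sniffer - this is not Galaxy's own code, and a robust version would walk every extra subfield rather than assume "BC" comes first:

    def sniff_compression(filename):
        """Return 'bgzf', 'gzip', or None for an uncompressed file."""
        with open(filename, "rb") as handle:
            header = handle.read(18)
        if len(header) < 2 or header[:2] != b"\x1f\x8b":
            return None  # no GZIP magic bytes at all
        if len(header) < 18 or not header[3] & 0x04:
            return "gzip"  # FEXTRA flag unset, so it cannot be BGZF
        # In a BGZF block the first extra subfield is 'BC' with a
        # 2-byte little-endian payload giving the block size.
        if header[12:16] == b"BC\x02\x00":
            return "bgzf"
        return "gzip"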
On May 16, 2012, at 9:47 AM, Peter Cock wrote:
Hello all,
What is the current status in Galaxy for supporting compressed files?
Hi Peter, Unfortunately, there's been nothing done on this so far. I'd love to see it happen, but it hasn't been on the top of our priorities. --nate
We've talked about this before; for example, in addition to FASTQ, many of us have expressed a wish to work with gzipped FASTQ. I understand that some have customized their local Galaxy installations to use gzipped FASTQ as a specific datatype - I'm more interested in a general, file-format-neutral solution.
Also, I'd like to be able to use BGZF (not just GZIP) because it is better for random access - see for example http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html - and it makes it much easier to break up large data files for sharing over a cluster (i.e. it could be exploited in the current Galaxy code for splitting large sequence files).
The 11 May 2012 Galaxy Development News Brief http://lists.bx.psu.edu/pipermail/galaxy-dev/2012-May/009757.html mentions tabix indexing - that uses bgzip, so is there something general in place yet to allow tool wrappers to say they accept not just given file formats, but different compressed versions of file formats?
Ideally I'd like to be able to write an XML tool description saying a tool produces BGZF-compressed tabular data, or GZIP-compressed Sanger FASTQ, etc. Similarly, I'd like to specify that my tool accepts FASTA or gzipped FASTA (including BGZF FASTA). Meanwhile, for older tools that say they accept only uncompressed FASTA, Galaxy could automatically decompress any compressed FASTA entries in my history on demand.
Peter
On Wed, May 16, 2012 at 9:39 PM, Nate Coraor <nate@bx.psu.edu> wrote:
On May 16, 2012, at 9:47 AM, Peter Cock wrote:
Hello all,
What is the current status in Galaxy for supporting compressed files?
Hi Peter,
Unfortunately, there's been nothing done on this so far. I'd love to see it happen, but it hasn't been on the top of our priorities.
--nate
Hi Nate,

Where does this sit on the Galaxy team's priorities now, 18 months on? I think I asked about this at the GCC2013, and it was seen as important but not yet at the top of the priorities list.

I know that the main Galaxy instance has moved from Penn State to the iPlant Collaborative at the Texas Advanced Computing Center, and that probably gave you some breathing room for storage - but surely using compressed files would help enormously?

I would still love to see built-in support for a range of compressions, starting with none and gzipped as the priorities, but also ideally the option of BGZF (the random-access gzip subtype), BZ2 (better compression), XZ, blocked XZ, etc.

My vision is that in the tool XML we'd have optional tags on input data parameters to define supported compression types as well as supported file types, and likewise for the output tags. For example, consider a FASTQ to FASTQ tool like a read trimmer or filter, such as my seq_filter_by_id.xml
https://github.com/peterjc/pico_galaxy/blob/master/tools/seq_filter_by_id/se...

...
<param name="input_file" type="data" format="fasta,fastq,sff"
       label="Sequence file to filter on the identifiers"
       help="FASTA, FASTQ, or SFF format." />
...

If I modified this tool to handle gzipped files, then I would tell Galaxy that in some way, e.g.

...
<param name="input_file" type="data" format="fasta,fastq,sff"
       compression="none,gzip"
       label="Sequence file to filter on the identifiers"
       help="FASTA, FASTQ, or SFF format." />
...

(For any input file without a compression tag, assume the tool expects no compression.)

In the same way that the Cheetah syntax gives access to the input file's type via $input_file.ext, we'd need to access the compression type via something like $input_file.compression (e.g. if I needed to add a -gzip switch to the command line).

Then, before running the tool, Galaxy would have to compare the current compression of $input_file as stored on disk to the range of compression types the tool can accept. If the tool is happy as is, great - just run it. If not, then Galaxy would have to add a decompression step (e.g. gunzip).

For the output files, Galaxy could probably sniff the files to spot the compression automatically, or it could be specified by the tool wrapper's XML. If we want Galaxy to automatically store files gzipped/bz2/..., then after running the tool it would have to call gzip/bzip2/etc - but the tool and tool wrapper don't need to care about this.

These potential (de)compressions could initially be done via $TEMP (or some configurable scratch space local to each cluster node), but we could avoid creating temp files by using named pipes. This would only work where the tool streamed the input/output file without using random access (seek calls), thus we would need additional XML markup, e.g.

...
<param name="input_file" type="data" format="fasta,fastq,sff"
       compression="none,gzip" stream_file="true"
       label="Sequence file to filter on the identifiers"
       help="FASTA, FASTQ, or SFF format." />
...

Having tools mark input/output files as streamable would open the door to some fancy optimisations for chaining tools together where the intermediate files are never stored directly - but piped from one tool to another. This would be very attractive for running some workflows, see e.g.:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2013-November/017422.html

However, a simple decompression/compression step with staging files in $TEMP alone would be a big step forward.

Regards,

Peter
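To make the proposal concrete, here is a minimal Python sketch of the staging step described above. Everything here is hypothetical, not Galaxy code: stage_input, the accepted list (which would come from the imagined compression="..." attribute), and allow_fifo (standing in for the imagined stream_file="true" flag) are illustration only, and only gzip is handled.

    import gzip
    import os
    import shutil
    import tempfile
    import threading

    def stage_input(path, on_disk, accepted, allow_fifo=False):
        """Return a path the tool can read, decompressing only if needed.

        Hypothetical sketch: on_disk is the dataset's compression as
        stored ('none' or 'gzip'), accepted is what the tool declares.
        """
        if on_disk in accepted:
            return path  # tool handles this compression itself; run as-is
        if on_disk == "none":
            raise ValueError("this sketch cannot add compression on the fly")
        if allow_fifo:
            # stream_file="true": decompress through a named pipe so no
            # full-size temp copy ever hits the disk (POSIX only).
            fifo = os.path.join(tempfile.mkdtemp(), "input.fifo")
            os.mkfifo(fifo)
            def pump():
                # open() on the FIFO blocks until the tool opens it to read
                with gzip.open(path, "rb") as src, open(fifo, "wb") as dst:
                    shutil.copyfileobj(src, dst)
            threading.Thread(target=pump, daemon=True).start()
            return fifo
        # default: decompress to scratch space ($TEMP) before the job runs
        fd, tmp = tempfile.mkstemp(suffix=".dat")
        with gzip.open(path, "rb") as src, os.fdopen(fd, "wb") as dst:
            shutil.copyfileobj(src, dst)
        return tmp

Note the FIFO branch only works if the tool reads the input strictly sequentially - a single seek() call would fail on a pipe - which is exactly why the proposal needs an explicit streamable marker in the XML.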
On Thu, Nov 21, 2013 at 2:27 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Wed, May 16, 2012 at 9:39 PM, Nate Coraor <nate@bx.psu.edu> wrote:
On May 16, 2012, at 9:47 AM, Peter Cock wrote:
Hello all,
What is the current status in Galaxy for supporting compressed files?
Hi Peter,
Unfortunately, there's been nothing done on this so far. I'd love to see it happen, but it hasn't been on the top of our priorities.
--nate
Hi Nate,
Where does this sit on the Galaxy team's priorities now, 18 months on? I think I asked about this at the GCC2013, and it was seen as important but not yet at the top of the priorities list.
Nate's reply on Twitter explained that the public Galaxy instance (formerly hosted at Penn State, now in Texas) uses transparent compression at the file-system level with ZFS - so Galaxy doesn't need to compress individual files. Neat: https://twitter.com/natefoo/status/403531922514522112

Peter
-----Original Message-----
From: galaxy-dev-bounces@lists.bx.psu.edu [mailto:galaxy-dev-bounces@lists.bx.psu.edu] On Behalf Of Peter Cock
Sent: Friday, 22 November 2013 3:51 a.m.
To: Nate Coraor
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] Compressed data files in Galaxy? (e.g. GZIP or BGZF)
On Thu, Nov 21, 2013 at 2:27 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Wed, May 16, 2012 at 9:39 PM, Nate Coraor <nate@bx.psu.edu> wrote:
On May 16, 2012, at 9:47 AM, Peter Cock wrote:
Hello all,
What is the current status in Galaxy for supporting compressed files?
Hi Peter,
Unfortunately, there's been nothing done on this so far. I'd love to see it happen, but it hasn't been on the top of our priorities.
--nate
Hi Nate,
Where does this sit on the Galaxy team's priorities now, 18 months on? I think I asked about this at the GCC2013, and it was seen as important but not yet at the top of the priorities list.
Nate's reply on Twitter explained that the public Galaxy instance (formerly hosted at Penn State, now in Texas) uses transparent compression at the file-system level with ZFS - so Galaxy doesn't need to compress individual files. Neat: https://twitter.com/natefoo/status/403531922514522112
Peter
Our approach has been to integrate compress/un-compress with the job splitter / cluster launch layer that we've been working on (the current incarnation is https://bitbucket.org/agr-bifo/tardis ).

The tardis approach is to sniff the input(s) and, if necessary, insert an uncompressor into the input stream - the original data is left compressed in place, and only the splits of the data are uncompressed. Each uncompressed data chunk is launched for processing on the cluster as soon as it becomes available from the uncompressed input stream. This can be quite a big performance advantage compared with uncompressing the entire input file before splitting it, which can be quite slow and has a bigger disk footprint.

We find that job splitting, compression handling (and potentially other low-level data file transforms - e.g. handling list files, random sampling of input) are interdependent. Perhaps these are best encapsulated in a lower layer, which would avoid cluttering Galaxy's high-level bio-oriented datatype ontologies?

Currently (as per my earlier post) we are integrating this approach into Galaxy by modifying selected tool config files; longer term we think it could be possible to slot this low-level data transform and task splitting layer into the core Galaxy stack. John Chilton kindly set up a Trello card on the general topic of task splitting - this is at https://trello.com/c/H87LotF7

(Apologies if this repeats some of my earlier post - the point here is that, in practice, we find compress/un-compress and task splitting to be interdependent.)

(Having said the above - encapsulating data transforms such as compress/un-compress even lower down the stack, in the file system itself, as per the public Galaxy instance (ZFS), could be pretty hard to beat!)
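As an illustration of the streaming pattern Alan describes, here is a rough Python sketch (not tardis's actual code): decompress the gzipped input as a stream and yield each chunk of complete FASTQ records as soon as it is ready, so the whole file is never expanded on disk at once. The function name and chunk size are made up for the example.

    import gzip

    def split_fastq_stream(path, records_per_chunk=1000000):
        """Yield lists of FASTQ records (4 lines each) from a gzipped file."""
        chunk, lines = [], []
        with gzip.open(path, "rt") as handle:
            for line in handle:
                lines.append(line)
                if len(lines) == 4:          # one complete FASTQ record
                    chunk.append("".join(lines))
                    lines = []
                    if len(chunk) == records_per_chunk:
                        yield chunk          # hand this chunk off now
                        chunk = []
        if chunk:
            yield chunk                      # final partial chunk

    # e.g. for i, chunk in enumerate(split_fastq_stream("reads.fastq.gz")):
    #     write the chunk to scratch and launch cluster job i immediately,
    #     while the decompressor is still working through the input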
participants (3)
- McCulloch, Alan
- Nate Coraor
- Peter Cock