On Wed, May 16, 2012 at 9:39 PM, Nate Coraor firstname.lastname@example.org wrote:
On May 16, 2012, at 9:47 AM, Peter Cock wrote:
What is the current status in Galaxy for supporting compressed files?
Unfortunately, there's been nothing done on this so far. I'd love to see it happen, but it hasn't been on the top of our priorities.
Where does this sit on the Galaxy team's priorities now, 18 months on? I think I asked about this at GCC2013, and it was seen as important but not yet at the top of the priorities list.
I know that the main Galaxy instance has moved from Penn State to the iPlant Collaborative at the Texas Advanced Computing Center, and that probably gave you some breathing room for storage - but surely using compressed files would help enormously?
I would still love to see built-in support for a range of compressions, starting with none and gzipped as the priorities, but also ideally the option of BGZF (the random access gzip subtype), BZ2 (better compression), XZ, blocked XZ, etc.
My vision is that in the tool XML we'd have optional tags on input data parameters to define supported compression types as well as supported file types, likewise for the output tags.
For example, consider a FASTQ to FASTQ tool such as a read trimmer or filter, e.g. my seq_filter_by_id.xml:
... <param name="input_file" type="data" format="fasta,fastq,sff" label="Sequence file to filter on the identifiers" help="FASTA, FASTQ, or SFF format." /> ...
If I modified this tool to handle gzipped files, then I would tell Galaxy that in some way, e.g.
... <param name="input_file" type="data" format="fasta,fastq,sff" compression="none,gzip" label="Sequence file to filter on the identifiers" help="FASTA, FASTQ, or SFF format." /> ...
(For any input file without a compression tag, assume it expects no compression)
In the same way that the Cheetah syntax gives access to the input file's type via $input_file.ext, we'd need to access the compression type via something like $input_file.compression (e.g. if I needed to add a -gzip switch to the command line).
Then, before running the tool, Galaxy would have to compare the current compression of $input_file as stored on disk to the range of compression types the tool can accept. If the tool is happy as is, great - just run it. If not, then Galaxy would have to add a decompression step (e.g. gunzip).
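As a rough sketch of that check (in Python, since Galaxy itself is written in Python): the function name, the compression labels, and the command table below are all hypothetical illustrations, not existing Galaxy code. It only covers the simple case described above, where a tool that cannot read the stored compression gets a plain decompressed copy.

```python
# Hypothetical table mapping a compression label to a command that
# writes the decompressed data to stdout. Not part of Galaxy.
DECOMPRESSORS = {
    "gzip": ["gunzip", "-c"],
    "bz2": ["bunzip2", "-c"],
}


def staging_command(stored, accepted=("none",)):
    """Return the decompression command needed before running a tool.

    stored   -- compression of the dataset on disk, e.g. "gzip" or "none"
    accepted -- compression types the tool declared; a tool with no
                compression attribute is assumed to accept only "none"

    Returns None when the tool can read the file as it is.
    """
    if stored in accepted:
        return None  # The tool is happy as is, just run it.
    if "none" not in accepted:
        # Recompressing into another format is outside this sketch.
        raise ValueError("tool cannot read %r or plain input" % stored)
    return DECOMPRESSORS[stored]
```

For instance, a tool declaring compression="none,gzip" would get None for a gzipped dataset, while a tool with no compression attribute would get the gunzip command.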
For the output files, Galaxy could probably sniff the files to spot the compression automatically, or it could be specified by the tool wrapper's XML.
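Sniffing could be as simple as checking the leading magic bytes of the output file. This is a minimal sketch under that assumption; the function name and the returned labels are made up for illustration, and it does not try to tell BGZF apart from ordinary gzip.

```python
# Well-known magic byte prefixes for three common compression formats.
MAGIC = (
    (b"\x1f\x8b", "gzip"),       # also matches BGZF, which is valid gzip
    (b"BZh", "bz2"),
    (b"\xfd7zXZ\x00", "xz"),
)


def sniff_compression(path):
    """Guess the compression of a file from its first few bytes.

    Returns "gzip", "bz2", "xz", or "none" if no known prefix matches.
    """
    with open(path, "rb") as handle:
        head = handle.read(6)
    for magic, name in MAGIC:
        if head.startswith(magic):
            return name
    return "none"
```

A plain FASTQ file starts with "@" and so falls through to "none".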
If we want Galaxy to automatically store files gzipped/bz2/..., then after running the tool it would have to call gzip/bzip2/etc - but the tool and tool wrapper don't need to care about this.
These potential (de)compressions could initially be done via $TEMP (or some configurable scratch space local to each cluster node), but we could avoid creating temp files by using named pipes. This would only work where the tool streamed the input/output file without using random access (seek calls), thus we would need additional XML markup, e.g.
... <param name="input_file" type="data" format="fasta,fastq,sff" compression="none,gzip" stream_file="true" label="Sequence file to filter on the identifiers" help="FASTA, FASTQ, or SFF format." /> ...
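To illustrate the named pipe idea, here is a small Python sketch (POSIX only, since it relies on os.mkfifo). The function names are hypothetical, and read_via_fifo merely stands in for a streaming tool: it reads the pipe from start to end like an ordinary file and never seeks, which is the behaviour the stream_file="true" markup would be promising.

```python
import gzip
import os
import shutil
import tempfile
import threading


def decompress_to_fifo(gz_path, fifo_path):
    """Stream the decompressed content of gz_path into a named pipe."""
    # Opening a FIFO for writing blocks until a reader opens the other end.
    with gzip.open(gz_path, "rb") as src, open(fifo_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def read_via_fifo(gz_path):
    """Stand-in for a streaming tool reading decompressed input via a FIFO."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "input_file")
    os.mkfifo(fifo_path)  # No decompressed copy is ever written to disk.
    writer = threading.Thread(
        target=decompress_to_fifo, args=(gz_path, fifo_path)
    )
    writer.start()
    try:
        # The "tool" sees a plain file path and reads it sequentially.
        with open(fifo_path, "rb") as handle:
            return handle.read()
    finally:
        writer.join()
        shutil.rmtree(workdir)
```

A tool that calls seek on its input would fail on such a pipe, which is why the wrapper would need to declare streaming support explicitly.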
Having tools mark input/output files as streamable would open the door to some fancy optimisations for chaining tools together where the intermediate files are never stored directly - but piped from one tool to another. This would be very attractive for running some workflows, see e.g.: http://lists.bx.psu.edu/pipermail/galaxy-dev/2013-November/017422.html
However, a simple decompression/compression step with staging files in $TEMP alone would be a big step forward.