Re: [galaxy-dev] Compressed data files in Galaxy? (e.g. GZIP or BGZF)

21 Nov 2013

On Wed, May 16, 2012 at 9:39 PM, Nate Coraor <nate@bx.psu.edu> wrote:
> ...
> On May 16, 2012, at 9:47 AM, Peter Cock wrote:
>> ...
>> Hello all,
>> What is the current status in Galaxy for supporting compressed files?
>
> Hi Peter,
> Unfortunately, there's been nothing done on this so far.  I'd love to
> see it happen, but it hasn't been on the top of our priorities.
> --nate

Hi Nate,

Where does this sit on the Galaxy team's priorities now, 18 months on?
I think I asked about this at GCC2013, and it was seen as
important but not yet at the top of the priorities list.

I know that the main Galaxy instance has moved from Penn State to
the iPlant Collaborative at the Texas Advanced Computing Center,
and that probably gave you some breathing room for storage - but
surely using compressed files would help enormously?

I would still love to see built-in support for a range of compressions,
starting with none and gzipped as the priorities, but also ideally the
option of BGZF (the random access gzip subtype), BZ2 (better
compression), XZ, blocked XZ, etc.

My vision is that in the tool XML we'd have optional tags on
input data parameters to define supported compression types
as well as supported file types, likewise for the output tags.

For example, consider a FASTQ to FASTQ tool such as a read trimmer
or filter, like my seq_filter_by_id.xml:

https://github.com/peterjc/pico_galaxy/blob/master/tools/seq_filter_by_id/se...

...
<param name="input_file" type="data" format="fasta,fastq,sff"
       label="Sequence file to filter on the identifiers"
       help="FASTA, FASTQ, or SFF format." />
...

If I modified this tool to handle gzipped files, then I would tell
Galaxy that in some way, e.g.

...
<param name="input_file" type="data" format="fasta,fastq,sff"
       compression="none,gzip"
       label="Sequence file to filter on the identifiers"
       help="FASTA, FASTQ, or SFF format." />
...

(For any input file without a compression tag, assume it
expects no compression)

In the same way that the Cheetah syntax gives access to the
input file's type via $input_file.ext, we'd need to access the
compression type via something like $input_file.compression
(e.g. if I needed to add a -gzip switch to the command line).

Then, before running the tool, Galaxy would have to compare
the current compression of $input_file as stored on disk to
the range of compression types the tool can accept. If the
tool is happy as is, great - just run it. If not, then Galaxy
would have to add a decompression step (e.g. gunzip).
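
A rough Python sketch of that check, just to make the idea concrete.
The function name and the compression labels are hypothetical (nothing
like this exists in Galaxy as far as I know), and it only covers the
gzip and bz2 cases:

```python
# Hypothetical helper: decide whether an input needs a decompression step.
# The labels ("none", "gzip", "bz2") mirror the proposed compression="..."
# attribute and are assumptions, not existing Galaxy names.
DECOMPRESSORS = {
    "gzip": ["gunzip", "-c"],
    "bz2": ["bunzip2", "-c"],
}


def plan_input_staging(stored, accepted=("none",)):
    """Return a decompression command, or None if the file is usable as is.

    stored   -- compression of the dataset as held on disk
    accepted -- compressions the tool declared; a tool with no
                compression tag is assumed to expect no compression
    """
    if stored in accepted:
        return None  # tool is happy as is, just run it
    if "none" not in accepted or stored not in DECOMPRESSORS:
        # Converting between two compressed forms is out of scope here.
        raise ValueError("cannot stage %r for a tool accepting %r"
                         % (stored, accepted))
    return DECOMPRESSORS[stored]
```

So a gzipped dataset passed to an unmodified tool would get a gunzip
step, while the same dataset passed to a tool declaring
compression="none,gzip" would be handed over untouched.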

For the output files, Galaxy could probably sniff the files to
spot the compression automatically, or it could be specified
by the tool wrapper's XML.
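
Sniffing could be as simple as looking at the first few bytes. A
minimal sketch in Python, where sniff_compression is a made-up name and
the BGZF test assumes the "BC" extra subfield comes first in the gzip
header (which is how BGZF files are normally written):

```python
def sniff_compression(path):
    """Guess the compression of a file from its leading magic bytes."""
    with open(path, "rb") as handle:
        head = handle.read(16)
    if head[:2] == b"\x1f\x8b":  # gzip magic number
        # BGZF is gzip with the FEXTRA flag set and a "BC" subfield
        # holding the block size; assumed here to be the first subfield.
        if head[:4] == b"\x1f\x8b\x08\x04" and head[12:14] == b"BC":
            return "bgzf"
        return "gzip"
    if head[:3] == b"BZh":  # bzip2 magic number
        return "bz2"
    return "none"
```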

If we want Galaxy to automatically store files gzipped/bz2/...,
then after running the tool it would have to call gzip/bzip2/etc -
but the tool and tool wrapper don't need to care about this.

--

These potential (de)compressions could initially be done via
$TEMP (or some configurable scratch space local to each
cluster node), but we could avoid creating temp files by
using named pipes. This would only work where the tool
streamed the input/output file without using random access
(seek calls), thus we would need additional XML markup,
e.g.

...
<param name="input_file" type="data" format="fasta,fastq,sff"
       compression="none,gzip" stream_file="true"
       label="Sequence file to filter on the identifiers"
       help="FASTA, FASTQ, or SFF format." />
...
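
To illustrate the named pipe idea, here is a small Python sketch (POSIX
only, no error handling). The function names are hypothetical, and the
plain read() stands in for a tool that streams its input without
seeking:

```python
import gzip
import os
import shutil
import tempfile
import threading


def feed_fifo(gz_path, fifo_path):
    """Decompress gz_path into the named pipe.

    Opening the pipe for writing blocks until a reader opens it.
    """
    with gzip.open(gz_path, "rb") as src, open(fifo_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def read_via_fifo(gz_path):
    """Read a gzipped file through a FIFO, with no decompressed temp file."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "input_file.fifo")
    os.mkfifo(fifo_path)
    feeder = threading.Thread(target=feed_fifo, args=(gz_path, fifo_path))
    feeder.start()
    try:
        # Stand-in for the tool: reads front to back and never seeks.
        with open(fifo_path, "rb") as tool_input:
            data = tool_input.read()
    finally:
        feeder.join()
        shutil.rmtree(workdir)
    return data
```

A tool that called seek() on the pipe would fail, which is why the
stream_file markup would be needed.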

Having tools mark input/output files as streamable would
open the door to some fancy optimisations for chaining
tools together where the intermediate files are never stored
directly - but piped from one tool to another. This would
be very attractive for running some workflows, see e.g.:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2013-November/017422.html

However, a simple decompression/compression step with
staging files in $TEMP alone would be a big step forward.
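
For comparison, the simple staging version might look like this in
Python. Again the function names are made up, only gzip is handled, and
scratch_dir=None just falls back to the system temp directory:

```python
import gzip
import os
import shutil
import tempfile


def stage_decompressed(gz_path, scratch_dir=None):
    """Decompress gz_path into scratch space and return the new path."""
    fd, staged = tempfile.mkstemp(suffix=".dat", dir=scratch_dir)
    with gzip.open(gz_path, "rb") as src, os.fdopen(fd, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return staged


def store_compressed(plain_path, gz_path):
    """Gzip a tool's plain output for storage, then drop the plain copy."""
    with open(plain_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(plain_path)
```

The tool itself would only ever see the staged plain file.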

Regards,

Peter