Re: [galaxy-dev] gzipped fastq reader

9 Jul 2013

I will implement this if the Galaxy team likes the approach.

We did this in the UCSC Genome Browser code years ago: a single open_helper call handles gzip, http, ftp and pipes. No need to care about how the data is compressed or where the data resides.
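To illustrate the idea of abstracting the open, here is a minimal Python sketch. It is not the UCSC code: the name open_helper is borrowed from the message above, but the signature and behaviour are assumptions, and http, ftp and pipe sources are left out. It picks a decompressor by looking at the first bytes of the file rather than the file extension.

```python
import bz2
import gzip

# Leading "magic" bytes that identify each compression format.
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"


def open_helper(path, mode="rt"):
    """Hypothetical helper: open a local file for reading, decompressing if needed.

    The caller gets a file-like object either way and does not need to know
    whether the data on disk is gzip, bzip2 or uncompressed.
    """
    with open(path, "rb") as handle:
        magic = handle.read(3)
    if magic.startswith(GZIP_MAGIC):
        return gzip.open(path, mode)
    if magic.startswith(BZIP2_MAGIC):
        return bz2.open(path, mode)
    return open(path, mode)
```

A caller would then write `with open_helper("reads.fastq.gz") as handle:` and iterate over lines exactly as for an uncompressed FASTQ file (the file name here is only an example).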

Wouldn't it be great to be able to pipe data between workflow steps rather than writing to disk? I admit that this will require some work, but the first step is to abstract the open.

On Jul 9, 2013, at 10:38 AM, Peter Cock wrote:
...
On Tue, Jul 9, 2013 at 5:53 PM, Robert Baertsch <rbaertsc@ucsc.edu> wrote:
...
On Jul 8, 2013, at 3:33 PM, Peter Cock wrote:
...
The tools available in Galaxy are written in a range
of languages including C, Perl, R, etc. Yes, some are in Python,
but of those most are independent of Galaxy and can be used
separately from Galaxy.
The helper function would have to be ported to R. We are talking
about how Galaxy handles compressed data. Once we decide that, we
can determine how best to implement it.
Individual tools called from Galaxy read and create the files -
and we can't usually control them at this level (modifying them all
to call a Galaxy managed file open mechanism is not an option).
...
Proposal: Do not treat compressed data as a separate data type.
Treat it as an independent attribute that can be applied to any data.
Otherwise you will have to create a gzipped, zip and bz2 type for
every type that you want to compress.
That's what I've been saying - the fact that some people are
already using a new gzipped FASTQ format within their Galaxy
instances is practical, but I view it as a short term solution only.
...
...
...
Encoding the gzip status in the datatype will create an explosion of
datatypes. Compression is not actually a datatype; it tells you nothing
about the content of the data that is stored in the file.
What we'd previously discussed was a dual system, holding
the file type as now (e.g. FASTA, SAM, GFF3, etc) and any
compression (e.g., None, normal GZIP, BGZF which is a
GZIP variant, BZIP2, etc).
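A rough Python sketch of such a dual system follows. The class and field names (Compression, DatasetInfo, combined_type_count) are hypothetical and are not part of Galaxy; the point is only that the file type and the compression are recorded as two independent attributes, so adding a compression scheme does not multiply the number of datatypes.

```python
from dataclasses import dataclass
from enum import Enum


class Compression(Enum):
    """Compression applied to a dataset, independent of its file type."""
    NONE = "none"
    GZIP = "gzip"
    BGZF = "bgzf"
    BZIP2 = "bz2"


@dataclass(frozen=True)
class DatasetInfo:
    """Hypothetical record holding file type and compression separately."""
    datatype: str
    compression: Compression = Compression.NONE


def combined_type_count(datatypes, compressions):
    """Number of datatypes needed if compression were encoded in the type."""
    return len(datatypes) * len(compressions)


reads = DatasetInfo("fastq", Compression.GZIP)
annotation = DatasetInfo("gff3", Compression.BGZF)
```

With, say, four file types and the four compression values above, encoding compression in the datatype would need `combined_type_count(...)` = 16 types, while the dual system needs only the four file types plus one compression attribute.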
What about tabular? Should we create tab.gz, tab.bz2 and tab.zip also?
Note ZIP is a bit different, as it is often a multiple file bundle -
it behaves differently from GZIP, BGZF, XZ, BZIP2 etc in that
regard.
But otherwise, yes. As a specific example, the tabix tool uses BGZF
compressed tabular data to combine compression and efficient
random access. This would be useful for many annotation files
(e.g. GTF, GFF3).
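Since BGZF is a GZIP variant, telling the two apart means looking inside the gzip header. The following Python sketch does this under the assumption that BGZF blocks carry a gzip "extra" subfield with the identifier bytes "BC" and a 2-byte payload, as described in the SAM/BAM specification; the function name is_bgzf is made up for illustration.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF block, False otherwise.

    Plain gzip files, and anything too short to hold a gzip header,
    are reported as not BGZF.
    """
    with open(path, "rb") as handle:
        header = handle.read(12)
        if len(header) < 12:
            return False
        # gzip magic bytes followed by the deflate compression method.
        if header[:3] != b"\x1f\x8b\x08":
            return False
        # FLG bit 2 (FEXTRA) must be set for an extra field to be present.
        if not header[3] & 0x04:
            return False
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)
    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if si1 == ord("B") and si2 == ord("C") and slen == 2:
            return True
        pos += 4 + slen
    return False
```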
...
This will quickly get out of hand and create a mess for tool
developers that need to support all these types.
Why? Individual tool developers don't need to know if Galaxy
is keeping the original data file on disk compressed - unless
the tool XML says otherwise, Galaxy would hide this detail
and call the tool with an uncompressed input file.
(A Unix named pipe which decompresses the file on the fly would
be a potential alternative - but only if the tool XML was marked
up to say that an input could be streamed. The default must be
to assume potential random access to the input files)
...
The tool code and tool xml should be written to handle uncompressed
data and galaxy should handle the details of decompression. This
is not hard to do.
It isn't trivial either ;)
Peter