On Tue, Jul 9, 2013 at 5:53 PM, Robert Baertsch <rbaertsc@ucsc.edu> wrote:
On Jul 8, 2013, at 3:33 PM, Peter Cock wrote:
The tools available in Galaxy are written in a range of languages including C, Perl, R, etc. Yes, some are in Python, but of those most are independent of Galaxy and can be used separately from Galaxy.
The helper function would have to be ported to R. We are talking about how Galaxy compresses data. Once we decide that, we can determine how best to implement it.
Individual tools called from Galaxy read and create the files - and we can't usually control them at this level (modifying them all to call a Galaxy managed file open mechanism is not an option).
Proposal: Do not treat compressed data as a separate data type. Treat it as an independent attribute that can be applied to any data. Otherwise you will have to create a gzipped, zip and bz2 type for every type that you want to compress.
That's what I've been saying - the fact that some people are already using a new gzipped FASTQ format within their Galaxy instances is practical, but I view it as a short term solution only.
Encoding the gzip status in the datatype will create an explosion of datatypes. Compression is not actually a datatype; it tells you nothing about the content that is stored in the file.
What we'd previously discussed was a dual system, holding the file type as now (e.g. FASTA, SAM, GFF3, etc) and any compression (e.g., None, normal GZIP, BGZF which is a GZIP variant, BZIP2, etc).
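A minimal sketch of that dual system, assuming hypothetical names throughout (`Compression`, `DatasetDescriptor` and `tool_accepts` are illustrative only, not existing Galaxy classes or functions). It records the content type and the compression as two independent fields, so a tool that accepts FASTQ matches a dataset whether or not it is stored compressed:

```python
from dataclasses import dataclass
from enum import Enum


class Compression(Enum):
    """Storage-level compression, independent of the content type."""
    NONE = "none"
    GZIP = "gzip"
    BGZF = "bgzf"    # GZIP variant
    BZIP2 = "bzip2"


@dataclass(frozen=True)
class DatasetDescriptor:
    """Hypothetical record holding the two attributes separately."""
    file_type: str                              # e.g. "fasta", "sam", "gff3"
    compression: Compression = Compression.NONE


def tool_accepts(accepted_types, dataset):
    """Match on content type only; compression plays no part in the decision."""
    return dataset.file_type in accepted_types
```

With this layout, adding a new compression scheme means adding one enum member rather than one new datatype per existing format.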
What about tabular? Should we create tab.gz, tab.bz2 and tab.zip also?
Note ZIP is a bit different, as it is often a multiple file bundle - it behaves differently from GZIP, BGZF, XZ, BZIP2 etc in that regard. But otherwise, yes. As a specific example, the tabix tool uses BGZF compressed tabular data to combine compression and efficient random access. This would be useful for many annotation files (e.g. GTF, GFF3).
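For illustration, these containers can be told apart from the first few bytes of a file. The sketch below is an assumption-laden example, not Galaxy code: `sniff_compression` is a hypothetical name, and the BGZF check assumes the 'BC' extra subfield comes first in the gzip extra field, which is the usual layout but not the only one a valid file could use.

```python
def sniff_compression(header: bytes) -> str:
    """Guess the compression container from the leading bytes of a file."""
    if header.startswith(b"\x1f\x8b"):
        # BGZF is a gzip variant: the FEXTRA flag (0x04) is set and the
        # extra field carries a 'BC' subfield (assumed here to be first).
        if len(header) >= 14 and header[3] & 0x04 and header[12:14] == b"BC":
            return "bgzf"
        return "gzip"
    if header.startswith(b"BZh"):
        return "bzip2"
    if header.startswith(b"\xfd7zXZ\x00"):
        return "xz"
    if header.startswith(b"PK\x03\x04"):
        # ZIP is an archive, so it may bundle several files.
        return "zip"
    return "none"
```

Reading the first 14 bytes of a file is enough for every branch above.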
This will quickly get out of hand and create a mess for tool developers who need to support all these types.
Why? Individual tool developers don't need to know if Galaxy is keeping the original data file on disk compressed - unless the tool XML says otherwise, Galaxy would hide this detail and call the tool with an uncompressed input file. (A Unix named pipe which decompresses the file on the fly would be a potential alternative - but only if the tool XML was marked up to say that an input could be streamed. The default must be to assume potential random access to the input files.)
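A rough sketch of the default (non-streaming) behaviour, using only the Python standard library. The function name `materialize_uncompressed` and the string labels in `OPENERS` are hypothetical, chosen for this example; nothing here is an existing Galaxy API. The idea is to write a plain temporary copy that the tool can seek in freely, and to hand over the original path when the data is already uncompressed:

```python
import bz2
import gzip
import shutil
import tempfile

# Hypothetical mapping from a stored compression label to a stdlib opener.
# BGZF files are valid multi-member gzip files, so gzip.open can read them.
OPENERS = {"gzip": gzip.open, "bgzf": gzip.open, "bzip2": bz2.open}


def materialize_uncompressed(path, compression=None):
    """Return a path to an uncompressed file suitable for random access."""
    if compression is None:
        return path  # already plain; pass the original straight to the tool
    opener = OPENERS[compression]
    with opener(path, "rb") as src, \
            tempfile.NamedTemporaryFile(delete=False) as dst:
        shutil.copyfileobj(src, dst)  # stream in chunks, not all in memory
        return dst.name  # caller is responsible for removing this file
```

The caller would substitute the returned path into the tool's command line and delete the temporary copy once the job finishes. The cost is extra disk space and I/O per job, which is the trade-off the named-pipe alternative tries to avoid.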
The tool code and tool XML should be written to handle uncompressed data, and Galaxy should handle the details of decompression. This is not hard to do.
It isn't trivial either ;)

Peter