Re: [galaxy-dev] gzipped fastq reader

9 Jul 2013

      On Jul 8, 2013, at 3:33 PM, Peter Cock wrote:
...
On Mon, Jul 8, 2013 at 11:21 PM, Robert Baertsch <rbaertsc@ucsc.edu> wrote:
...
I respectfully disagree. If you want an extensible system, you should
always wrap primitive system-level calls.
Any tool that opens a file that could be compressed would be affected.
That is a huge number of tools. Do you really want a cottage industry of
tools that have different methods of dealing with compression?
But defining a Python helper function within the Galaxy Python
libraries doesn't achieve that.
Are you talking about patching the OS level POSIX open functions
or something?
No.
...
The tools available in Galaxy are written in a range
of languages including C, Perl, R, etc. Yes, some are in Python,
but of those most are independent of Galaxy and can be used
separately from Galaxy.
The helper function would have to be ported to R. We are talking about how Galaxy handles compressed data. Once we decide that, we can determine how best to implement it.
Proposal: Do not treat compressed data as a separate data type. Treat compression as an independent attribute that can be applied to any data. Otherwise you will have to create a gzipped, zip, and bz2 type for every type that you want to compress.

People can use the Python helpers or write their own in other languages.

We need a galaxy_open function to hide details of compression from tool developers.

We could also open HTTP files or pipes without any changes to tools (other than changing open() to galaxy_open()).
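
To make the galaxy_open idea concrete, here is a minimal sketch of such a helper. It is not an existing Galaxy function: the name comes from this thread, and the signature, the magic-byte detection, and the use of Python 3's gzip and bz2 modules are all assumptions made for illustration. It covers local files only; HTTP and pipes are left out.

```python
import bz2
import gzip

# Leading bytes that identify each compressed format.
GZIP_MAGIC = b"\x1f\x8b"  # also matches BGZF, which is a gzip variant
BZIP2_MAGIC = b"BZh"


def galaxy_open(path, mode="r"):
    """Hypothetical helper: open a local file, decompressing on read.

    Assumption: compression is detected from the file's leading bytes,
    not from its extension or from Galaxy metadata.
    """
    # Make "r" mean text everywhere; gzip.open would treat it as binary.
    if "b" not in mode and "t" not in mode:
        mode += "t"
    if "r" not in mode:
        # Writing is passed straight through in this sketch.
        return open(path, mode)
    with open(path, "rb") as handle:
        magic = handle.read(3)
    if magic.startswith(GZIP_MAGIC):
        return gzip.open(path, mode)
    if magic == BZIP2_MAGIC:
        return bz2.open(path, mode)
    return open(path, mode)
```

A Python tool would then change only its open() call, for example `with galaxy_open(fastq_path) as handle:`, and read lines the same way whether or not the input is compressed.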
...
...
Encoding the gzip status in the datatype will create an explosion of
datatypes. Compression is not actually a datatype, it tells you nothing
about the content data that is stored in the file.
What we'd previously discussed was a dual system, holding
the file type as now (e.g. FASTA, SAM, GFF3, etc) and any
compression (e.g., None, normal GZIP, BGZF which is a
GZIP variant, BZIP2, etc).
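
As a rough illustration of the dual system, the sketch below keeps the file type and the compression as two independent attributes. The class and field names are hypothetical and are not part of Galaxy's datatype code; the point is only to compare how many values each approach needs.

```python
from dataclasses import dataclass
from enum import Enum


class Compression(Enum):
    """Compression kinds named in this thread."""
    NONE = "none"
    GZIP = "gzip"
    BGZF = "bgzf"
    BZIP2 = "bzip2"


@dataclass(frozen=True)
class DatasetFormat:
    """Hypothetical record: content type plus independent compression."""
    datatype: str  # e.g. "fasta", "sam", "gff3", "tabular"
    compression: Compression = Compression.NONE


datatypes = ["fasta", "sam", "gff3", "tabular"]

# One combined type per (datatype, compression) pair grows multiplicatively.
combined_types = len(datatypes) * len(Compression)

# Two independent attributes grow additively.
independent_values = len(datatypes) + len(Compression)
```

With these four example datatypes and four compression values, the combined approach needs one type per pair, while the independent approach only adds the four compression values alongside the existing datatypes.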
What about tabular? Should we create tab.gz, tab.bz2 and tab.zip also?

This will quickly get out of hand and create a mess for tool developers who need to support all these types.

The tool code and tool XML should be written to handle uncompressed data, and Galaxy should handle the details of decompression. This is not hard to do.
...
Galaxy tool wrappers currently define input files with a list
of file types - they'd also have to give a list of supported
compression types (defaulting to none). Likewise for any
output files - if they are already compressed the XML for
the tool wrapper would have to tell Galaxy this.
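
One way the framework could act on such a declaration is sketched below: if the tool lists the input's compression as supported, the file is passed through untouched; otherwise it is decompressed first. The function name, the `supported` parameter, and the output path suffix are hypothetical, and the sketch assumes every tool accepts uncompressed input.

```python
import bz2
import gzip
import shutil

# Hypothetical mapping from compression name to a Python opener.
# BGZF is readable as ordinary multi-member gzip.
OPENERS = {"gzip": gzip.open, "bgzf": gzip.open, "bzip2": bz2.open}


def prepare_input(path, compression, supported=("none",)):
    """Return a path the tool can read, decompressing only if needed.

    `supported` stands in for the list of compression types a tool
    wrapper would declare; it defaults to uncompressed only.
    """
    if compression == "none" or compression in supported:
        # The tool can read the file as it is.
        return path
    out_path = path + ".uncompressed"
    with OPENERS[compression](path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

A tool that declares nothing keeps working on compressed datasets because it receives a decompressed copy, while a tool that declares gzip support avoids the extra copy.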
...
It is up to the Galaxy team to provide a standard way to interact
with compressed files.
That is my preference too - although this could be driven by
the Galaxy community rather than the core team? I see
defining new datatypes like 'gzippedfastq' as a stop gap
special case (but a very practical route for now).
...
My proposed solution is a very small change that could
be phased in over time. Any tool that uses open() would not support
compressed files, but it would not break on uncompressed files.
Do others have an opinion?
Either I don't understand your plan, or it would only help in
a tiny minority of cases.
Regards,
Peter