On Jul 8, 2013, at 3:33 PM, Peter Cock wrote:
On Mon, Jul 8, 2013 at 11:21 PM, Robert Baertsch <rbaertsc@ucsc.edu> wrote:
I respectfully disagree. If you want an extensible system, you should always wrap primitive system-level calls.
Any tool that opens a file that could be compressed would be affected. That is a huge number of tools. Do you really want a cottage industry of tools that each have a different method of dealing with compression?
But defining a Python helper function within the Galaxy Python libraries doesn't achieve that.
Are you talking about patching the OS level POSIX open functions or something?
The tools available in Galaxy are written in a range of languages including C, Perl, R, etc. Yes, some are in Python, but of those most are independent of Galaxy and can be used separately from Galaxy.
No. The helper function would have to be ported to R. We are talking about how Galaxy handles compressed data. Once we decide that, we can determine how best to implement it. Proposal: Do not treat compressed data as a separate data type. Treat compression as an independent attribute that can be applied to any data. Otherwise you will have to create a gzip, zip and bz2 type for every type that you want to compress. People can use the Python helpers or write their own in other languages. We need a galaxy_open function to hide the details of compression from tool developers. We could also open HTTP files or pipes without any changes to tools (other than changing open() to galaxy_open()).
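A minimal sketch of what such a helper might look like in Python. Only the name galaxy_open comes from the proposal; the signature, the `text` parameter and the magic-byte sniffing are assumptions made for illustration, not an existing Galaxy API. It covers reading gzip and bzip2 only; zip archives, HTTP and pipes are left out.

```python
import bz2
import gzip

# Leading bytes that identify each compressed format.
_GZIP_MAGIC = b"\x1f\x8b"   # BGZF is a gzip variant and shares these bytes
_BZIP2_MAGIC = b"BZh"


def galaxy_open(path, text=True):
    """Hypothetical helper: open `path` for reading, decompressing if needed.

    The compression is detected from the file's first bytes rather than
    from its extension or datatype, so the caller never has to know.
    """
    mode = "rt" if text else "rb"
    with open(path, "rb") as handle:
        magic = handle.read(3)
    if magic.startswith(_GZIP_MAGIC):
        return gzip.open(path, mode)
    if magic.startswith(_BZIP2_MAGIC):
        return bz2.open(path, mode)
    # Anything else is treated as uncompressed.
    return open(path, mode)
```

A Python tool would then replace `open(path)` with `galaxy_open(path)` and read lines as before. Tools written in C, Perl or R would need an equivalent written in their own language.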
Encoding the gzip status in the datatype will create an explosion of datatypes. Compression is not actually a datatype; it tells you nothing about the content stored in the file.
What we'd previously discussed was a dual system, holding the file type as now (e.g. FASTA, SAM, GFF3, etc) and any compression (e.g., None, normal GZIP, BGZF which is a GZIP variant, BZIP2, etc).
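As a rough illustration of the dual system, the file type and the compression can be held as two independent fields. The class and field names below are hypothetical and do not reflect Galaxy's actual data model; the datatype list is just the examples mentioned in this thread.

```python
from dataclasses import dataclass
from enum import Enum


class Compression(Enum):
    NONE = "none"
    GZIP = "gzip"
    BGZF = "bgzf"    # a gzip variant
    BZIP2 = "bzip2"


@dataclass(frozen=True)
class DatasetFormat:
    """Hypothetical record: content type and compression kept separate."""
    datatype: str                              # e.g. "fasta", "sam", "gff3"
    compression: Compression = Compression.NONE


datatypes = ["fasta", "sam", "gff3", "tabular"]

# One combined datatype per (type, compression) pair...
combined = len(datatypes) * len(Compression)
# ...versus one entry per type plus one per compression scheme.
separate = len(datatypes) + len(Compression)

print(combined, separate)
```

With four file types and four compression states, the combined approach needs 16 datatype definitions and the dual approach needs 8 entries; the gap widens with every type or compression scheme added.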
What about tabular? Should we create tab.gz, tab.bz2 and tab.zip also? This will quickly get out of hand and create a mess for tool developers who need to support all these types. The tool code and tool XML should be written to handle uncompressed data, and Galaxy should handle the details of decompression. This is not hard to do.
Galaxy tool wrappers currently define input files with a list of file types - they'd also have to give a list of supported compression types (defaulting to none). Likewise for any output files - if they are already compressed the XML for the tool wrapper would have to tell Galaxy this.
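The decision Galaxy would have to make for each input can be sketched as below. The function name, the string labels and the return values are all assumptions for illustration; no such attribute exists in the tool XML today.

```python
def plan_input(dataset_compression, supported=("none",)):
    """Hypothetical check run before a tool starts.

    `supported` is the list of compression types the tool wrapper declares
    for an input; it defaults to uncompressed only, so existing wrappers
    would keep working unchanged.
    """
    if dataset_compression in supported:
        return "pass-through"       # hand the file to the tool as stored
    return "decompress-first"       # Galaxy decompresses before the tool runs


# A legacy wrapper that declares nothing still gets usable input.
print(plan_input("gzip"))
# A wrapper that declares gzip support receives the compressed file.
print(plan_input("gzip", supported=("none", "gzip")))
```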
It is up to the Galaxy team to provide a standard way to interact with compressed files.
That is my preference too, although this could be driven by the Galaxy community rather than the core team? I see defining new datatypes like 'gzippedfastq' as a stopgap special case (but a very practical route for now).
My proposed solution is a very small change that could be phased in over time. Any tool that still uses open() would not support compressed files, but it would not break on uncompressed files.
Do others have an opinion?
Either I don't understand your plan, or it would only help in a tiny minority of cases.
Regards,
Peter