Peter and Dan,
I like the idea of replacing all open() with galaxy_open() in all tools. You can tell the format by looking at the first 4 bytes (see C code below from the UCSC browser team). Is there some Pythonic way of overriding open?

You need to read the first four bytes of the file to see whether it is compressed, call gzip.open inside the function, and pass back the handle.

For now, it would require a global sweep through the tools to change open() to galaxy_open(), but it is probably a good idea to have tool developers avoid calling open directly.

You would have to have special handling if there are multiple files in the compressed archive but that support could be added later.

-Robert


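import bz2
import gzip
import zipfile


def galaxy_open(filename, mode="r"):
    # Use the matching decompressor if the header names one, else plain open().
    compressor = getCompressor(filename, mode)
    if compressor is not None:
        return openCompressed(filename, mode, compressor)
    return open(filename, mode)


def getCompressor(filename, mode):
    # Only a file opened for reading has a header to inspect.
    if "r" not in mode:
        return None
    with open(filename, "rb") as handle:
        # getExtensionFromHdrSig: a Python port of the C function below.
        return getExtensionFromHdrSig(handle.read(4))


def openCompressed(filename, mode, ext):
    if ext == "gz":
        return gzip.open(filename, mode)
    elif ext == "bz2":
        return bz2.BZ2File(filename, mode)
    elif ext == "zip":
        return zipfile.ZipFile(filename, mode)
    raise ValueError("unsupported compression type: " + ext)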

char *getExtensionFromHdrSig(char *first4bytes)
/* Check if header has signature of supported compression stream,
   and return the matching extension, or NULL if no sig found. */
{
char *ext=NULL;
if (startsWith("\x1f\x8b",first4bytes)) ext = "gz";
else if (startsWith("\x1f\x9d\x90",first4bytes)) ext = "Z";
else if (startsWith("BZ",first4bytes)) ext = "bz2";
else if (startsWith("PK\x03\x04",first4bytes)) ext = "zip";
return ext;
}
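
For reference, here is a rough Python sketch of the same signature check. The magic numbers are taken from the C code above; the SIGNATURES table and the in-memory demo at the end are my own additions for illustration, and this assumes Python 3, where the header is read as bytes.

```python
import bz2
import gzip

# Magic numbers copied from the C code above, checked in the same order.
SIGNATURES = (
    (b"\x1f\x8b", "gz"),
    (b"\x1f\x9d\x90", "Z"),
    (b"BZ", "bz2"),
    (b"PK\x03\x04", "zip"),
)


def getExtensionFromHdrSig(first4bytes):
    """Return "gz", "Z", "bz2" or "zip" for a known header, else None."""
    for magic, ext in SIGNATURES:
        if first4bytes.startswith(magic):
            return ext
    return None


# Quick check against headers of data compressed in memory.
gz_ext = getExtensionFromHdrSig(gzip.compress(b"ACGT")[:4])
bz2_ext = getExtensionFromHdrSig(bz2.compress(b"ACGT")[:4])
plain_ext = getExtensionFromHdrSig(b"@SEQ")
```

The header has to be read with the file opened in binary mode ("rb"), otherwise the comparison against byte strings will not work.
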
On Jul 8, 2013, at 4:05 AM, Peter Cock wrote:

On Thu, Jul 4, 2013 at 9:49 PM, Robert Baertsch
<robert.baertsch@gmail.com> wrote:
Dan,
Do these readers support gzip files?

      reader = fastqVerboseErrorReader
      reader = fastqReader

Presumably you are writing a Python script using this library?
The answer is a qualified yes. Instead of passing them a normal
file handle using open("example.fastq") you instead use
gzip.open("example.fastq") via import gzip.
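
A minimal sketch of that swap, assuming Python 3 (where text mode has to be requested with "rt"). The count_records function is a hypothetical stand-in for the fastq readers, whose exact interface is not shown here, and the tiny test file is made up so the example is self-contained.

```python
import gzip
import os
import tempfile


def count_records(handle):
    """Hypothetical stand-in for a FASTQ reader: counts four-line records."""
    return sum(1 for _ in handle) // 4


# Write a tiny gzipped FASTQ file so the example can run on its own.
path = os.path.join(tempfile.mkdtemp(), "example.fastq.gz")
with gzip.open(path, "wt") as out:
    out.write("@read1\nACGT\n+\nIIII\n")

# The consumer only sees a file-like handle, so gzip.open can replace open.
with gzip.open(path, "rt") as handle:
    n_records = count_records(handle)
```
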

Do I have to define a special type in galaxy for gzipped files or will the fastq type be ok?


This needs a special file format - but you are not the first person to
look at this, some groups have defined custom gzipped variants of
the FASTQ formats within their own Galaxy instances. I've not
done this but there should be some useful emails in the archive.

Note you'd also need to modify any tool definitions so that they
can accept a gzipped FASTQ file.

Ideally, I would like to keep my files zipped and not have galaxy unzip them, since they triple in size when unzipped.

I'm happy to do a pull request if you don't support this but I want to make sure I'm in line with your roadmap.

Personally I would like a more general system in Galaxy for
potentially any file type to be held compressed in a range of
formats (e.g. using gzip, bgzf, xz, bz2, etc), with exclusions
for things like BAM which are already compressed. This way
naive tools would get the gzipped file uncompressed to a
temporary folder before use (i.e. no change for the tool wrapper),
but if a tool accepts a gzipped file it will get that (less disk IO
and CPU usage, but requires updating tool wrappers).

That idea is quite ambitious though ;)

I have written a simple tool to convert Illumina fastq to mapsplice fastq. Does that already exist somewhere?


I don't know.

Peter