Storing compressed data files

4 Jun 2008

Hello,

As users add more and more data files to our Galaxy server, disk space 
is becoming a problem.
What I'd ultimately like is to store the data files (or at least some of 
the text files) in compressed form. How would you suggest doing that?

A common scenario is:
1. User uploads a big FASTQ/Solexa file (=> 1.2 GB)
2. FASTQ file is converted to a FASTA file (=> 0.6 GB)
3. FASTA file trimmed, clipped, stripped, etc. (=> 100 MB)
4. BLAT, Histograms and other reports (=> ~50 MB)

The first three datasets take about 1.9 GB of disk space and aren't 
really needed by the user, who is mostly interested in the resulting 
report files. Since these are text files, they compress very well.

Currently, I store the FASTQ files gzip'ed in Galaxy, and my tools know 
how to read gzip'ed data.

There are two shortcomings with this method:
1. Datasets (green squares) of gzip'ed files don't display any data in 
the peek window.
2. Other Galaxy tools which require a FASTQ file as input can't read my file.

Perl has an I/O module (PerlIO::gzip) which makes reading gzip'ed files 
transparent to the rest of the program. I think Python has something 
very similar, the gzip module (http://www.python.org/doc/lib/module-gzip.html).
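
To illustrate what I mean by "transparent", here is a minimal Python sketch, not a definitive implementation. The helper name open_text is hypothetical (it is not an existing Galaxy function), and the sketch assumes a file can be treated as gzip'ed whenever it starts with the two gzip magic bytes (0x1f 0x8b):

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def open_text(path):
    """Open `path` for reading text, decompressing on the fly if it is gzip'ed.

    Hypothetical helper for illustration; not part of Galaxy.
    """
    # Probe the first two bytes instead of trusting the file extension.
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        # "rt" makes gzip.open return a text-mode file object.
        return gzip.open(path, "rt")
    return open(path, "r")
```

A tool that opens its input through something like open_text(path) would then loop over lines the same way for plain and gzip'ed FASTQ files.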

If it's not too much to ask, would it be possible to add support for 
reading gzip'ed files? At least in the peek/preview window?
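
For the peek window, something along these lines might be enough. This is only a sketch: peek_gzip and its max_lines parameter are hypothetical names, and I am not assuming anything about how Galaxy currently generates the peek text:

```python
import gzip
from itertools import islice


def peek_gzip(path, max_lines=5):
    """Return up to `max_lines` lines from the start of a gzip'ed text file.

    Hypothetical helper for illustration; not part of Galaxy.
    """
    # The file is decompressed as a stream, so only the beginning of a
    # large dataset has to be read to produce the preview.
    with gzip.open(path, "rt", errors="replace") as handle:
        return "".join(islice(handle, max_lines))
```

Because only the first few lines are read, the cost of the preview should not grow with the size of the dataset.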

Comments are welcome,

Thanks,
     Gordon.

Assaf Gordon

Greg Von Kuster
