Hello Assaf,

Assaf Gordon wrote:
Hello,
As users add more and more data files to our Galaxy server, disk space becomes a problem... What I'd ultimately like is to store the data files in some compressed manner (at least some of the textual files). How would you suggest doing that?
A common scenario is:
1. User uploads a big FASTQ/Solexa file (=> 1.2 GB)
2. FASTQ file converted to FASTA file (=> 0.6 GB)
3. FASTA file trimmed, clipped, stripped, etc. (=> 100 MB)
4. BLAT, histograms and other reports (=> ~50 MB)
The first three data sets take about 1.9 GB of disk space - and aren't really needed by the user (as he/she is mostly interested in the resulting report files). Since these are textual files, they compress really well.
See the http://g2.trac.bx.psu.edu/wiki/PurgeHistoriesAndDatasets wiki. You may find it useful to run the cleanup_datasets scripts from cron to remove "deleted" datasets from disk after a configured number of days. Let me know if you have any questions about this process.
Currently, I store the FASTQ gzip'ed in galaxy, and my tools know how to read gzip'ed data.
There are two shortcomings with this method:
1. Datasets (green squares) of gzip'ed files don't display any data in the peek window.
This can be corrected fairly easily. Just add "gzip" as a new data type (see http://g2.trac.bx.psu.edu/wiki/AddingDatatypes). In your "GZIP" class, include a "display_peek()" method or a "make_html_table()" method that will display what you want for the "gzip" data type. A close example of what you may need is available in the Binseq() class in ~/lib/galaxy/datatypes/images.py.
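To give a rough idea of the logic such a peek method could wrap, here is a minimal sketch in plain Python. It is not Galaxy code: the helper name gzip_peek and its parameters are made up for illustration, and it only relies on the standard gzip module. How the result gets wired into display_peek() depends on the datatype class you write.

```python
import gzip
from itertools import islice


def gzip_peek(path, max_lines=5, max_width=80):
    """Return the first few lines of a gzip'ed text file as one string.

    Hypothetical helper (not part of Galaxy): only the start of the
    file is decompressed, so it stays cheap for large datasets.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        # islice stops after max_lines, so the rest is never read.
        lines = [line.rstrip("\n")[:max_width] for line in islice(handle, max_lines)]
    return "\n".join(lines)
```

Calling gzip_peek("reads.fastq.gz") (the file name is only an example) would return the first five lines, each cut to 80 characters, which is roughly what the peek window shows for uncompressed text.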
2. Other Galaxy tools that require a FASTQ file as input can't read my file.
Perl has an I/O module (PerlIO::gzip) which makes reading gzipped files transparent to the rest of the program. I think python has something very similar (http://www.python.org/doc/lib/module-gzip.html).
If it's not too much to ask, would it be possible to add support for reading gzip'ed files ? At least in the peek/preview window ?
We'll certainly take this under consideration. Galaxy currently does support retrieving compressed files from external data sources (UCSC) as well as uploading them via the upload utility. However, they are currently decompressed on the fly. Allowing them to remain compressed would require each tool to decompress its own input - we'll see if this makes sense for some tools.
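For tools you maintain yourself, the tool-side decompression could look something like the sketch below. It assumes Python and the standard gzip module; open_text and count_fastq_records are hypothetical names used only for this example, and the check relies on the two-byte gzip magic number (0x1f 0x8b) rather than on the file extension.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_text(path):
    """Open a file for text reading, decompressing it if it is gzip'ed.

    Hypothetical helper: the caller gets a normal text file object
    either way, so the rest of the tool does not need to care.
    """
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt")
    return open(path, "r")


def count_fastq_records(path):
    """Example consumer: count records, assuming 4 lines per FASTQ record."""
    with open_text(path) as handle:
        return sum(1 for _ in handle) // 4
```

A tool written this way accepts both plain and gzip'ed FASTQ files without any change to its command line, which is similar in spirit to what PerlIO::gzip offers on the Perl side.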
Comments are welcomed,
Thanks,
Gordon.
_______________________________________________
galaxy-user mailing list
galaxy-user@bx.psu.edu
http://mail.bx.psu.edu/cgi-bin/mailman/listinfo/galaxy-user