Storing/Peeking/Downloading compressed files
Hello,

I'd like to request/suggest a feature: semi-transparent support for compressed files. The feature requires four (tiny) patches, detailed below.

With this feature, dataset files (/database/files/NNN/dataset_NNNN.dat) can be stored compressed, and their content will still be 'peeked' automatically in the preview window. Additionally, when a user clicks 'save' or the 'eye' icon, the file is uncompressed on the fly, so the user doesn't need to know or care that it is stored compressed.

Of course, there's the whole issue of making the different tools read and write compressed files, but that's another story. It's actually not too complicated a story:
In Python, just call gzip.open instead of open.
In shell scripts, pipe the input file into the program: "zcat -f FILE | program".
In Perl, use the PerlIO::gzip module.

Comments are welcome,
Regards,
Gordon.

First Patch - Adding a function to the "util" module, which returns a Gzip/Bzip2/Zip file object (or a plain file object) based on the file type. File type detection is done using the 'magic' module. I think it is quite standard (in Ubuntu I got it with "apt-get install python-magic"). However, to get Galaxy to find this module I had to remove the "-ES" from "run.sh". I'm sure there's a better way to do it.
====================================================================
--- ./lib/galaxy/util/__init__.orig.py  2008-12-26 23:48:40.000000000 -0500
+++ ./lib/galaxy/util/__init__.py  2008-12-27 00:31:44.000000000 -0500
@@ -14,11 +14,41 @@ from galaxy.util.docutils_ext.htmlfrag i
 pkg_resources.require( 'elementtree' )
 from elementtree import ElementTree
 
+import magic # file detection
+import gzip # allow peeking into compressed files
+import bz2
+import zipfile
+
 log = logging.getLogger(__name__)
 _lock = threading.RLock()
 
 gzip_magic = '\037\213'
 
+# Magic file detection
+magic_file = magic.open(magic.MAGIC_MIME)
+try:
+    magic_file.load()
+except:
+    magic_file = None
+
+def open_file_wrapper(filename):
+    file_mime = ""
+    if magic_file is not None:
+        try:
+            file_mime = magic_file.file(filename)
+        except:
+            file_mime = ""
+    if file_mime == "application/x-gzip":
+        return gzip.open(filename)
+    if file_mime == "application/x-bzip2":
+        return bz2.BZ2File(filename)
+    if file_mime == "application/x-zip":
+        return zipfile.ZipFile(filename)
+
+    #for all other mime types, return the raw file
+    return file(filename)
+
+
 def synchronized(func):
     """This wrapper will serialize access to 'func' to a single thread. Use it as a decorator."""
     def caller(*params, **kparams):
====================================================================

Second Patch - In the 'display' action of the root web controller, return the file with the appropriate wrapper:
====================================================================
--- ./lib/galaxy/web/controllers/root_orig.py  2008-12-26 23:56:01.000000000 -0500
+++ ./lib/galaxy/web/controllers/root.py  2008-12-27 00:35:43.000000000 -0500
@@ -153,7 +153,7 @@ class RootController( BaseController ):
                 m1 = trans.app.memory_usage.memory( m0, pretty=True )
                 log.info( "End of root/display, memory used increased by %s" % m1 )
             try:
-                return open( data.file_name )
+                return util.open_file_wrapper( data.file_name )
             except:
                 return "This dataset contains no content"
         else:
====================================================================

Third Patch - In the WebApplication object, allow streaming of compressed files (not just types.FileType):
====================================================================
--- ./lib/galaxy/web/framework/base_orig.py  2008-12-27 00:41:38.000000000 -0500
+++ ./lib/galaxy/web/framework/base.py  2008-12-27 00:41:37.000000000 -0500
@@ -25,6 +25,11 @@ from paste.response import HeaderDict
 # For FieldStorage
 import cgi
 
+# For auto-decompressing files
+import gzip
+import bz2
+import zipfile
+
 log = logging.getLogger( __name__ )
 
 class WebApplication( object ):
@@ -133,7 +138,7 @@ class WebApplication( object ):
         if callable( body ):
             # Assume the callable is another WSGI application to run
             return body( environ, start_response )
-        elif isinstance( body, types.FileType ):
+        elif isinstance( body, (types.FileType, gzip.GzipFile, bz2.BZ2File, zipfile.ZipFile) ):
             # Stream the file back to the browser
             return send_file( start_response, trans, body )
         else:
====================================================================

Fourth Patch - In the generic Data datatype object, replace the file object with a compressed file object in the peek function:
====================================================================
--- ./lib/galaxy/datatypes/data_orig.py  2008-12-26 23:21:41.000000000 -0500
+++ ./lib/galaxy/datatypes/data.py  2008-12-26 23:34:15.000000000 -0500
@@ -332,7 +332,7 @@ def get_file_peek( file_name, WIDTH=256,
     count = 0
     file_type = ''
     data_checked = False
-    for line in file( file_name ):
+    for line in util.open_file_wrapper( file_name ):
         line = line[ :WIDTH ]
         if not data_checked and line:
             data_checked = True
====================================================================
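
For tools that need to read such datasets themselves, the "call gzip.open instead of open" advice can be combined with a check of the file's leading bytes, so that one call handles plain, gzip and bzip2 files. The following is only a minimal sketch under stated assumptions: it targets modern Python 3 and uses the standard library alone, swapping the 'magic' module for a magic-byte comparison. The names open_maybe_compressed and peek_lines are hypothetical and are not part of Galaxy. Zip archives are left out, because zipfile.ZipFile reads archive members and cannot be iterated line by line like a file.

```python
import bz2
import gzip

# Leading bytes that identify each format. The gzip value is the same as the
# gzip_magic constant ('\037\213') in galaxy.util; "BZh" is the standard
# bzip2 stream header.
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"


def open_maybe_compressed(path):
    """Return a text-mode file object, decompressing gzip/bzip2 on the fly.

    Hypothetical helper for illustration only; not a Galaxy function.
    """
    with open(path, "rb") as raw:
        head = raw.read(3)
    if head.startswith(GZIP_MAGIC):
        return gzip.open(path, "rt")
    if head.startswith(BZIP2_MAGIC):
        return bz2.open(path, "rt")
    # Anything else is treated as a plain, uncompressed file.
    return open(path, "r")


def peek_lines(path, width=256, max_lines=5):
    """Return up to max_lines lines, each truncated to width characters."""
    lines = []
    with open_maybe_compressed(path) as handle:
        for line in handle:
            lines.append(line.rstrip("\n")[:width])
            if len(lines) >= max_lines:
                break
    return lines
```

With this sketch, peek_lines should return the same lines for a compressed dataset as for its uncompressed original, which is the behaviour the preview window needs.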
participants (1)
-
Assaf Gordon