How to retain files compressed
Hi all!

I want to load an R workspace (.rdat file, R-Project) within a Galaxy module and therefore built a Galaxy .rdat datatype (binary). .rdat files are gzipped and are only recognized by R if they are still compressed. However, the corresponding .dat file is an uncompressed version of the original .rdat file, as I found out using a hex editor. I couldn't find any documentation on how to change this behaviour, nor answers to similar questions on this list.

I would be happy for any answer that points me in the right direction.

Details
#####

datatypes_conf.xml:
-----------------------------------
<?xml version="1.0"?>
<datatypes>
    <registration converters_path="lib/galaxy/datatypes/converters" display_path="display_applications">
        [...]
        <datatype extension="rdat" type="galaxy.datatypes.binary:Rdat" mimetype="application/octet-stream" display_in_upload="true"/>
        [...]
    </registration>
    <sniffers>
        [...]
        <sniffer type="galaxy.datatypes.binary:Rdat"/>
        [...]
    </sniffers>
</datatypes>

binary.py:
------------------
[...]
class Rdat( Binary ):
    """Class describing an rdat binary file (R-workspace)"""
    file_ext = "rdat"
    #MetadataElement( name="Rdat", desc="R-workspace", param=metadata.FileParameter, readonly=True, no_value=None, visible=False, optional=True )

    """
    def __init__( self, **kwd ):
        Binary.__init__( self, **kwd )
        self._name = "Rdat"
    """

    def set_peek( self, dataset, is_multi_byte=False ):
        if not dataset.dataset.purged:
            dataset.peek = "Binary rdat file (R-workspace)"
            dataset.blurb = data.nice_size( dataset.get_size() )
        else:
            dataset.peek = 'file does not exist'
            dataset.blurb = 'file purged from disk'

    def display_peek( self, dataset ):
        try:
            return dataset.peek
        except:
            return "Binary rdat file (%s)" % ( data.nice_size( dataset.get_size() ) )

    def get_mime( self ):
        """Returns the mime type of the datatype"""
        return 'application/octet-stream'

    def sniff( self, filename ):
        # rdat is compressed in the gzip format, and must not be uncompressed in Galaxy.
        # The first 4 bytes of any rdat file are RDX2
        try:
            # gzipped file: check the first 4 decompressed bytes
            header = gzip.open( filename ).read(4)
            if binascii.b2a_hex( header ) == binascii.hexlify( 'RDX2' ):  # RDX2 signature found
                return True
        except:
            # not readable as gzip; fall through to the raw check below
            pass
        try:
            # uncompressed file: check the first 4 raw bytes
            header = open( filename ).read(4)
            if binascii.b2a_hex( header ) == binascii.hexlify( 'RDX2' ):  # RDX2 signature found
                return True
            return False
        except:
            return False

--
Dr. Christian Hundsrucker
Institute for Functional Genomics
Computational Diagnostics Group
University of Regensburg
Josef Engertstr. 9
93053 Regensburg, Germany
Christian Hundsrucker wrote:
Hi all!
I want to load an R workspace (.rdat file, R-Project) within a Galaxy module and therefore built a Galaxy .rdat datatype (binary). .rdat files are gzipped and are only recognized by R if they are still compressed. However, the corresponding .dat file is an uncompressed version of the original .rdat file, as I found out using a hex editor. I couldn't find any documentation on how to change this behaviour, nor answers to similar questions on this list.
I would be happy for any answer that points me in the right direction.
Hi Christian,

Have a look at the upload tool, tools/data_source/upload.py, which is where the decompression occurs. There's a spot where we bypass decompression for certain formats like BAM, and rdat would need to be handled the same way. Sorry it's a bit of a hack; eventually the goal is to make it more pluggable, but this is the solution for now.

--nate
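For readers who want to see the idea in isolation, here is a minimal sketch of such a bypass. It is not Galaxy's actual upload.py code: the function names (is_gzipped, is_gzipped_r_workspace, store_upload) and the return values are hypothetical, and the only facts taken from the thread are that .rdat files are gzipped and that their decompressed content starts with RDX2. It is written against the Python 3 standard library, so it would need adapting to whatever Python version and upload logic your Galaxy instance uses.

```python
import gzip
import shutil
import zlib

GZIP_MAGIC = b"\x1f\x8b"      # first two bytes of any gzip stream
R_WORKSPACE_MAGIC = b"RDX2"   # signature named in the original post


def is_gzipped(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def is_gzipped_r_workspace(path):
    """Return True if the file is gzipped and its decompressed content starts with RDX2."""
    if not is_gzipped(path):
        return False
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == R_WORKSPACE_MAGIC
    except (OSError, EOFError, zlib.error):
        # Corrupt or truncated gzip data: do not treat it as an R workspace.
        return False


def store_upload(src, dst):
    """Hypothetical upload step: decompress gzip uploads unless they are R workspaces.

    Returns "decompressed" if the content was gunzipped into dst,
    or "copied" if the bytes were stored unchanged.
    """
    if is_gzipped(src) and not is_gzipped_r_workspace(src):
        with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        return "decompressed"
    # Plain files and gzipped R workspaces are stored byte for byte.
    shutil.copyfile(src, dst)
    return "copied"
```

With this shape, the format-specific check sits in one place (is_gzipped_r_workspace), so further formats that must stay compressed could be added next to it without touching the copy logic.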
participants (2)
- Christian Hundsrucker
- Nate Coraor