.gz extension gets stripped off when uploading via data library Upload via filesystem paths but not via Get Data -> Upload File
Hello,

We've created a new binary datatype for .fastq.gz files, following the same methodology as for BAM files, since we don't want our .fastq.gz files to be gunzipped. I added the appropriate code in upload.py to make sure of this, and with the new datatype and extension our files are no longer gunzipped. But when we upload it into a data library via the data library "Upload via filesystem paths" it for some reason automatically strips the .gz part out. When we take the same .fastq.gz file and upload it via Get Data -> Upload File it works fine; nothing is stripped from the file name. Where is it doing this, and how can we prevent it from stripping the .gz via the data library menus?

thanks,
Leandro
On Fri, Jan 20, 2012 at 12:42 PM, Leandro Hermida <softdev@leandrohermida.com> wrote:
> But when we upload it into a data library via the data library "Upload via filesystem paths" it for some reason automatically strips the .gz part out.

I thought Galaxy would usually try to replace the extension with *.dat for any file type when uploaded?

Peter

P.S. Is anyone working on the more general solution of supporting a gzipped version of (almost) any Galaxy datatype? https://bitbucket.org/galaxy/galaxy-central/issue/666/
Hi Peter,

Sorry I wasn't clear. The .gz gets stripped from the name in the Galaxy UI when you upload the files into a data library via the manage data libraries form. When you upload via Get Data -> Upload File the .gz is preserved, which is what one would want: the file is not gunzipped, because I gave it its own new datatype and extension and changed binary.py so that it doesn't fall through to the elif branch where it tries to unzip stuff.

On Fri, Jan 20, 2012 at 2:25 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
> I thought Galaxy would usually try to replace the extension with *.dat for any file type when uploaded?
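The idea described above, exempting a registered "stay compressed" extension so that it never reaches the generic gunzip branch, can be sketched roughly as follows. This is not Galaxy's upload.py or binary.py; `KEEP_COMPRESSED_EXTS`, `is_gzip` and `should_gunzip` are hypothetical names used only for illustration.

```python
# Hypothetical: extensions whose datatypes should stay compressed.
KEEP_COMPRESSED_EXTS = ('.fastq.gz',)

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b'\x1f\x8b'


def is_gzip(path):
    """Return True if the file starts with the two-byte gzip magic number."""
    with open(path, 'rb') as handle:
        return handle.read(2) == GZIP_MAGIC


def should_gunzip(path, name):
    """Decide whether an uploaded file should be decompressed.

    Names matching a keep-compressed extension are left alone, so they
    never fall through to the generic gunzip step.
    """
    if name.lower().endswith(KEEP_COMPRESSED_EXTS):
        return False
    return is_gzip(path)
```

The check on the name comes first on purpose: a .fastq.gz file is still a valid gzip stream, so sniffing the content alone could not tell it apart from a file that should be unpacked.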
Hello Leandro,

I believe this behavior is due to the make_library_uploaded_dataset() method in the ~/lib/galaxy/web/controllers/library_common controller. The current method looks like this:

    def make_library_uploaded_dataset( self, trans, cntrller, params, name, path, type, library_bunch, in_folder=None ):
        library_bunch.replace_dataset = None # not valid for these types of upload
        uploaded_dataset = util.bunch.Bunch()
        # Remove compressed file extensions, if any
        new_name = name
        if new_name.endswith( '.gz' ):
            new_name = new_name.rstrip( '.gz' )
        elif new_name.endswith( '.zip' ):
            new_name = new_name.rstrip( '.zip' )
        uploaded_dataset.name = new_name
        uploaded_dataset.path = path
        uploaded_dataset.type = type
        uploaded_dataset.ext = None
        uploaded_dataset.file_type = params.file_type
        uploaded_dataset.dbkey = params.dbkey
        uploaded_dataset.space_to_tab = params.space_to_tab
        if in_folder:
            uploaded_dataset.in_folder = in_folder
        uploaded_dataset.data = upload_common.new_upload( trans, cntrller, uploaded_dataset, library_bunch )
        link_data_only = params.get( 'link_data_only', 'copy_files' )
        uploaded_dataset.link_data_only = link_data_only
        if link_data_only == 'link_to_files':
            uploaded_dataset.data.file_name = os.path.abspath( path )
            # Since we are not copying the file into Galaxy's managed
            # default file location, the dataset should never be purgable.
            uploaded_dataset.data.dataset.purgable = False
            trans.sa_session.add_all( ( uploaded_dataset.data, uploaded_dataset.data.dataset ) )
            trans.sa_session.flush()
        return uploaded_dataset

Here are the code changes that I believe will resolve the issue. However, I have not tested this, so if you wouldn't mind letting me know if this works for you, I'll commit the changes to the central repo.
    def make_library_uploaded_dataset( self, trans, cntrller, params, name, path, type, library_bunch, in_folder=None ):
        link_data_only = params.get( 'link_data_only', 'copy_files' )
        library_bunch.replace_dataset = None # not valid for these types of upload
        uploaded_dataset = util.bunch.Bunch()
        new_name = name
        # Remove compressed file extensions, if any, but only if
        # we're copying files into Galaxy's file space.
        if link_data_only == 'copy_files':
            if new_name.endswith( '.gz' ):
                new_name = new_name.rstrip( '.gz' )
            elif new_name.endswith( '.zip' ):
                new_name = new_name.rstrip( '.zip' )
        uploaded_dataset.name = new_name
        uploaded_dataset.path = path
        uploaded_dataset.type = type
        uploaded_dataset.ext = None
        uploaded_dataset.file_type = params.file_type
        uploaded_dataset.dbkey = params.dbkey
        uploaded_dataset.space_to_tab = params.space_to_tab
        if in_folder:
            uploaded_dataset.in_folder = in_folder
        uploaded_dataset.data = upload_common.new_upload( trans, cntrller, uploaded_dataset, library_bunch )
        uploaded_dataset.link_data_only = link_data_only
        if link_data_only == 'link_to_files':
            uploaded_dataset.data.file_name = os.path.abspath( path )
            # Since we are not copying the file into Galaxy's managed
            # default file location, the dataset should never be purgable.
            uploaded_dataset.data.dataset.purgable = False
            trans.sa_session.add_all( ( uploaded_dataset.data, uploaded_dataset.data.dataset ) )
            trans.sa_session.flush()
        return uploaded_dataset

Thanks!

Greg
Greg Von Kuster Galaxy Development Team greg@bx.psu.edu
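A side note on the method quoted in this message: Python's `str.rstrip` takes a set of characters, not a suffix, so `rstrip( '.gz' )` removes every trailing `.`, `g` or `z`. It happens to give the expected result for a name like `reads.fastq.gz`, but not for every name. Below is a minimal sketch of a suffix-safe variant that keeps the same "only when copying files" condition; `strip_compressed_ext` and `COMPRESSED_EXTS` are hypothetical names, not part of Galaxy.

```python
COMPRESSED_EXTS = ('.gz', '.zip')


def strip_compressed_ext(name, link_data_only='copy_files'):
    """Drop one trailing .gz/.zip, but only when files are copied into Galaxy."""
    if link_data_only != 'copy_files':
        # Linked files keep their on-disk name untouched.
        return name
    for ext in COMPRESSED_EXTS:
        if name.endswith(ext):
            # Slice off exactly the suffix instead of a character set.
            return name[:-len(ext)]
    return name


# str.rstrip strips characters, not a suffix:
print('reads.fastq.gz'.rstrip('.gz'))    # reads.fastq
print('zigzag.gz'.rstrip('.gz'))         # zigza
print(strip_compressed_ext('zigzag.gz'))  # zigzag
print(strip_compressed_ext('reads.fastq.gz', 'link_to_files'))  # reads.fastq.gz
```

Slicing by `len(ext)` removes exactly one extension and nothing else, which is the behaviour the original comment ("Remove compressed file extensions, if any") appears to intend.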
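Hi Greg,

OK, this code change to library_common.py works. Now when you use the data libraries menu to bring in .fastq.gz files, it no longer cuts off the .gz. Thank you!

best,
Leandro

On Fri, Jan 20, 2012 at 3:32 PM, Greg Von Kuster <greg@bx.psu.edu> wrote:
> Here are the code changes that I believe will resolve the issue. However, I have not tested this, so if you wouldn't mind letting me know if this works for you, I'll commit the changes to the central repo.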
participants (3)
- Greg Von Kuster
- Leandro Hermida
- Peter Cock