Re: [galaxy-dev] uploading binary files checksum changes, Galaxy doing something to file?
Hi everyone,

I am experiencing a similar problem to that of Leandro. I have defined a new datatype deriving Data, and having extension "foobar". The simplest and first thing I'd like to do with it is create a dataset of this new type "Foobar" by uploading from my local machine to my Galaxy development server. When I do the upload, I specify the file format as "foobar" rather than "Auto-detect". The file itself is a zip archive containing a folder containing other files, but the extension is ".foobar" and not ".zip". I encounter at least 2 (I believe separate) problems while trying to upload.

1. I can see that on upload the first non-directory entry in the ".foobar" file is being extracted and replacing the original. It seems this is how *unknown* zip archives are "supposed" to be handled by Galaxy, but as to why that is, I haven't a clue.
2. (This is what's similar to Leandro's case.) The dataset_XXX.dat file produced by the upload is not actually the same as the file it is "copied" from in the uploaded archive. The checksums are different, and the sizes are different (the dataset is 1 byte longer).

I did a diff of the hexdumps of the dataset file and the corresponding file from the uploaded archive, and discovered the following:

1. Every occurrence of '^M' (aka '\r' aka 0x0D) has been replaced with '\n' (aka 0x0A).
2. A newline ('\n' aka 0x0A) is added to the end of the file.

This file from the archive is not a text file, it is binary, so any code in Galaxy that tries to fix line endings shouldn't be doing this. (Where) Are there such places?

Leandro, have you solved your problem? If not, what do you see when you do this kind of comparison?

I am unable to reproduce those changes by stepping through the code in upload.py which handles zip files (and replaces them by their first file member) using the same python installation used for running Galaxy. This suggests the problem is elsewhere.

Does anyone know why this '\r' -> '\n' mapping is affecting this file?

Does anyone know why the default behaviour for uploading zip archives is to keep one file arbitrarily and throw out the rest? Even with an argument in favor of this behaviour, why is there not a "unzippable_file_formats" list for exceptions to be made like there is for sniffing?

Any enlightenment on these matters would be greatly appreciated.

Best,
Eric

________________________________________
From: Leandro Hermida [softdev@leandrohermida.com]
Sent: Friday, September 16, 2011 10:03 AM
To: Paniagua, Eric
Subject: Re: [galaxy-dev] uploading binary files checksum changes, Galaxy doing something to file?

Hi Eric,

On Fri, Sep 16, 2011 at 3:58 PM, Paniagua, Eric <epaniagu@cshl.edu> wrote:
Hi Leandro,
Is there an entry in your history for the upload? What file format does it show? Is there any chance your original file was zipped? If Galaxy detected it as a zip file on upload, it may have unzipped it and taken the first file in it as the dataset.
Yes, there is a history entry for the upload. The format it shows is the new datatype I created (in datatypes_conf.xml, subclassing Binary), which I selected in the drop-down menu before uploading the file in the Get Data form. It is not a zip file.
That's at least the version of your problem that I've run into before. Specifying the file format manually (rather than choosing Auto-detect) may help if it's a similar problem. I suspect the correct solution is to write a sniffer for your datatype to help ensure it is identified correctly by Galaxy, but I haven't tried this yet.
Essentially the basic question is, how do you tell Galaxy not to do or touch absolutely *anything* with an uploaded binary file??? The checksums should always match.
Best of luck,
Eric

________________________________________
From: galaxy-dev-bounces@lists.bx.psu.edu [galaxy-dev-bounces@lists.bx.psu.edu] on behalf of Leandro Hermida [softdev@leandrohermida.com]
Sent: Friday, September 16, 2011 9:42 AM
To: Galaxy Dev
Subject: [galaxy-dev] uploading binary files checksum changes, Galaxy doing something to file?
Hi all,
We tried to find something in the docs and on the mailing list, with no luck. We created a new datatype that is a straight subclass of Binary, and when we upload such a file in the Galaxy UI and compare the checksums of the original file and the file located in the Galaxy database/files/... directory, they are different!
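One way to make this comparison concrete is to hash both files in binary mode and compare the digests together with the byte sizes. The following is a minimal sketch and not part of Galaxy; the function name `file_digest_and_size` is hypothetical, and the choice of SHA-256 is an assumption (any hash would do).

```python
import hashlib
import os


def file_digest_and_size(path, chunk_size=1 << 20):
    """Return (sha256 hex digest, size in bytes) for the file at path."""
    digest = hashlib.sha256()
    # Open in binary mode so no newline translation happens while reading.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)
```

Calling it on the original file and on the stored dataset file and comparing the two tuples shows whether the content was altered and by how many bytes the size changed.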
What are we doing wrong? We simply want Galaxy to upload the file and not touch it at all.

regards,
Leandro

___________________________________________________________
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
Paniagua, Eric wrote:
Hi everyone,
I am experiencing a similar problem to that of Leandro. I have defined a new datatype deriving Data, and having extension "foobar". The simplest and first thing I'd like to do with it is create a dataset of this new type "Foobar" by uploading from my local machine to my Galaxy development server. When I do the upload, I specify the file format as "foobar" rather than "Auto-detect". The file itself is a zip archive containing a folder containing other files, but the extension is ".foobar" and not ".zip". I encounter at least 2 (I believe separate) problems while trying to upload.
1. I can see that on upload the first non-directory entry in the ".foobar" file is being extracted and replacing the original. It seems this is how *unknown* zip archives are "supposed" to be handled by Galaxy, but as to why that is, I haven't a clue.
2. (This is what's similar to Leandro's case.) The dataset_XXX.dat file produced by the upload is not actually the same as the file it is "copied" from in the uploaded archive. The checksums are different, and the sizes are different (the dataset is 1 byte longer).
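On the first problem, renaming an archive does not hide it from content-based detection: Python's `zipfile.is_zipfile` inspects the file's contents, not its extension. The sketch below only illustrates that point and what "first non-directory entry" means; it is not Galaxy's upload code, and `first_file_member` is a hypothetical helper.

```python
import io
import zipfile


def first_file_member(archive):
    """Return the name of the first non-directory entry, or None."""
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if not info.is_dir():
                return info.filename
    return None


# Build a small in-memory archive containing a folder and two files.
# Nothing here depends on a ".zip" extension.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("folder/", "")
    zf.writestr("folder/a.dat", b"\x00\r\x01")
    zf.writestr("folder/b.dat", b"second")

buffer.seek(0)
print(zipfile.is_zipfile(buffer))
buffer.seek(0)
print(first_file_member(buffer))
```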
I did a diff of the hexdumps of the dataset file and the corresponding file from the uploaded archive, and discovered the following:
1. Every occurrence of '^M' (aka '\r' aka 0x0D) has been replaced with '\n' (aka 0x0A).
2. A newline ('\n' aka 0x0A) is added to the end of the file.
This file from the archive is not a text file, it is binary, so any code in Galaxy that tries to fix line endings shouldn't be doing this. (Where) Are there such places?
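The two changes listed above are what a line-ending normalizer would produce if it were applied to binary data. The sketch below is an illustration only, not Galaxy's actual code; the helper `normalize_newlines` is hypothetical and simply mimics the observed effect (each 0x0D becomes 0x0A, and one 0x0A is appended).

```python
def normalize_newlines(data: bytes) -> bytes:
    """Mimic the observed effect: replace each CR with LF and append one LF."""
    return data.replace(b"\r", b"\n") + b"\n"


# Arbitrary binary payload that happens to contain CR bytes.
original = b"\x89PNG\r\n\x1a\n\x00\r\x01"
altered = normalize_newlines(original)

# Compare lengths and check whether any CR byte survived.
print(len(original), len(altered))
print(b"\r" in altered)
```

A transformation like this changes the checksum of any binary file that contains 0x0D bytes, and the appended byte accounts for a size difference of exactly one.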
Leandro, have you solved your problem? If not, what do you see when you do this kind of comparison?
I am unable to reproduce those changes by stepping through the code in upload.py which handles zip files (and replaces them by their first file member) using the same python installation used for running Galaxy. This suggests the problem is elsewhere.
Does anyone know why this '\r' -> '\n' mapping is affecting this file?
Does anyone know why the default behaviour for uploading zip archives is to keep one file arbitrarily and throw out the rest? Even with an argument in favor of this behaviour, why is there not a "unzippable_file_formats" list for exceptions to be made like there is for sniffing?
Any enlightenment on these matters would be greatly appreciated.
Hi Eric,

It's happening in tools/data_source/upload.py, line 286:

    line_count, converted_path = sniff.convert_newlines( dataset.path, in_place=in_place )

This should really be checking for any datatype subclassed from binary, not binary itself. As a quick workaround, add a check for your datatype in the if/elif blocks around line 130 to avoid the default processing.

The upload tool needs to be rewritten to fix this, but it will be a while before this is done.

--nate
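The distinction described here, matching any subclass of the binary datatype rather than only the base class itself, can be illustrated with a minimal sketch. The classes below are stand-ins, not Galaxy's real datatype classes, and both helper functions are hypothetical; the actual if/elif logic in upload.py is not reproduced.

```python
class Data:
    """Stand-in for a generic datatype base class."""


class Binary(Data):
    """Stand-in for the binary datatype."""


class Foobar(Binary):
    """Stand-in for a custom datatype that subclasses Binary."""


def exact_type_check(datatype):
    # Matches only Binary itself, so a subclass such as Foobar is missed
    # and would fall through to the default (text) processing.
    return type(datatype) is Binary


def skips_newline_conversion(datatype):
    # isinstance also matches subclasses, so Foobar is treated as binary.
    return isinstance(datatype, Binary)
```

With these stand-ins, `exact_type_check(Foobar())` is false while `skips_newline_conversion(Foobar())` is true, which is the behavioural difference between the two kinds of check.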
Participants (2):
- Nate Coraor
- Paniagua, Eric