FastQC wrapper not seeing files as gzipped
Hi all - I've got a bunch of FASTQ files uploaded into a data library in Galaxy. The underlying files are gzipped; however, Galaxy strips the .gz from the filename and displays it as .fastq. When the Python wrapper rgFastQC.py gets called, it correctly sees the fastq.gz file. The wrapper creates a symbolic link to the .gz file in a tmp directory. The link name ends in .fastq. When FastQC tries to read this file, it fails because it's compressed. So one of two things is going wrong here:
1) It looks like the wrapper is incorrectly renaming the file, but it's using the name given to it in Galaxy.
2) When the file is uploaded into the data library, Galaxy is stripping off the .gz extension.
I think #2 is the more likely problem. How can I keep Galaxy from stripping the .gz extension?
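One way to picture the mismatch described above is a wrapper that names its symlink after the file's actual content rather than the display name. The following is only a minimal sketch under stated assumptions: the helper names `is_gzipped` and `link_with_matching_extension` are hypothetical and not part of rgFastQC.py, and it assumes FastQC picks its reader from the file extension, as the reported failure suggests.

```python
import os

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of a gzip stream


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def link_with_matching_extension(real_path, display_name, workdir):
    """Symlink `real_path` into `workdir`, appending .gz to the display
    name when the content is gzipped (hypothetical helper for illustration)."""
    name = os.path.basename(display_name)
    if is_gzipped(real_path) and not name.endswith(".gz"):
        name += ".gz"
    link_path = os.path.join(workdir, name)
    os.symlink(os.path.abspath(real_path), link_path)
    return link_path
```

With this approach a dataset displayed as sample.fastq but stored gzipped would be linked as sample.fastq.gz, while a plain-text file keeps its .fastq name.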
Hi Ryan,
The problem isn't Galaxy stripping the extension; rather, Galaxy is actually decompressing the file as part of the upload process.
Unfortunately (and there is an open Trello enhancement request on this), Galaxy does not support storing any of the defined datatypes in compressed form UNLESS they are defined that way (like BAM files).
This has led some Galaxy admins to define a new datatype lgzippedfastq (or similar - I'd have to check my old emails for the exact name used as a gzipped alternative to the Galaxy sangerfastq datatype) and then modify many/all of their tools to handle this. That is a lot of work, but does offer big disk savings for this key datatype.
The Galaxy team instead uses a compressed file system, so for usegalaxy.org ALL their data files are compressed but Galaxy can ignore this complexity.
Peter
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: https://lists.galaxyproject.org/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
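As an illustration of what modifying a tool to handle a gzipped FASTQ datatype might involve, here is a minimal sketch of a reader that accepts either form. The function names `open_fastq` and `count_reads` are made up for this example, and the four-lines-per-record layout is an assumption that holds for typical FASTQ files but not for line-wrapped ones.

```python
import gzip


def open_fastq(path):
    """Open a FASTQ file as text, whether or not it is gzip-compressed."""
    with open(path, "rb") as probe:
        compressed = probe.read(2) == b"\x1f\x8b"  # gzip magic bytes
    if compressed:
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path):
    """Count records, assuming the common four-lines-per-record layout."""
    with open_fastq(path) as handle:
        return sum(1 for _ in handle) // 4
```

Checking the content instead of the file name means the tool keeps working even when the name shown by Galaxy does not carry a .gz suffix.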
Galaxy is not decompressing the file. The file is linked to on the filesystem.
Ah. Then this is more subtle... are you using the library import option where Galaxy just symlinks to existing files? I thought that was not possible with gzipped files (for the reasons given in my previous reply). Perhaps this is not being blocked, leading to the confused state you're seeing?
Peter
Yes, I'm linking to the file on the file system when doing a library import. Does this mean I should link to the uncompressed file?
To (I think) fix this, I changed line 50 in rgFastQC.py from
    infname = self.opts.inputfilename
to
    infname = self.opts.input
This will force FastQC to look at the "real" file and not the renamed dataset.
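The effect of that one-line change can be sketched roughly as follows. This is not the actual rgFastQC.py code: `build_fastqc_command` is a hypothetical helper, and it assumes that `opts.input` holds the real path on disk while `opts.inputfilename` holds the name displayed in Galaxy, which is how the message above describes them.

```python
from types import SimpleNamespace


def build_fastqc_command(opts, outdir):
    """Build a FastQC argument list that points at the real file on disk."""
    # Previously: infname = opts.inputfilename (the Galaxy display name,
    # which has lost its .gz suffix).
    infname = opts.input
    return ["fastqc", "--outdir", outdir, infname]


# Example values are invented for illustration only.
example_opts = SimpleNamespace(
    input="/data/run1/sample.fastq.gz",
    inputfilename="sample.fastq",
)
example_command = build_fastqc_command(example_opts, "/tmp/fastqc_out")
```

Because the real path still ends in .gz, FastQC is handed a file whose name matches its content.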
Hi Ryan,
That is the workaround I am using, which means keeping an uncompressed copy of the FASTQ file on our main storage from where Galaxy can see it (for people to use within their histories).
From a long-term storage perspective this is not ideal - so I am keen for better handling of gzipped files within Galaxy (particularly within libraries, which we use for raw data).
Peter
On Mon, Jan 12, 2015 at 5:20 PM, Ryan G <ngsbioinformatics@gmail.com> wrote:
Yes, I'm linking to the file on the file system when doing a library import. Does this mean I should link to the uncompressed file?
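For the workaround of keeping an uncompressed copy for Galaxy to link to, a minimal sketch could look like the following. The function name `decompress_copy` and the paths passed to it are illustrative assumptions; the original .gz file is left untouched so it can stay in long-term storage.

```python
import gzip
import shutil


def decompress_copy(gz_path, out_path):
    """Write an uncompressed copy of `gz_path` to `out_path`."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # Stream in chunks so large FASTQ files are never held in memory.
        shutil.copyfileobj(src, dst)
    return out_path
```

The uncompressed copy is what the library import would then link to, at the cost of the extra disk space mentioned above.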
Agreed.
Peter has already voted and if I recall correctly Ryan cannot access Trello - so this might be a waste to bring up - but here is a Trello card for voting on this issue and tracking progress: https://trello.com/c/3RkTDnIn
To summarize previous discussion - this would be fantastic to have and Galaxy needs this - but we solve this on usegalaxy.org by using a compressed file system - a more elegant solution when it is a possibility - so it has never been a tier one priority for the devteam. The only update on this is that I don't think we are using a compressed file system anymore, so this might become an issue again someday soon.
This would be non-trivial to implement - but I have always felt this would be a fairly fun project to work on if anyone really tight on space locally wants to try to tackle it :).
-John
participants (3)
- John Chilton
- Peter Cock
- Ryan G