Fastq_filter fails on large files
Hello,

The tool fastq_filter worked on 1M reads but fails (hangs) on 15M reads. I had to kill the job after the user let it run for a whole day. A debug.txt file containing a Python function "fastq_read_pass_filter" is created in the files/000/dataset_xxx_files directory, and I am getting no error from the Galaxy server.

What could cause fastq_filter to fail? The FASTX equivalent tool works, but it lacks most of fastq_filter's options.

I'd be grateful for any hints to help me get fastq_filter working on large fastq files.

Thanks,
Isabelle
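For context on that debug.txt: the file's contents suggest fastq_filter compiles the options chosen in the tool form into a small Python predicate that is then applied to every read. A minimal, hypothetical sketch of what such a predicate could look like (the cutoff value and the list-of-scores signature are illustrative assumptions, not the actual generated script):

    # Hypothetical sketch of the kind of per-read predicate fastq_filter
    # generates from its form options; the real debug.txt embeds the exact
    # values chosen in the tool form.
    MIN_QUALITY = 30  # assumed cutoff for this illustration

    def fastq_read_pass_filter(quality_scores):
        """Return True only if every base quality in the read meets the cutoff."""
        return all(score >= MIN_QUALITY for score in quality_scores)

    # Example: quality_scores would be the decoded per-base scores of one read.
    print(fastq_read_pass_filter([31, 35, 40]))  # True
    print(fastq_read_pass_filter([31, 12, 40]))  # False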
Hi Isabelle,

There are no known limits, but perhaps you have found something new. We can explore two areas:

Your Galaxy instance config: Is metadata being set externally? Specifically, we are wondering whether you have optional metadata configured to not count fastq blocks if the file is larger than a specified size, or similar.

Example data & filter options:
1 - If you could load some sample input files into a history on Galaxy main and share the link, that would be helpful. Just a sample of sequences that is representative of the entire dataset.
2 - Note the specific filter options used in the fastq_filter tool. We can scale the data up, run with your filters, and try to see what is causing the problem.

We look forward to your reply,

Jen
Galaxy team
-- Jennifer Jackson http://usegalaxy.org
Hello Jen,

Apologies for the delay in responding; I got side-tracked with other projects. Now back to Galaxy!
> Is metadata being set externally?
I have it commented out in universe_wsgi.ini: #set_metadata_externally = False
> 1 - If you could load some sample input files into a history on Galaxy main and share the link, that would be helpful. Just a sample of sequences that is representative of the entire dataset.
http://main.g2.bx.psu.edu/u/iphan/h/iphan-test This contains the first 100 fastq sequences of the raw file (before grooming).
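One way to scale such a 100-read sample up for a reproduction, as offered above, is simply to replicate it (an illustrative standalone helper, not a Galaxy tool; file names are examples). Duplicate read IDs do not matter for a hang/performance test:

    # Illustrative helper (not part of Galaxy): replicate a small FASTQ
    # sample until it holds roughly the target number of reads, to
    # reproduce the hang on a 15M-read file. Assumes the sample file
    # ends with a newline so records do not merge across copies.
    def scale_fastq(sample_path, out_path, target_reads=15_000_000):
        with open(sample_path) as f:
            lines = f.readlines()
        reads_per_copy = len(lines) // 4             # 4 lines per FASTQ record
        copies = -(-target_reads // reads_per_copy)  # ceiling division
        with open(out_path, "w") as out:
            for _ in range(copies):
                out.writelines(lines)

    scale_fastq("sample_100.fastq", "sample_15M.fastq")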
> 2 - Note the specific filter options used in the fastq_filter tool. We can scale the data up, run with your filters, and try to see what is causing the problem.
Analysis steps:
- groom with fastq_groomer (default options)
- fastq_filter: trim 3 bases from the 5' end, set minimum quality to 30 (a standalone sketch of this step follows this message)

1M sequences run fine; 15M sequences crash fastq_filter as described above. The FASTX tools work, but they require running 2 different tools (i.e. they produce 1 more intermediary file), which does not scale with the number and size of files we are dealing with. I'd rather use fastq_filter, if possible.

We are using a local install of Galaxy for exploratory analysis of NGS data and so far are very happy with it; kudos to your team.

Isabelle

--
Isabelle Phan, DPhil
Seattle Biomedical Research Institute
+1 (206) 256 7113
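For reference, a minimal standalone sketch of the filtering step described above, assuming groomed, Sanger-encoded FASTQ (quality = ord(char) - 33): trim 3 bases from the 5' end, then keep a read only if every remaining base quality is at least 30. This is a sanity check to run outside Galaxy, not Galaxy's actual fastq_filter implementation:

    # Minimal single-pass sketch (not Galaxy's code): trim 3 bases from
    # the 5' end, keep reads whose remaining qualities are all >= 30.
    MIN_QUALITY = 30
    TRIM_5PRIME = 3

    def filter_fastq(in_path, out_path):
        kept = 0
        with open(in_path) as fin, open(out_path, "w") as fout:
            while True:
                record = [fin.readline() for _ in range(4)]
                if not record[0]:  # end of file
                    break
                seq = record[1].rstrip()[TRIM_5PRIME:]
                qual = record[3].rstrip()[TRIM_5PRIME:]
                # Sanger encoding: per-base quality = ord(char) - 33
                if qual and min(ord(c) - 33 for c in qual) >= MIN_QUALITY:
                    fout.write(record[0])          # header line, kept as-is
                    fout.write(seq + "\n")
                    fout.write(record[2])          # "+" separator line
                    fout.write(qual + "\n")
                    kept += 1
        return kept

    print(filter_fastq("groomed.fastq", "filtered.fastq"), "reads kept")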
participants (2)
- Isabelle Phan
- Jennifer Jackson