Hello all,

We were having the same issue, with the groomer taking up to 12 hours for large files. I had a look at the code and saw that it was only using a single core. I changed it to split the FASTQ input into multiple file parts, process them in parallel, and reassemble the results. It also reassembles the aggregator data (which is used to print the final summary).
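
To illustrate the split / process / reassemble idea, here is a minimal Python sketch. It is not the actual patch: it hands chunks of records to the workers in memory rather than writing file parts to disk, and groom_chunk, read_chunks, groom_parallel and the chunk size are hypothetical names and values chosen for the example. The Counter merely stands in for the aggregator data.

```python
from collections import Counter
from itertools import islice
from multiprocessing import Pool

LINES_PER_RECORD = 4  # assumes strict 4-line FASTQ (no wrapped sequences)


def read_chunks(handle, records_per_chunk):
    """Yield lists of lines, each holding a whole number of 4-line records."""
    size = records_per_chunk * LINES_PER_RECORD
    while True:
        chunk = list(islice(handle, size))
        if not chunk:
            return
        yield chunk


def groom_chunk(lines):
    """Hypothetical stand-in for the real per-record grooming step.

    Returns the lines unchanged plus a Counter of quality characters,
    which plays the role of the aggregator data here.
    """
    summary = Counter()
    for i in range(3, len(lines), LINES_PER_RECORD):
        summary.update(lines[i].rstrip("\n"))
    return lines, summary


def groom_parallel(in_path, out_path, workers=8, records_per_chunk=100000):
    """Groom chunks in parallel, write them in input order, merge summaries."""
    total = Counter()
    with open(in_path) as src, open(out_path, "w") as dst, Pool(workers) as pool:
        # imap yields results in input order, so the output file has the
        # same record order as a serial run would produce.
        chunks = read_chunks(src, records_per_chunk)
        for lines, summary in pool.imap(groom_chunk, chunks):
            dst.writelines(lines)
            total += summary
    return total
```

On platforms that start worker processes by spawning (for example Windows), the call to groom_parallel needs to sit under an `if __name__ == "__main__":` guard.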

Using 8 cores, we saw a 7x improvement. Naturally, the output is identical. One limitation is that it does not support FASTQ files that spread a single sequence over multiple lines. I have read that this practice is discouraged anyway because it was problematic (though it was in the original spec), and I haven't seen it occur in our data so far.
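
A simple guard against the multi-line case could look like the sketch below. It is only a heuristic that I am assuming would be good enough: it inspects the first records and checks that each one is exactly four lines with matching sequence and quality lengths. The function name and the max_records default are made up for the example.

```python
from itertools import islice


def is_four_line_fastq(lines, max_records=1000):
    """Return True if the first records look like strict 4-line FASTQ."""
    lines = iter(lines)
    for _ in range(max_records):
        record = [line.rstrip("\r\n") for line in islice(lines, 4)]
        if not record:
            return True  # reached the end of the input
        if len(record) < 4:
            return False  # truncated record
        header, seq, plus, qual = record
        # In wrapped FASTQ the third line is more sequence, not the '+' line.
        if not header.startswith("@") or not plus.startswith("+"):
            return False
        if len(seq) != len(qual):
            return False
    return True
```

Running such a check before splitting would let the tool fall back to the single-core path for wrapped files instead of producing wrong output.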

I believe there is still room for improvement, as Python's readline performs suboptimally here: it does too much file I/O without enough buffering.
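
One thing that might be worth trying is to open the file with a larger explicit buffer and iterate over the file object instead of calling readline repeatedly. The sketch below shows the idea; iter_records and the 1 MiB buffer size are assumptions for illustration, and I have not measured whether this helps the groomer itself.

```python
from itertools import islice


def iter_records(path, buffer_size=1 << 20):
    """Yield (header, seq, plus, qual) byte tuples from a 4-line FASTQ file."""
    # buffering sets the size of the underlying read buffer in bytes.
    with open(path, "rb", buffering=buffer_size) as handle:
        while True:
            record = tuple(islice(handle, 4))
            if len(record) < 4:
                return  # end of file (a trailing partial record is dropped)
            yield record
```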

I'm new to bioinformatics, though I come from a background in R&D computer engineering. If anyone is at the Chicago Galaxy conference, you can talk to Warren Kaplan about this. I can provide the code.

regards
Kenny

------
Bioinformatics Architect
Garvan Institute

On Wed, Jul 25, 2012 at 5:54 AM, Langhorst, Brad <Langhorst@neb.com> wrote:
Galaxy just wraps existing tools, so it's probably not Galaxy that is slow per se, but the fastqgroomer tool. Each tool has its own performance characteristics.

I don't use fastqgroomer, so I don't know how it can be expected to perform.

Are you sure you need it?

If you know that your quality scores are already scaled in Sanger units (Ion Torrent and CASAVA 1.8 FASTQs are), then you may not.
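
If you are not sure, a rough check is to look at the lowest quality character in the file. The Python sketch below rests on the usual ASCII ranges: Sanger (Phred+33) qualities start at ASCII 33, while the older Solexa and Illumina +64 encodings do not go below ASCII 59. The function name is hypothetical, and the check can only confirm Sanger scaling, never rule it out.

```python
def guess_quality_offset(quality_lines):
    """Guess the quality encoding from an iterable of quality strings."""
    lowest = None
    for line in quality_lines:
        codes = line.strip().encode("ascii")
        if codes:
            low = min(codes)  # smallest ASCII code on this line
            lowest = low if lowest is None else min(lowest, low)
    if lowest is None:
        return "unknown"  # no quality characters seen
    if lowest < 59:
        return "phred33"  # below the range of the +64 encodings
    return "ambiguous"  # could be +64, or Sanger data with only high scores
```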

If you look at your activity monitor you can see if CPU or disk is the limiting factor for the work you are doing.


Brad
On Jul 24, 2012, at 3:41 PM, Di Nguyen wrote:

> Dear All,
>
> I successfully installed Galaxy on my new MBP with 16 GB of RAM, but when I tried to use Galaxy, it was painfully slow! The first test I did was to create an admin account and import data (RNA-seq FASTQ, about 6 GB in size) into the database and then into a history, and it worked fine. The second test was to run fastqgroomer on this FASTQ file, and it took forever (3 hours+).
>
> Does anybody have an idea why it is so slow? Could it be that Galaxy was set up to run as a single process instead of using the 8-core processor? If that is the case, how can I fix it?
>
> Please help!
>
> Di Nguyen
> Postdoc, U of W, Seattle, WA
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>
> http://lists.bx.psu.edu/

--
Brad Langhorst
langhorst@neb.com
978-380-7564




