parallel processing of multiple files on a multi-processor box?
Hi,
I have an RNA-seq study with 6 runs. The output is 6 Illumina fastq files. I would like to be able to run FASTQ Groomer and Bowtie on all 6 in parallel, utilizing a multi-processor box. Is this possible? It appears (to our IT group) that Galaxy won't be able to utilize multiple processors on an SMP box for the same user session, and we would need to set up a cluster to do things like this. Is this so? Is there a work-around?
Thanks.
Yury
--
Yury V. Bukhman, Ph.D.
Associate Scientist, Bioinformatics
Great Lakes Bioenergy Research Center
University of Wisconsin - Madison
445 Henry Mall, Rm. 513
Madison, WI 53706, USA
Phone: 608-890-2680
Fax: 608-890-2427
Email: ybukhman@glbrc.wisc.edu
Hi Yury,
My apologies if I missed something or misunderstood your question, but I am not sure the IT group is correct in saying that Galaxy won't be able to utilize multiple processors on an SMP box. When running a tool, the Galaxy server will fork off an asynchronous subprocess, and if you run that tool multiple times in your user session in parallel, it will have that many subprocesses running in parallel, and on an SMP box the kernel will properly utilize the available CPUs.
So you would upload the 6 FASTQ files into your history and then run Groomer on each (you don't have to wait for one to finish in order to launch the next), and then do the same for Bowtie.
hth, Leandro
On Mon, Jan 3, 2011 at 10:52 PM, Yury Bukhman <ybukhman@glbrc.wisc.edu> wrote:
I would like to be able to run FASTQ Groomer and Bowtie on all 6 in parallel, utilizing a multi-processor box. Is this possible?
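As an illustration of the point about subprocesses, here is a minimal sketch outside of Galaxy. It is not Galaxy's own job-running code: the file names and the placeholder command are assumptions, and a real groomer or aligner command line would go in their place. It only shows that several child processes can be started without waiting for each other, leaving the kernel to schedule them across the CPUs.

```python
import subprocess
import sys


def run_in_parallel(commands):
    """Start every command as its own child process, then wait for all of them."""
    # Popen returns immediately, so all children are running before we wait on any.
    children = [subprocess.Popen(cmd) for cmd in commands]
    # The kernel schedules the running children across the available CPUs.
    return [child.wait() for child in children]


# Hypothetical input names, one per sequencing run.
fastq_files = [f"run{i}.fastq" for i in range(1, 7)]

# Placeholder per-file command (assumption): a real tool invocation would replace it.
commands = [
    [sys.executable, "-c", f"print('processing {name}')"] for name in fastq_files
]

exit_codes = run_in_parallel(commands)
```

Each entry in `exit_codes` is the exit status of one child, in the same order as `fastq_files`.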
Thanks, Leandro! Sorry for the naive questions: we're just getting started with our own Galaxy server.
What about running tools that can themselves utilize multiple CPUs? Would that cause any problems? We have noticed that bowtie under Galaxy doesn't have the multithreaded option. What is the reason for that?
Yury
On 01/05/11, Leandro Hermida wrote:
When running a tool, the Galaxy server will fork off an asynchronous subprocess, and if you run that tool multiple times in your user session in parallel, it will have that many subprocesses running in parallel, and on an SMP box the kernel will properly utilize the available CPUs.
Yury,
If the software has this option, it's no problem to use it! Have a look at Peter's latest BLAST+ wrappers. NCBI's BLAST+ lets you specify the number of cores to use, and you can set that in the tool config.
Regarding the parallelisation: I'm no expert in this. Have a look in the tool shed for the SignalP and TMHMM wrappers. There you will find a piece of Python that splits large jobs into batches, processes them in parallel, and merges the results back together.
No experience with cluster or grid jobs myself...
Alex
________________________________________
From: galaxy-user-bounces@lists.bx.psu.edu [galaxy-user-bounces@lists.bx.psu.edu] on behalf of Yury Bukhman [ybukhman@glbrc.wisc.edu]
Sent: Wednesday, 5 January 2011 23:42
To: Leandro Hermida
CC: galaxy-user@lists.bx.psu.edu
Subject: Re: [galaxy-user] parallel processing of multiple files on a multi-processor box?
What about running tools that can themselves utilize multiple CPUs? Would that cause any problems? We have noticed that bowtie under Galaxy doesn't have the multithreaded option. What is the reason for that?
On Thu, Jan 6, 2011 at 7:52 AM, Bossers, Alex <Alex.Bossers@wur.nl> wrote:
If the software has this option, it's no problem to use it! Have a look at Peter's latest BLAST+ wrappers. NCBI's BLAST+ lets you specify the number of cores to use, and you can set that in the tool config.
Have a look in the tool shed for the SignalP and TMHMM wrappers. There you will find a piece of Python that splits large jobs into batches, processes them in parallel, and merges the results back together.
Hi Yury (& Alex),
For a little clarification: like many computationally intensive command line tools, the NCBI BLAST+ tools have a switch for the number of processors. Currently (as with most of the other Galaxy wrappers) this is specified in the XML wrappers, in this case hard coded at 8. Some of the other tools' XML files are hard coded with 4 threads (e.g. bwa).
In the case of TMHMM and SignalP, the tools themselves are single threaded, but I wrote a wrapper script (in Python) which divides the input FASTA file into chunks, runs multiple instances of the tool, and then collates the output. Again, my wrapper tool is told how many threads to use via the XML wrapper. You can find my Galaxy wrappers for TMHMM and SignalP at the "Galaxy Community Tool Shed" (Alex has been testing them - thanks!):
http://community.g2.bx.psu.edu/
Some of the provided Galaxy wrappers have a note in the XML saying the number of threads should be configurable, perhaps via a loc file. I have suggested to the Galaxy developers there should be a general setting for number of threads per tool accessible via the XML, so that this can be configured centrally (maybe I should file an enhancement issue for this):
http://lists.bx.psu.edu/pipermail/galaxy-dev/2010-September/003393.html
http://lists.bx.psu.edu/pipermail/galaxy-dev/2010-October/003407.html
http://lists.bx.psu.edu/pipermail/galaxy-dev/2010-October/003408.html
(I've CC'd the galaxy-dev list, since this discussion is heading in that direction)
Peter
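The split, run and collate approach described for TMHMM and SignalP can be sketched roughly as follows. This is not Peter's actual wrapper: `analyse_chunk` is a hypothetical stand-in for one single-threaded tool run (a real wrapper would launch the external program on each chunk), the FASTA parsing is deliberately simplistic, and the chunk size and thread count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def split_fasta(text, chunk_size):
    """Split FASTA text into lists of at most chunk_size records each."""
    # Simplification: assumes '>' only ever appears at the start of a header line.
    records = [">" + r for r in text.split(">") if r.strip()]
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def analyse_chunk(records):
    """Hypothetical stand-in for one tool run: report (id, sequence length) per record."""
    results = []
    for record in records:
        header, *seq_lines = record.strip().splitlines()
        results.append((header[1:].split()[0], len("".join(seq_lines))))
    return results


def run_chunked(text, chunk_size, threads):
    """Process the chunks concurrently and collate the results in input order."""
    chunks = split_fasta(text, chunk_size)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() yields results in the order of the inputs, so the collated
        # output lines up with the original FASTA file.
        per_chunk = list(pool.map(analyse_chunk, chunks))
    return [row for chunk in per_chunk for row in chunk]


fasta = ">seq1\nMKT\nAA\n>seq2\nMK\n>seq3\nMKTLLV\n"
collated = run_chunked(fasta, chunk_size=2, threads=4)
```

Threads are enough here because, in a real wrapper, each worker would mostly be waiting on an external process; the `threads` argument plays the role of the thread count passed in from the XML wrapper.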
Two parts to this probably.
1) It should definitely be possible to have parameters in tool configs that are set in a global configuration file. I actually thought tool_conf.xml might be a good place (inside the tool element).
2) For the particular case of processor cores, ideally we would be able to have the batch management system set this information (if running on an 8 core node, use 8).
On Jan 6, 2011, at 5:10 AM, Peter wrote:
Some of the provided Galaxy wrappers have a note in the XML saying the number of threads should be configurable, perhaps via a loc file. I have suggested to the Galaxy developers there should be a general setting for number of threads per tool accessible via the XML, so that this can be configured centrally (maybe I should file an enhancement issue for this):
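The two ideas in this exchange (a central per-tool setting, and letting the batch system supply the core count) could be combined along these lines. This is only a sketch of the proposal, not an existing Galaxy feature: the environment variable name `BATCH_JOB_SLOTS` and the layout of `global_settings` are both hypothetical, and the values 8 and 4 simply echo the hard coded thread counts mentioned above.

```python
import os


def resolve_thread_count(tool_id, global_settings, environ=None, default=1):
    """Pick a thread count: batch-system value first, then the central setting."""
    environ = os.environ if environ is None else environ
    # Hypothetical variable name; a real batch system would define its own.
    slots = environ.get("BATCH_JOB_SLOTS", "")
    if slots.isdigit() and int(slots) > 0:
        return int(slots)
    # Hypothetical layout for a central, per-tool settings table.
    return int(global_settings.get(tool_id, {}).get("threads", default))


global_settings = {"blast": {"threads": 8}, "bwa": {"threads": 4}}

on_cluster = resolve_thread_count("bwa", global_settings, environ={"BATCH_JOB_SLOTS": "8"})
standalone = resolve_thread_count("bwa", global_settings, environ={})
unconfigured = resolve_thread_count("bowtie", global_settings, environ={})
```

Giving the batch system priority means a job sent to an 8 core node uses 8 threads regardless of the central default, while a tool with no entry at all falls back to a single thread.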
participants (5)
- Bossers, Alex
- James Taylor
- Leandro Hermida
- Peter
- Yury Bukhman