On Thu, Jan 6, 2011 at 7:52 AM, Bossers, Alex Alex.Bossers@wur.nl wrote:
Yury,
If the software has this option, it's no problem to use it! Have a look at Peter's latest BLAST+ wrappers. NCBI BLAST+ can be told how many cores to use, and you can set this by configuring it in the tool config. Regarding the parallelisation, I'm no expert. Have a look in the Tool Shed for the SignalP and TMHMM wrappers; there you'll find a piece of Python to split large jobs into batches, process them in parallel, and merge the results back.
No experience with cluster or grid jobs myself...
Alex
Hi Yury (& Alex),
For a little clarification: like many computationally intensive command line tools, the NCBI BLAST+ tools have a switch for the number of processors. Currently (as with most of the other Galaxy wrappers) this is specified in the XML wrappers, in this case hard coded at 8. Some of the other tools' XML files are hard coded with 4 threads (e.g. bwa).
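For illustration, the hard-coded setting in such a wrapper looks roughly like this (a heavily simplified fragment, not the actual BLAST+ wrapper XML; only the `-num_threads` part is the point here):

```xml
<tool id="ncbi_blastn_wrapper" name="NCBI BLAST+ blastn">
  <command>
    blastn -query $query -db $db -out $output -num_threads 8
  </command>
</tool>
```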
In the case of TMHMM and SignalP, the tools themselves are single threaded, but I wrote a wrapper script (in Python) which divides the input FASTA file into chunks, runs multiple instances of the tool, and then collates the output. Again, my wrapper tool is told how many threads to use via the XML wrapper. You can find my Galaxy wrappers for TMHMM and SignalP at the "Galaxy Community Tool Shed" (Alex has been testing them - thanks!): http://community.g2.bx.psu.edu/
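For anyone curious, the split/run/merge approach can be sketched in a few lines of Python. This is a simplified illustration, not the actual TMHMM/SignalP wrapper code; `some_tool` and its `-in`/`-out` flags are placeholders for whatever single-threaded command line you are wrapping.

```python
import os
import subprocess
import tempfile
from multiprocessing.pool import ThreadPool

def split_fasta(path, num_chunks):
    """Split a FASTA file into roughly equal chunks on record boundaries."""
    with open(path) as handle:
        raw = handle.read().split("\n>")
    records = []
    for r in raw:
        r = r.strip()
        if not r:
            continue
        if not r.startswith(">"):
            r = ">" + r  # restore the '>' consumed by the split
        records.append(r)
    chunks = []
    for i in range(num_chunks):
        part = records[i::num_chunks]  # round-robin assignment of records
        if not part:
            continue
        fd, name = tempfile.mkstemp(suffix=".fasta")
        with os.fdopen(fd, "w") as out:
            out.write("\n".join(part) + "\n")
        chunks.append(name)
    return chunks

def run_tool(chunk):
    out = chunk + ".out"
    # Placeholder command line -- substitute the real single-threaded tool here
    subprocess.check_call(["some_tool", "-in", chunk, "-out", out])
    return out

def parallel_run(fasta, threads, merged="merged.out"):
    chunks = split_fasta(fasta, threads)
    outputs = ThreadPool(threads).map(run_tool, chunks)
    with open(merged, "w") as out:  # collate the per-chunk output files
        for o in outputs:
            with open(o) as handle:
                out.write(handle.read())
    return merged
```

Note this simple merge only works for tools whose output is per-record (like tabular TMHMM/SignalP predictions); anything with a global header or summary needs a smarter collation step.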
Some of the provided Galaxy wrappers have a note in the XML saying the number of threads should be configurable, perhaps via a loc file. I have suggested to the Galaxy developers that there should be a general setting for the number of threads per tool, accessible via the XML, so that this can be configured centrally (maybe I should file an enhancement issue for this):
http://lists.bx.psu.edu/pipermail/galaxy-dev/2010-September/003393.html
http://lists.bx.psu.edu/pipermail/galaxy-dev/2010-October/003407.html
http://lists.bx.psu.edu/pipermail/galaxy-dev/2010-October/003408.html
(I've CC'd the galaxy-dev list, since this discussion is heading in that direction)
Peter
Two parts to this probably.
1) It should definitely be possible to have parameters in tool configs that are set in a global configuration file; I actually thought tool_conf.xml might be a good place (inside the tool element).
2) For the particular case of processor cores, ideally we would be able to have the batch management system set this information (if running on an 8 core node, use 8).
On Jan 6, 2011, at 12:35 PM, James Taylor wrote:
> Two parts to this probably.
> 1) It should definitely be possible to have parameters in tool configs that are set in a global configuration file; I actually thought tool_conf.xml might be a good place (inside the tool element).
> 2) For the particular case of processor cores, ideally we would be able to have the batch management system set this information (if running on an 8 core node, use 8).
I'm not sure of a cross-platform way to figure this out, and I admit this is kind of a hack, but if running in a TORQUE batch environment a tool wrapper could, at run time, parse the file specified in $PBS_NODEFILE to figure out how many threads it should be using (i.e. how many cores have been allocated to the job). This would require a wrapper script that parses this file and then passes the right flag on the command line to instruct the executable to use the right number of cores (not necessarily how many are on the node, but how many the batch system has allocated to the job).

MPI jobs are easier if the MPI implementation integrates with the batch system, since by default they will run on all the nodes/cores allocated to the job.

It would be possible to write a script that knew how to do this for a variety of batch systems and could auto-detect which environment it is running in. This could then be used to plug in the value for a command line switch specifying the number of threads for the actual executable.
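As a sketch of the $PBS_NODEFILE idea (illustrative, not Galaxy code): TORQUE writes one line per allocated core, with node names repeated for multiple cores on the same node, so counting non-blank lines gives the core count allocated to the job.

```python
import os

def allocated_cores(default=1):
    """Count the cores TORQUE allocated to this job via $PBS_NODEFILE.

    The node file lists one line per allocated core (node names repeat
    for multiple cores on the same node). Falls back to `default` when
    not running under TORQUE.
    """
    nodefile = os.environ.get("PBS_NODEFILE")
    if not nodefile or not os.path.exists(nodefile):
        return default
    with open(nodefile) as handle:
        return sum(1 for line in handle if line.strip())
```

A wrapper script could then pass this value along, e.g. as `-num_threads` for BLAST+, rather than relying on a hard-coded number in the tool XML.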
But then there is still the problem of requesting the proper number of nodes and cores per node from the batch system. For our local setup we have a default PBS job runner that submits to the default server and default queue and requests a single core for some upper limit of time:
pbs:////-l nodes=1:ppn=1,walltime=HH:MM:SS
and then for every threaded tool we specify a job runner specific to that tool, like this:
bowtie_wrapper = pbs:////-l nodes=1:ppn=N,walltime=HH:MM:SS/
where in our case N <= 32 (we have 32 cores per node).
It would be nice to have a parameterized job runner where the tool itself had some control over the number of nodes and ppn it requested, but with system-specified bounds on the values.
A runner specified like pbs:////-l nodes=${NODES}:ppn=${CORES_PER_NODE},walltime=HH:MM:SS/ where Galaxy knew how to fill in NODES and CORES_PER_NODE, based on information in the tool configuration and system-specified limits, would be nice. Then we wouldn't need to define a new job runner for every tool that we don't want using our default single node/core runner.
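To make the current (non-parameterized) setup concrete, the per-tool runner mapping described above lives in the Galaxy config roughly like this. This is an illustrative fragment from memory, not a verbatim universe_wsgi.ini; ppn=8 is just an example value, and the walltimes are placeholders as above:

```ini
[app:main]
# Default: single core, submitted to the default PBS server and queue
default_cluster_job_runner = pbs:////-l nodes=1:ppn=1,walltime=HH:MM:SS/

[galaxy:tool_runners]
# Per-tool override for a threaded tool (here 8 of our 32 cores per node)
bowtie_wrapper = pbs:////-l nodes=1:ppn=8,walltime=HH:MM:SS/
```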
galaxy-dev mailing list
galaxy-dev@lists.bx.psu.edu
http://lists.bx.psu.edu/listinfo/galaxy-dev
--
Glen L. Beane
Software Engineer
The Jackson Laboratory
Phone (207) 288-6153