Parallelism (job splitting) for ncbi_blast_plus running through CloudMan
Hello,

We would like to split FASTA query files and run multiple concurrent jobs to minimize our wall-clock processing time for large jobs.

After chatting with folks at GCC 2015 I understand this is possible; my problem is that I can't find instructions on how to configure CloudMan/ncbi_blast_plus to do this. For those of you who know me it probably goes without saying that I can't figure it out myself ;)

Peter/Enis/others, can you help us out with this question?

Thanks,
David
Hi David,

The NCBI BLAST+ wrappers have a <parallelism> tag set up, which becomes active if you have use_tasked_jobs = True in your config/galaxy.ini file (a.k.a. universe_wsgi.ini). Specifically, the wrappers use this:

    <!-- If job splitting is enabled, break up the query file into parts -->
    <parallelism method="multi" split_inputs="query"
                 split_mode="to_size" split_size="1000"
                 merge_outputs="output1" />

This is hard-coded to break up the query FASTA file into batches of 1000 sequences (e.g. a transcriptome of 20k genes becomes 20 jobs), which has worked nicely on our cluster.

Separately, each job uses -num_threads "\${GALAXY_SLOTS:-8}" in the command-line string, i.e. it uses the $GALAXY_SLOTS environment variable (set via the Galaxy job configuration), or, if that is not set, defaults to using 8 threads.

I've essentially rephrased the README file here - did you see that, or does it need more information added?

Thanks,
Peter
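As a concrete sketch of the one-line configuration change Peter describes (this assumes the standard Paste-style galaxy.ini layout of that era, where application settings live under the [app:main] section):

    # config/galaxy.ini
    [app:main]
    # Activate the <parallelism> tags in tool wrappers (job splitting):
    use_tasked_jobs = True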
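The \${GALAXY_SLOTS:-8} syntax Peter mentions is ordinary POSIX shell parameter expansion (the backslash only stops Cheetah, Galaxy's template engine, from interpreting the $ itself). A hypothetical command line, purely to illustrate the fallback behaviour - the database name, flags, and file names here are placeholders, not what the wrapper emits verbatim:

    # Use the GALAXY_SLOTS value set by the job runner, or default to 8 threads.
    blastx -query query_chunk.fasta -db example_db \
           -outfmt 6 -out chunk_hits.tabular \
           -num_threads "${GALAXY_SLOTS:-8}"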
Peter,

Thanks, I didn't see that - I was reading the paper and searching online.

Appreciate the help, we'll give it a go!

David
Peter,

We made the modification to the config file, restarted Galaxy, and things seem to be working from the Galaxy end. We see sub-job directories being created in /mnt/galaxy/tmp/job_working_directory, and we think all of the required job chunks have been created (i.e. there are now total-sequences/1000 sub-job directories, with no more being created).

Now we have what may be a CloudMan question: our working cluster has a head node and 4 workers. The head node is loaded up but the workers are idle. I would have thought jobs should be pushing out to the workers, but we don't see any load on these machines.

Any advice? Thanks.

David

PS. What is the path of the file which contains the split_size="1000" configuration?
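A quick way to sanity-check the split, sketched on the assumption that Galaxy's task runner creates per-chunk working directories named task_<n> beneath the parent job directory (the exact layout can vary between Galaxy versions):

    # Count the per-task working directories created by the job splitter.
    find /mnt/galaxy/tmp/job_working_directory -maxdepth 3 -type d -name 'task_*' | wc -l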
On Tue, May 3, 2016 at 9:23 PM, David Kovalic <kovalic@analome.com> wrote:

> Now we have what may be a CloudMan question: our working cluster has a
> head node and 4 workers. The head node is loaded up but the workers are
> idle. I would have thought jobs should be pushing out to the workers, but
> we don't see any load on these machines.
>
> Any advice? Thanks.

Wait a bit longer? The downside of the job splitting is the extra disk I/O overhead of splitting the files (here FASTA inputs) and then merging the output (e.g. BLAST tabular, XML, etc.). IIRC, this happens on the head node only.

I've not used CloudMan, so I have no specific advice here, other than to ask: did you confirm that jobs were getting sent to the worker nodes before turning on use_tasked_jobs = True in your config/galaxy.ini file?

> PS. What is the path of the file which contains the split_size="1000"
> configuration?

This is currently defined by the tool wrapper author in the tool wrapper XML file. Setting up something which will work well on a broad range of input file sizes is a bit of an art - simply always dividing the input into 8 chunks does not scale well. With BLAST+ I found chunks of 1000 queries was a good balance, while for other tools processing FASTA inputs I used chunks of 2000 queries.

I'll link to the latest files on GitHub, but you can browse this on the Galaxy Tool Shed too - it also ought to show the README text quite prominently:

https://github.com/peterjc/galaxy_blast/tree/master/tools/ncbi_blast_plus

In a simple tool, you would see the <parallelism> tag directly in the wrapper XML file, usually near the top by convention, e.g.

https://github.com/peterjc/pico_galaxy/blob/master/tools/protein_analysis/pr...
https://github.com/peterjc/pico_galaxy/blob/master/tools/protein_analysis/tm...

However, with the BLAST+ wrappers we use macros. So, using BLASTX as an example, the wrapper is this XML file:

https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/nc...

There is no sign of the <parallelism> tag directly, but it is pulled in from:

https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/nc...

This happens via:

    ...
    <macros>
        ...
        <import>ncbi_macros.xml</import>
    </macros>
    <expand macro="parallelism" />
    ...

This is a bit more complex, but it means avoiding repeating the XML snippet in almost all of the BLAST+ wrapper files.

Peter
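For readers following the macro indirection: <macros>, <xml name="...">, <import>, and <expand macro="..."/> are standard Galaxy tool XML. A hypothetical sketch of how such a parallelism macro could be defined in ncbi_macros.xml - the <parallelism> element is the one quoted earlier in the thread, but this exact file content is an illustration rather than a copy:

    <macros>
        <!-- Reusable snippet; wrappers pull it in with <expand macro="parallelism"/> -->
        <xml name="parallelism">
            <parallelism method="multi" split_inputs="query"
                         split_mode="to_size" split_size="1000"
                         merge_outputs="output1" />
        </xml>
    </macros>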
Peter,

Thanks for the great information. I see where to tune the "split_size" variable, and also the READMEs :)

I'll do some more sleuthing, let the job run, and observe. So far it is ~3 hr after job launch and there is still no load on the workers. From looking at /mnt/galaxy/tmp/job_working_directory, I think all of the sub-job directories are prepared. Maybe it is an issue with the job scheduler/dispatch.

Thanks again.

David
> Now we have what may be a CloudMan question: our working cluster has a
> head node and 4 workers. The head node is loaded up but the workers are
> idle. I would have thought jobs should be pushing out to the workers, but
> we don't see any load on these machines.

So the jobs are queued but just not being scheduled on the workers? If that's the case, it is definitely not the expected behavior. What about submitting an additional job from Galaxy or the command line - where does it go? I've never tried the splitting thing, but I just wonder if there's some affinity assigned that makes all jobs run on the same node, although it seems that would defeat the whole purpose of splitting...

Enis
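CloudMan clusters of this vintage typically used SGE (Sun Grid Engine) as the scheduler; assuming that is the case here, a couple of standard SGE commands run on the head node would show where jobs are actually landing (substitute your scheduler's equivalents otherwise):

    # List all queued/running jobs, broken down by queue instance (i.e. per node).
    qstat -f
    # Show per-host load averages and slot usage across the cluster.
    qhost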
Enis, FYI,

Unfortunately I am traveling out of the country today, and further investigation will need to wait until I return. I will set a note to follow up on this when I get back, and will let you know more details so we can see just what is happening. Expect to hear back from me in a couple of weeks.

Thanks for the help,
David
Participants (3):
- David Kovalic
- Enis Afgan
- Peter Cock