DRMAA runner weirdness
I'm running a test Galaxy system on a cluster (merged galaxy-dist on January 4th), and I've noticed some odd behavior from the DRMAA job runner. I'm running a multi-process setup: one web server, one job_manager, and three job_handlers. DRMAA is the default job runner (the runner URL for tophat2 is drmaa://-V -l mem_total=7G -pe smp 2/), with SGE 6.2u5 being the engine underneath.

My test involves trying to run three different Tophat2 jobs. The first two seem to start up (and get put on the SGE queue), but the third stays grey, with the job manager listing it in state 'new' with command line 'None'; it doesn't seem to leave this state. Both of the jobs that actually got onto the queue die (reasons unknown, but much too early, probably some tophat/bowtie problem), but one job is listed in error state with stderr as 'Job output not returned from cluster', while the other job (which is no longer in the SGE queue) is still listed as running.

Any ideas?

Kyle
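In the Galaxy of this era, a setup like the one described above is configured in universe_wsgi.ini. The following is only an illustrative sketch, assuming the pre-job_conf.xml configuration style; the handler names and values are assumptions, not Kyle's actual config.

    [app:main]
    # start the DRMAA runner and use it by default
    start_job_runners = drmaa
    default_cluster_job_runner = drmaa:///
    # one manager process plus three handler processes
    job_manager = manager
    job_handlers = handler0,handler1,handler2

    [galaxy:tool_runners]
    # per-tool runner URL with SGE native options, as quoted above
    tophat2 = drmaa://-V -l mem_total=7G -pe smp 2/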
On Jan 9, 2013, at 12:18 AM, Kyle Ellrott wrote:
Hi Kyle,

It sounds like there are a bunch of issues here. Do you have any limits set on the number of concurrent jobs allowed? If not, you may need to add a bit of debugging information to the manager or handler code to figure out why the 'new' job is not being dispatched for execution.

For the 'error' job, more information about output collection should be available from the Galaxy server log. If you have general SGE problems this may not be Galaxy's fault; you do need to make sure that the stdout/stderr files can be properly copied back to the Galaxy server upon job completion.

For the 'running' job, make sure you've got 'set_metadata_externally = True' in your Galaxy config.

--nate
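The options Nate refers to also live in universe_wsgi.ini. A rough sketch follows; the limit option names are an assumption about the sample config of that period (verify against universe_wsgi.ini.sample for your revision), and the values are illustrative.

    # have metadata set by an external process rather than in the handler
    set_metadata_externally = True

    # concurrent-job caps; if set, jobs over the limit sit in 'new'
    # until a slot frees up
    registered_user_job_limit = 4
    anonymous_user_job_limit = 1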
I did a merge of galaxy-central that included the patch you posted today, and the scheduling problem seems to have gone away. I'm still getting back 'Job output not returned from cluster' for errors, though, which seems odd, as the system previously returned stderr correctly.

Kyle
Hello,

Can you please post the link to this patch? I do not see it in the mail thread, and I too have noticed some issues with DRMAA job running since updating to the Oct. 23rd distribution. I don't know yet whether it is related, but I'd like to try the patch to see. I have two local instances of Galaxy (prod and dev). On my dev instance (which is fully up to date), when I run the same job multiple times, sometimes it finishes and sometimes it dies, independent of which node it runs on. My prod instance is still at the Oct. 03 distribution and does not experience this problem, so I am afraid to update our production instance.

Thanks in advance,
Liisa
My theory about what was happening: as Nate suggested, there were multiple issues.

First, the patch he posted (https://bitbucket.org/galaxy/galaxy-central/commits/c015b82b3944f967e2c859d5...) stops Galaxy from attempting to stop jobs that have no external ID set. So maybe I was running into some sort of concurrency issue, perhaps from running multiple job runners. That may have been causing the jobs with null command lines, or the ones that never returned despite no longer being on the queue (at least those issues disappeared after his patch).

Second, something seems to have changed in our cluster setup since the last time I ran Tophat2. SGE's behavior with regard to the 'ulimit -v' setting got a lot more stringent (maybe the admins changed some settings). Previously I could set 'ulimit -v unlimited' in my profile and be done with it, but SGE started blocking that, so ulimit operations started returning 'cannot modify limit: Operation not permitted'. As a result, when bowtie (under tophat) used up too much memory, the whole job was unceremoniously killed, without even letting the runner script dump the log files to stderr. The workaround for this is to use the resource flag to allocate 'v_mem' (i.e. drmaa://-l v_mem=10G/).

Kyle
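Concretely, the failure and workaround look something like the following. The shell error is as quoted above; whether the memory complex is called v_mem, h_vmem, or virtual_free depends on how the site's SGE complexes are defined, so the runner-URL line is illustrative.

    # on a compute node, under the tightened SGE limits:
    $ ulimit -v unlimited
    ulimit: cannot modify limit: Operation not permitted

    # instead, request the memory from SGE via the tool's runner URL
    # (combine with any other native options the tool already uses):
    tophat2 = drmaa://-l v_mem=10G/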
On Jan 11, 2013, at 9:32 AM, Liisa Koski wrote:
Hi Liisa,

Here's the one that Kyle is referring to:

https://bitbucket.org/galaxy/galaxy-central/commits/c015b82b3944f967e2c859d5...

However, this patch should only fix the problem of the server segfaulting when deleting certain jobs (ones that have not yet been dispatched to the cluster).

--nate
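For reference, the shape of that fix is roughly the following. This is only a sketch, not the actual changeset; the function and attribute names are assumptions about the DRMAA runner code of that era, and only the guard itself reflects what the patch does: skip the DRMAA terminate call for jobs that were never dispatched and therefore have no external SGE job ID.

    import logging
    import drmaa

    log = logging.getLogger(__name__)

    def stop_job(session, job):
        # 'session' is an open drmaa.Session; 'job' is the Galaxy job
        # being deleted.
        ext_id = job.get_job_runner_external_id()  # None while still 'new'
        if ext_id is None:
            # nothing was ever submitted, so there is nothing to terminate;
            # calling into the DRMAA library without an ID is what crashed
            log.debug("(%s) job has no external ID, not stopping", job.get_id())
            return
        session.control(ext_id, drmaa.JobControlAction.TERMINATE)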
In our case, someone had installed and started a second development instance of Galaxy but pointed it at the same database as the first development instance, so the IDs were getting mixed up and causing some jobs to crash. Yuck!

Thanks,
Liisa
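In other words, each instance needs its own database, i.e. a distinct database_connection in each instance's universe_wsgi.ini. The database names below are purely illustrative.

    # dev instance 1
    database_connection = postgres://galaxy@dbhost/galaxy_dev1
    # dev instance 2
    database_connection = postgres://galaxy@dbhost/galaxy_dev2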
participants (3)
- Kyle Ellrott
- Liisa Koski
- Nate Coraor