DRMAA job will now be errored -> Segmentation fault
Hi all,

Something has changed in the job handling, and in a bad way. On my development machine submitting jobs to the cluster didn't seem to be working anymore (never sent to SGE). I killed Galaxy and restarted:

Starting server in PID 12180.
serving on http://127.0.0.1:8081
galaxy.jobs.runners.drmaa ERROR 2012-11-15 09:56:28,192 (320/None) Unable to check job status
Traceback (most recent call last):
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 296, in check_watched_items
    state = self.ds.jobStatus( job_id )
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/__init__.py", line 522, in jobStatus
    _h.c(_w.drmaa_job_ps, jobName, _ct.byref(status))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
InvalidArgumentException: code 4: Job id, "None", is not a valid job id
galaxy.jobs.runners.drmaa WARNING 2012-11-15 09:56:28,193 (320/None) job will now be errored
./run.sh: line 86: 12180 Segmentation fault (core dumped) python ./scripts/paster.py serve universe_wsgi.ini $@

I restarted and it happened again, third time lucky. I presume this was one segmentation fault for each orphaned/zombie job (since I'd tried two cluster jobs which got stuck).

I was running with revision 340438c62171,
https://bitbucket.org/galaxy/galaxy-central/changeset/340438c62171578078323d...
as merged into my tools branch,
https://bitbucket.org/peterjc/galaxy-central/changeset/d49200df0707579f41fc4...

Any thoughts?

Thanks,

Peter
On Thu, Nov 15, 2012 at 10:06 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Hi all,
Something has changed in the job handling, and in a bad way. On my development machine submitting jobs to the cluster didn't seem to be working anymore (never sent to SGE). I killed Galaxy and restarted: ... (segmentation fault)
Looking into the problem with submitting the jobs, there seems to be a problem with task splitting somehow recursing - the same file is split four times, the filename getting longer and longer:

galaxy.jobs DEBUG 2012-11-15 10:08:33,510 (321) Working directory for job is: /mnt/galaxy/galaxy-central/database/job_working_directory/000/321
galaxy.jobs.handler DEBUG 2012-11-15 10:08:33,510 dispatching job 321 to tasks runner
galaxy.jobs.handler INFO 2012-11-15 10:08:33,714 (321) Job dispatched
galaxy.datatypes.sequence DEBUG 2012-11-15 10:08:34,457 Split /mnt/galaxy/galaxy-central/database/files/000/dataset_344.dat into batches of 1000 records...
galaxy.datatypes.sequence DEBUG 2012-11-15 10:08:34,457 Attemping to split FASTA file /mnt/galaxy/galaxy-central/database/files/000/dataset_344.dat into chunks of 1000 sequences
galaxy.datatypes.sequence DEBUG 2012-11-15 10:08:34,458 Writing /mnt/galaxy/galaxy-central/database/files/000/dataset_344.dat part to /mnt/galaxy/galaxy-central/database/job_working_directory/000/321/task_0/dataset_344.dat
galaxy.jobs.splitters.multi DEBUG 2012-11-15 10:08:34,458 do_split created 1 parts
galaxy.jobs DEBUG 2012-11-15 10:08:34,558 (321) Working directory for job is: /mnt/galaxy/galaxy-central/database/job_working_directory/000/321
galaxy.jobs.handler DEBUG 2012-11-15 10:08:34,558 dispatching task 823, of job 321, to tasks runner
127.0.0.1 - - [15/Nov/2012:10:08:35 +0100] "POST /root/history_item_updates HTTP/1.1" 200 - "http://127.0.0.1:8081/history" "Mozilla/5.0 (X11; Linux x86_64; rv:10.0.8) Gecko/20121012 Firefox/10.0.8"
galaxy.datatypes.sequence DEBUG 2012-11-15 10:08:35,458 Split /mnt/galaxy/galaxy-central/database/files/000/dataset_344.dat into batches of 1000 records...
galaxy.datatypes.sequence DEBUG 2012-11-15 10:08:35,459 Attemping to split FASTA file /mnt/galaxy/galaxy-central/database/files/000/dataset_344.dat into chunks of 1000 sequences
galaxy.datatypes.sequence DEBUG 2012-11-15 10:08:35,459 Writing /mnt/galaxy/galaxy-central/database/files/000/dataset_344.dat part to /mnt/galaxy/galaxy-central/database/job_working_directory/000/321/task_0/task_0/dataset_344.dat
galaxy.jobs.splitters.multi DEBUG 2012-11-15 10:08:35,459 do_split created 1 parts
galaxy.jobs DEBUG 2012-11-15 10:08:35,541 (321) Working directory for job is: /mnt/galaxy/galaxy-central/database/job_working_directory/000/321
galaxy.jobs.handler DEBUG 2012-11-15 10:08:35,542 dispatching task 824, of job 321, to tasks runner
galaxy.datatypes.sequence DEBUG 2012-11-15 10:08:36,171 Split /mnt/galaxy/galaxy-central/database/files/000/dataset_344.dat into batches of 1000 records...
galaxy.datatypes.sequence DEBUG 2012-11-15 10:08:36,171 Attemping to split FASTA file /mnt/galaxy/galaxy-central/database/files/000/dataset_344.dat into chunks of 1000 sequences
galaxy.datatypes.sequence DEBUG 2012-11-15 10:08:36,171 Writing /mnt/galaxy/galaxy-central/database/files/000/dataset_344.dat part to /mnt/galaxy/galaxy-central/database/job_working_directory/000/321/task_0/task_0/task_0/dataset_344.dat
galaxy.jobs.splitters.multi DEBUG 2012-11-15 10:08:36,172 do_split created 1 parts
galaxy.jobs DEBUG 2012-11-15 10:08:36,232 (321) Working directory for job is: /mnt/galaxy/galaxy-central/database/job_working_directory/000/321
galaxy.jobs.handler DEBUG 2012-11-15 10:08:36,232 dispatching task 825, of job 321, to tasks runner
galaxy.datatypes.sequence DEBUG 2012-11-15 10:08:36,843 Split /mnt/galaxy/galaxy-central/database/files/000/dataset_344.dat into batches of 1000 records...
galaxy.datatypes.sequence DEBUG 2012-11-15 10:08:36,843 Attemping to split FASTA file /mnt/galaxy/galaxy-central/database/files/000/dataset_344.dat into chunks of 1000 sequences
galaxy.datatypes.sequence DEBUG 2012-11-15 10:08:36,843 Writing /mnt/galaxy/galaxy-central/database/files/000/dataset_344.dat part to /mnt/galaxy/galaxy-central/database/job_working_directory/000/321/task_0/task_0/task_0/task_0/dataset_344.dat
galaxy.jobs.splitters.multi DEBUG 2012-11-15 10:08:36,844 do_split created 1 parts
galaxy.jobs DEBUG 2012-11-15 10:08:36,906 (321) Working directory for job is: /mnt/galaxy/galaxy-central/database/job_working_directory/000/321
galaxy.jobs.handler DEBUG 2012-11-15 10:08:36,906 dispatching task 826, of job 321, to tasks runner

Hmm.

Peter
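The task_0/task_0/... nesting in the log above is what one would expect if each freshly written part were itself re-queued as another splittable job rooted at its parent's part directory. A toy sketch of that failure mode (illustrative only, not Galaxy code; split_once is a made-up name):

    import os.path

    def split_once(dataset, working_dir):
        # Each round of splitting writes its single part into a task_0
        # subdirectory of whatever working directory it was handed.
        part_dir = os.path.join(working_dir, "task_0")
        part_path = os.path.join(part_dir, os.path.basename(dataset))
        print("Writing %s part to %s" % (dataset, part_path))
        return part_dir

    # If the new part directory is then used as the working directory for
    # yet another round of splitting, the path grows without bound:
    work = "job_working_directory/000/321"
    for _ in range(4):
        work = split_once("dataset_344.dat", work)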
On Thu, Nov 15, 2012 at 10:12 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Thu, Nov 15, 2012 at 10:06 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Hi all,
Something has changed in the job handling, and in a bad way. On my development machine submitting jobs to the cluster didn't seem to be working anymore (never sent to SGE). I killed Galaxy and restarted: ... (segmentation fault)
Looking into the problem with submitting the jobs, there seems to be a problem with task splitting somehow recursing - the same file is split four times, the filename getting longer and longer:
Turning off task splitting I could run the same job OK on SGE. So, the good news is the problems seem to be specific to the task splitting code.

Also I have reproduced the segmentation fault when restarting Galaxy (after stopping Galaxy with one of these broken jobs).

Starting server in PID 17996.
serving on http://127.0.0.1:8081
galaxy.jobs.runners.drmaa ERROR 2012-11-15 11:07:27,762 (327/None) Unable to check job status
Traceback (most recent call last):
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 296, in check_watched_items
    state = self.ds.jobStatus( job_id )
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/__init__.py", line 522, in jobStatus
    _h.c(_w.drmaa_job_ps, jobName, _ct.byref(status))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
InvalidArgumentException: code 4: Job id, "None", is not a valid job id
galaxy.jobs.runners.drmaa WARNING 2012-11-15 11:07:27,764 (327/None) job will now be errored
./run.sh: line 86: 17996 Segmentation fault (core dumped) python ./scripts/paster.py serve universe_wsgi.ini $@

The problem is the job_id variable is "None" (note this is a string, not the Python special object None) in check_watched_items().

Peter
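One way a textual "None" can end up being polled is if an unset external id gets stringified somewhere on its way into or out of the database - Python's None becomes the four-character string "None", which passes a plain is-None check and is only rejected once it reaches drmaa_job_ps. A minimal illustration of that pitfall (not Galaxy code):

    external_id = None
    stored = str(external_id)   # stringified, this becomes the text "None"
    print(stored is None)       # False - it is a real string
    print(stored == "None")     # True - the "job id" the runner then polls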
On Thu, Nov 15, 2012 at 11:21 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Thu, Nov 15, 2012 at 10:12 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Thu, Nov 15, 2012 at 10:06 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Hi all,
Something has changed in the job handling, and in a bad way. On my development machine submitting jobs to the cluster didn't seem to be working anymore (never sent to SGE). I killed Galaxy and restarted: ... (segmentation fault)
Looking into the problem with submitting the jobs, there seems to be a problem with task splitting somehow recursing - the same file is split four times, the filename getting longer and longer:
Turning off task splitting I could run the same job OK on SGE.
So, the good news is the problems seem to be specific to the task splitting code. Also I have reproduced the segmentation fault when restarting Galaxy (after stopping Galaxy with one of these broken jobs).
Starting server in PID 17996.
serving on http://127.0.0.1:8081
galaxy.jobs.runners.drmaa ERROR 2012-11-15 11:07:27,762 (327/None) Unable to check job status
Traceback (most recent call last):
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 296, in check_watched_items
    state = self.ds.jobStatus( job_id )
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/__init__.py", line 522, in jobStatus
    _h.c(_w.drmaa_job_ps, jobName, _ct.byref(status))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
InvalidArgumentException: code 4: Job id, "None", is not a valid job id
galaxy.jobs.runners.drmaa WARNING 2012-11-15 11:07:27,764 (327/None) job will now be errored
./run.sh: line 86: 17996 Segmentation fault (core dumped) python ./scripts/paster.py serve universe_wsgi.ini $@
The problem is the job_id variable is "None" (note this is a string, not the Python special object None) in check_watched_items().
Peter
Is anyone else seeing this? I am wary of applying the update to our production Galaxy until I know how to resolve this (other than just by disabling task splitting).

Thanks,

Peter
On Nov 20, 2012, at 8:15 AM, Peter Cock wrote:
On Thu, Nov 15, 2012 at 11:21 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Thu, Nov 15, 2012 at 10:12 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Thu, Nov 15, 2012 at 10:06 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Hi all,
Something has changed in the job handling, and in a bad way. On my development machine submitting jobs to the cluster didn't seem to be working anymore (never sent to SGE). I killed Galaxy and restarted: ... (segmentation fault)
Looking into the problem with submitting the jobs, there seems to be a problem with task splitting somehow recursing - the same file is split four times, the filename getting longer and longer:
Turning off task splitting I could run the same job OK on SGE.
So, the good news is the problems seem to be specific to the task splitting code. Also I have reproduced the segmentation fault when restarting Galaxy (after stopping Galaxy with one of these broken jobs).
Starting server in PID 17996.
serving on http://127.0.0.1:8081
galaxy.jobs.runners.drmaa ERROR 2012-11-15 11:07:27,762 (327/None) Unable to check job status
Traceback (most recent call last):
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 296, in check_watched_items
    state = self.ds.jobStatus( job_id )
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/__init__.py", line 522, in jobStatus
    _h.c(_w.drmaa_job_ps, jobName, _ct.byref(status))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
InvalidArgumentException: code 4: Job id, "None", is not a valid job id
galaxy.jobs.runners.drmaa WARNING 2012-11-15 11:07:27,764 (327/None) job will now be errored
./run.sh: line 86: 17996 Segmentation fault (core dumped) python ./scripts/paster.py serve universe_wsgi.ini $@
The problem is the job_id variable is "None" (note this is a string, not the Python special object None) in check_watched_items().
Peter
Is anyone else seeing this? I am wary of applying the update to our production Galaxy until I know how to resolve this (other than just by disabling task splitting).
Hi Peter,

These look like two issues - in one, you've got task(s) in the database that do not have an external runner ID set, causing the drmaa runner to attempt to check the status of "None", resulting in the segfault. If you update the state of these tasks to something terminal, that should fix the issue with them. Of course, if the same thing happens with new jobs, then there's another issue.

I'm trying to reproduce the working directory behavior but have been unsuccessful. Do you have any local modifications to the splitting or jobs code?

--nate
Thanks,
Peter
On Tue, Nov 27, 2012 at 4:50 PM, Nate Coraor <nate@bx.psu.edu> wrote:
On Nov 20, 2012, at 8:15 AM, Peter Cock wrote:
Is anyone else seeing this? I am wary of applying the update to our production Galaxy until I know how to resolve this (other than just by disabling task splitting).
Hi Peter,
These look like two issues - in one, you've got task(s) in the database that do not have an external runner ID set, causing the drmaa runner to attempt to check the status of "None", resulting in the segfault.
So a little defensive coding could prevent the segfault then (leaving the separate issue of why the jobs lack this information)?
If you update the state of these tasks to something terminal, that should fix the issue with them.
You mean manually in the database? Restarting Galaxy seemed to achieve that in a round-about way.
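For a development install on Galaxy's default SQLite database, the manual fix Nate describes might look something like the sketch below - the table and column names (task, state, task_runner_external_id) and the state values are recalled from the Galaxy model rather than checked against this revision, so verify them against your schema and back up the database first:

    # Hypothetical cleanup: mark tasks that never got an external runner id
    # as errored so the drmaa runner stops polling the job id "None".
    # Table/column names and state values are assumptions - check your schema.
    import sqlite3

    conn = sqlite3.connect("database/universe.sqlite")
    conn.execute(
        "UPDATE task SET state = 'error' "
        "WHERE task_runner_external_id IS NULL "
        "AND state NOT IN ('ok', 'error', 'deleted')"
    )
    conn.commit()
    conn.close()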
Of course, if the same thing happens with new jobs, then there's another issue.
This was a week ago, but yes, at the time it was reproducible with new jobs.
I'm trying to reproduce the working directory behavior but have been unsuccessful. Do you have any local modifications to the splitting or jobs code?
This was running on my tools branch, which shouldn't be changing Galaxy itself in any meaningful way (a few local variables did get accidentally checked into my run.sh file etc but otherwise I only try to modify new files specific to my individual tool wrappers):

https://bitbucket.org/peterjc/galaxy-central/src/tools

[galaxy@ppserver galaxy-central]$ hg branch
tools
[galaxy@ppserver galaxy-central]$ hg log -b tools | head -n 8
changeset:   8807:d49200df0707
branch:      tools
tag:         tip
parent:      8712:959ee7c79fd2
parent:      8806:340438c62171
user:        peterjc <p.j.a.cock@googlemail.com>
date:        Thu Nov 15 09:38:57 2012 +0000
summary:     Merged default into my tools branch

The only deliberate change was to try and debug this,

[galaxy@ppserver galaxy-central]$ hg diff
diff -r d49200df0707 lib/galaxy/jobs/runners/drmaa.py
--- a/lib/galaxy/jobs/runners/drmaa.py  Thu Nov 15 09:38:57 2012 +0000
+++ b/lib/galaxy/jobs/runners/drmaa.py  Tue Nov 27 17:00:04 2012 +0000
@@ -291,8 +291,15 @@
         for drm_job_state in self.watched:
             job_id = drm_job_state.job_id
             galaxy_job_id = drm_job_state.job_wrapper.job_id
+            if job_id is None or job_id=="None":
+                log.exception("(%s/%r) Unable to check job status none" % ( galaxy_job_id, job_id ) )
+                #drm_job_state.fail_message = "Cluster could not complete job (job_id None)"
+                #Ignore it?
+                #self.work_queue.put( ( 'fail', drm_job_state ) )
+                continue
             old_state = drm_job_state.old_state
             try:
+                assert job_id is not None and job_id != "None"
                 state = self.ds.jobStatus( job_id )
                 # InternalException was reported to be necessary on some DRMs, but
                 # this could cause failures to be detected as completion! Please

I'm about to go home for the day but should be able to look into this tomorrow, e.g. update to the latest default branch.

Thanks,

Peter
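Stripped of the commented-out experiments, the workaround in the diff above comes down to a single decision for each watched entry whose external id is missing: skip it and keep going, or fail the job outright. A self-contained sketch of that choice (names invented for illustration; the real logic lives in check_watched_items):

    # Illustrative only: decide what to do with a watched job whose external
    # DRM id was never recorded, rather than passing the string "None" on to
    # the DRMAA library (which here brought the whole process down).
    def classify_watched_id(job_id):
        if job_id is None or job_id == "None":
            return "skip"   # or "fail" to error the job immediately
        return "poll"

    print(classify_watched_id("12345"))  # poll
    print(classify_watched_id("None"))   # skip
    print(classify_watched_id(None))     # skip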
On Nov 27, 2012, at 12:03 PM, Peter Cock wrote:
On Tue, Nov 27, 2012 at 4:50 PM, Nate Coraor <nate@bx.psu.edu> wrote:
On Nov 20, 2012, at 8:15 AM, Peter Cock wrote:
Is anyone else seeing this? I am wary of applying the update to our production Galaxy until I know how to resolve this (other than just by disabling task splitting).
Hi Peter,
These look like two issues - in one, you've got task(s) in the database that do not have an external runner ID set, causing the drmaa runner to attempt to check the status of "None", resulting in the segfault.
So a little defensive coding could prevent the segfault then (leaving the separate issue of why the jobs lack this information)?
Indeed, I pushed a check for this in 4a95ae9a26d9.
If you update the state of these tasks to something terminal, that should fix the issue with them.
You mean manually in the database? Restarting Galaxy seemed to achieve that in a round-about way.
Of course, if the same thing happens with new jobs, then there's another issue.
This was a week ago, but yes, at the time it was reproducible with new jobs.
Is that to say it's still happening and you've simply worked around it (by disabling tasks), or that it is no longer happening?
I'm trying to reproduce the working directory behavior but have been unsuccessful. Do you have any local modifications to the splitting or jobs code?
This was running on my tools branch, which shouldn't be changing Galaxy itself in any meaningful way (a few local variables did get accidentally checked into my run.sh file etc but otherwise I only try to modify new files specific to my individual tool wrappers):
https://bitbucket.org/peterjc/galaxy-central/src/tools
[galaxy@ppserver galaxy-central]$ hg branch
tools
[galaxy@ppserver galaxy-central]$ hg log -b tools | head -n 8
changeset:   8807:d49200df0707
branch:      tools
tag:         tip
parent:      8712:959ee7c79fd2
parent:      8806:340438c62171
user:        peterjc <p.j.a.cock@googlemail.com>
date:        Thu Nov 15 09:38:57 2012 +0000
summary:     Merged default into my tools branch

The only deliberate change was to try and debug this,

[galaxy@ppserver galaxy-central]$ hg diff
diff -r d49200df0707 lib/galaxy/jobs/runners/drmaa.py
--- a/lib/galaxy/jobs/runners/drmaa.py  Thu Nov 15 09:38:57 2012 +0000
+++ b/lib/galaxy/jobs/runners/drmaa.py  Tue Nov 27 17:00:04 2012 +0000
@@ -291,8 +291,15 @@
         for drm_job_state in self.watched:
             job_id = drm_job_state.job_id
             galaxy_job_id = drm_job_state.job_wrapper.job_id
+            if job_id is None or job_id=="None":
+                log.exception("(%s/%r) Unable to check job status none" % ( galaxy_job_id, job_id ) )
+                #drm_job_state.fail_message = "Cluster could not complete job (job_id None)"
+                #Ignore it?
+                #self.work_queue.put( ( 'fail', drm_job_state ) )
+                continue
             old_state = drm_job_state.old_state
             try:
+                assert job_id is not None and job_id != "None"
                 state = self.ds.jobStatus( job_id )
                 # InternalException was reported to be necessary on some DRMs, but
                 # this could cause failures to be detected as completion! Please
I'm about to go home for the day but should be able to look into this tomorrow, e.g. update to the latest default branch.
Great, thanks. --nate
Thanks,
Peter
On Tue, Nov 27, 2012 at 5:19 PM, Nate Coraor <nate@bx.psu.edu> wrote:
So a little defensive coding could prevent the segfault then (leaving the separate issue of why the jobs lack this information)?
Indeed, I pushed a check for this in 4a95ae9a26d9.
Great. That will help.
This was a week ago, but yes, at the time it was reproducible with new jobs.
Is that to say it's still happening and you've simply worked around it (by disabling tasks), or that it is no longer happening?
I've not tried it for a week - it was my development install that tracks galaxy-central which showed the problem, so I avoided updating our production install until resolving it. I'll see how it behaves later this week (although there is a mass job hogging the cluster queue which may complicate matters and reduce the turnaround).

Thanks,

Peter
participants (2)
- Nate Coraor
- Peter Cock