Hi all,

Something has changed in the job handling, and in a bad way. On my development machine, submitting jobs to the cluster didn't seem to be working anymore (they were never sent to SGE). I killed Galaxy and restarted:

Starting server in PID 12180.
serving on http://127.0.0.1:8081
galaxy.jobs.runners.drmaa ERROR 2012-11-15 09:56:28,192 (320/None) Unable to check job status
Traceback (most recent call last):
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 296, in check_watched_items
    state = self.ds.jobStatus( job_id )
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/__init__.py", line 522, in jobStatus
    _h.c(_w.drmaa_job_ps, jobName, _ct.byref(status))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
InvalidArgumentException: code 4: Job id, "None", is not a valid job id
galaxy.jobs.runners.drmaa WARNING 2012-11-15 09:56:28,193 (320/None) job will now be errored
./run.sh: line 86: 12180 Segmentation fault (core dumped) python ./scripts/paster.py serve universe_wsgi.ini $@

I restarted and it happened again; third time lucky. I presume this was one segmentation fault for each orphaned/zombie job (since I'd tried two cluster jobs which got stuck).

I was running with revision 340438c62171,
https://bitbucket.org/galaxy/galaxy-central/changeset/340438c62171578078323d...
as merged into my tools branch,
https://bitbucket.org/peterjc/galaxy-central/changeset/d49200df0707579f41fc4...

Any thoughts?

Thanks,

Peter
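
P.S. For what it's worth, the traceback suggests check_watched_items is passing a job id of None (from a job whose submission never completed) straight into DRMAA's jobStatus, which raises InvalidArgumentException. A purely hypothetical sketch of the kind of guard I mean is below; the job-dict shape and helper name are made up for illustration and are not the actual Galaxy code:

```python
# Hypothetical sketch, not the real drmaa runner: skip/error jobs whose
# external job_id is None instead of handing "None" to DRMAA's jobStatus
# (which fails with "code 4: Job id, \"None\", is not a valid job id").
def check_watched_items(watched, job_status):
    """Partition watched jobs into (still_watched, errored).

    watched    -- list of dicts with a "job_id" key (None if never submitted)
    job_status -- callable taking a job id and returning its state
    """
    still_watched, errored = [], []
    for job in watched:
        job_id = job.get("job_id")
        if job_id is None:
            # Submission never reached SGE; error the job rather than
            # querying DRMAA with an invalid id.
            errored.append(job)
            continue
        job["state"] = job_status(job_id)
        still_watched.append(job)
    return still_watched, errored
```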