On Thu, Nov 15, 2012 at 11:21 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Thu, Nov 15, 2012 at 10:12 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Thu, Nov 15, 2012 at 10:06 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Hi all,
Something has changed in the job handling, and in a bad way. On my development machine, submitting jobs to the cluster no longer seems to work (the jobs are never sent to SGE). I killed Galaxy and restarted: ... (segmentation fault)
Looking into the problem with submitting the jobs, there seems to be a problem with task splitting somehow recursing - the same file is split four times, the filename getting longer and longer:
Turning off task splitting I could run the same job OK on SGE.
So, the good news is the problems seem to be specific to the task splitting code. Also I have reproduced the segmentation fault when restarting Galaxy (after stopping Galaxy with one of these broken jobs).
Starting server in PID 17996.
serving on http://127.0.0.1:8081
galaxy.jobs.runners.drmaa ERROR 2012-11-15 11:07:27,762 (327/None) Unable to check job status
Traceback (most recent call last):
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 296, in check_watched_items
    state = self.ds.jobStatus( job_id )
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/__init__.py", line 522, in jobStatus
    _h.c(_w.drmaa_job_ps, jobName, _ct.byref(status))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
InvalidArgumentException: code 4: Job id, "None", is not a valid job id
galaxy.jobs.runners.drmaa WARNING 2012-11-15 11:07:27,764 (327/None) job will now be errored
./run.sh: line 86: 17996 Segmentation fault (core dumped) python ./scripts/paster.py serve universe_wsgi.ini $@
The problem is that the job_id variable in check_watched_items() is "None" (note this is a string, not the Python special object None), and that string is what gets passed to jobStatus().
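To illustrate the string-versus-object distinction, here is a minimal sketch of a guard that rejects both the None object and the string "None" before any status query is made. The helper names is_valid_job_id and check_status are hypothetical and are not part of Galaxy or the drmaa egg; status_lookup stands in for a call such as self.ds.jobStatus. This would only avoid the invalid query. It does not explain why the job never received a real id in the first place.

```python
def is_valid_job_id(job_id):
    """Return True only for a usable external job id.

    Hypothetical helper: rejects the None object, the string "None"
    and empty or whitespace-only strings.
    """
    if job_id is None:
        return False
    text = str(job_id).strip()
    return text != "" and text != "None"


def check_status(job_id, status_lookup):
    """Call status_lookup(job_id) only when the id looks usable.

    status_lookup is a stand-in for the real status call (an assumption,
    not the actual Galaxy code). Returns None when the id is rejected,
    so the caller can mark the job as errored without querying the DRM.
    """
    if not is_valid_job_id(job_id):
        return None
    return status_lookup(job_id)
```

Note that a plain `job_id is None` test would not catch this case, because the value here is the four-character string "None".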
Peter
Is anyone else seeing this? I am wary of applying the update to our production Galaxy until I know how to resolve this (other than just by disabling task splitting).

Thanks,

Peter