On Nov 20, 2012, at 8:15 AM, Peter Cock wrote:
On Thu, Nov 15, 2012 at 11:21 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Thu, Nov 15, 2012 at 10:12 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Thu, Nov 15, 2012 at 10:06 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Hi all,
Something has changed in the job handling, and in a bad way. On my development machine, submitting jobs to the cluster no longer seemed to work (they were never sent to SGE). I killed Galaxy and restarted: ... (segmentation fault)
Looking into the problem with submitting the jobs, there seems to be a problem with task splitting somehow recursing - the same file is split four times, the filename getting longer and longer:
Turning off task splitting I could run the same job OK on SGE.
So, the good news is the problems seem to be specific to the task splitting code. Also I have reproduced the segmentation fault when restarting Galaxy (after stopping Galaxy with one of these broken jobs).
Starting server in PID 17996.
serving on http://127.0.0.1:8081
galaxy.jobs.runners.drmaa ERROR 2012-11-15 11:07:27,762 (327/None) Unable to check job status
Traceback (most recent call last):
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 296, in check_watched_items
    state = self.ds.jobStatus( job_id )
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/__init__.py", line 522, in jobStatus
    _h.c(_w.drmaa_job_ps, jobName, _ct.byref(status))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
InvalidArgumentException: code 4: Job id, "None", is not a valid job id
galaxy.jobs.runners.drmaa WARNING 2012-11-15 11:07:27,764 (327/None) job will now be errored
./run.sh: line 86: 17996 Segmentation fault (core dumped) python ./scripts/paster.py serve universe_wsgi.ini $@
The problem is the job_id variable is "None" (note this is a string, not the Python special object None) in check_watched_items().
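To illustrate the kind of check that would avoid handing a bogus id to the DRMAA library, here is a minimal sketch. The helper names has_valid_external_id and partition_watched are hypothetical and are not part of Galaxy or the drmaa egg; the sketch only shows filtering out both the None object and the string "None" before any status call is made.

```python
def has_valid_external_id(job_id):
    """Return True only if job_id could plausibly be a real DRM job id.

    Hypothetical helper: rejects the None object, empty strings, and the
    literal string "None" (what you get from str(None)).
    """
    if job_id is None:
        return False
    text = str(job_id).strip()
    return text != "" and text != "None"


def partition_watched(job_ids):
    """Split ids into (checkable, broken) without touching the DRM."""
    checkable, broken = [], []
    for job_id in job_ids:
        if has_valid_external_id(job_id):
            checkable.append(job_id)
        else:
            broken.append(job_id)
    return checkable, broken
```

With something like this, only the ids in the first list would be passed on to jobStatus(), and the jobs in the second list could be errored directly.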
Peter
Is anyone else seeing this? I am wary of applying the update to our production Galaxy until I know how to resolve this (other than just by disabling task splitting).
Hi Peter,

These look like two issues. In one, you've got task(s) in the database that do not have an external runner ID set, causing the drmaa runner to attempt to check the status of "None", resulting in the segfault. If you update the state of these tasks to something terminal, that should fix the issue with them. Of course, if the same thing happens with new jobs, then there's another issue.

I'm trying to reproduce the working directory behavior but have been unsuccessful. Do you have any local modifications to the splitting or jobs code?

--nate
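As a rough sketch of what "update the state of these tasks to something terminal" could look like, the following uses an in-memory SQLite database. The table name (task), the column names (state, task_runner_external_id) and the state strings are assumptions made for illustration and should be checked against the actual Galaxy schema before anything similar is run on a real database (after taking a backup).

```python
import sqlite3

TERMINAL_STATE = "error"  # assumed name of a terminal state


def error_tasks_without_runner_id(conn):
    """Mark unfinished tasks that have no external runner id as errored.

    Table and column names are hypothetical; returns the number of rows changed.
    """
    cur = conn.execute(
        "UPDATE task SET state = ? "
        "WHERE task_runner_external_id IS NULL "
        "AND state IN ('queued', 'running')",
        (TERMINAL_STATE,),
    )
    conn.commit()
    return cur.rowcount


def demo():
    """Build a toy task table and apply the update to it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task ("
        "id INTEGER PRIMARY KEY, state TEXT, task_runner_external_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO task (state, task_runner_external_id) VALUES (?, ?)",
        [("running", None), ("running", "4242"), ("ok", None)],
    )
    changed = error_tasks_without_runner_id(conn)
    rows = list(conn.execute("SELECT id, state FROM task ORDER BY id"))
    conn.close()
    return changed, rows
```

In the toy data, only the running task with no external id is touched; the task that has an id and the task that already finished are left alone.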
Thanks,
Peter