Galaxy not killing split cluster jobs
Hi all,

We're running our Galaxy with an SGE cluster, using the DRMAA support in Galaxy, and job splitting. I've noticed that if the user cancels a job (that was running or queued on the cluster), the job shows as deleted in Galaxy, but looking at the queue on the cluster with qstat shows it persists. I've not seen anything similar reported except for this PBS issue: http://lists.bx.psu.edu/pipermail/galaxy-dev/2010-October/003633.html

When I don't use job splitting, cancelling jobs seems to work:

galaxy.jobs.handler DEBUG 2012-05-01 14:46:47,755 stopping job 57 in drmaa runner
galaxy.jobs.runners.drmaa DEBUG 2012-05-01 14:46:47,756 (57/26504) Being killed...
galaxy.jobs.runners.drmaa DEBUG 2012-05-01 14:46:47,757 (57/26504) Removed from DRM queue at user's request
galaxy.jobs.runners.drmaa DEBUG 2012-05-01 14:46:48,441 (57/26504) state change: job finished, but failed
galaxy.jobs.runners.drmaa DEBUG 2012-05-01 14:46:48,441 Job output not returned from cluster

When I am using job splitting, cancelling jobs fails:

galaxy.jobs.handler DEBUG 2012-05-01 14:28:30,364 stopping job 56 in tasks runner
galaxy.jobs.runners.tasks WARNING 2012-05-01 14:28:30,386 stop_job(): 56: no PID in database for job, unable to stop

That warning comes from lib/galaxy/jobs/runners/tasks.py, which starts:

def stop_job( self, job ):
    # DBTODO Call stop on all of the tasks.
    #if our local job has JobExternalOutputMetadata associated, then our primary job has to have already finished
    if job.external_output_metadata:
        pid = job.external_output_metadata[0].job_runner_external_pid #every JobExternalOutputMetadata has a pid set, we just need to take from one of them
    else:
        pid = job.job_runner_external_id
    if pid in [ None, '' ]:
        log.warning( "stop_job(): %s: no PID in database for job, unable to stop" % job.id )
        return
    pid = int( pid )
    ...

I'm a little confused about tasks.py vs drmaa.py, but that TODO comment looks pertinent. Is that the problem here?

Regards,

Peter
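(For reference, the "Removed from DRM queue at user's request" step in the working non-split case corresponds to a DRMAA terminate call. A minimal, illustrative sketch using the Python drmaa bindings, with the external SGE job id 26504 from the log above used purely as an example:)

import drmaa

# Open a session with the DRM (SGE here, via its DRMAA library)
session = drmaa.Session()
session.initialize()
try:
    # Equivalent to "qdel 26504" on SGE: ask the DRM to terminate the job
    session.control( "26504", drmaa.JobControlAction.TERMINATE )
finally:
    session.exit()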
On May 1, 2012, at 9:51 AM, Peter Cock wrote:
I'm a little confused about tasks.py vs drmaa.py but that TODO comment looks pertinent. Is that the problem here?
The runner in tasks.py is what executes the primary job, splitting and creating the tasks. The tasks themselves are actually injected back into the regular job queue and run as normal jobs with the usual runners (in your case drmaa).

And, yes, it should be fairly straightforward to add, but this just hasn't been implemented yet.

-Dannon
On Tue, May 1, 2012 at 3:03 PM, Dannon Baker <dannonbaker@me.com> wrote:
On May 1, 2012, at 9:51 AM, Peter Cock wrote:
I'm a little confused about tasks.py vs drmaa.py but that TODO comment looks pertinent. Is that the problem here?
The runner in tasks.py is what executes the primary job, splitting and creating the tasks. The tasks themselves are actually injected back into the regular job queue and run as normal jobs with the usual runners (in your case drmaa).
And, yes, it should be fairly straightforward to add, but this just hasn't been implemented yet.
So the stop_job method for the runner in tasks.py needs to call the stop_job method of each of the child tasks it created for that job (which in this case are drmaa jobs, but could be PBS etc. jobs). I'm not really clear how all that works.

Should I open an issue on this?

Peter
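(A rough, untested sketch of the idea described above, not the actual Galaxy implementation. It assumes each child Task records which runner dispatched it and its external DRM id as task_runner_name / task_runner_external_id, and that the runner can hand tasks back to the dispatcher to be stopped; the real attribute and method names in Galaxy may differ:)

def stop_job( self, job ):
    # Sketch: instead of looking for a local PID, stop each child task
    # via whichever runner (drmaa, pbs, ...) actually dispatched it.
    for task in job.tasks:
        # Skip tasks that have already reached a terminal state
        if task.state in ( 'ok', 'error', 'deleted' ):
            continue
        if not task.task_runner_external_id:  # assumed attribute: the external DRM job id
            log.warning( "stop_job(): %s: task %s has no external id, unable to stop" % ( job.id, task.id ) )
            continue
        # Assumed hook: delegate to the per-runner stop (e.g. the drmaa runner's terminate)
        self.app.job_manager.dispatcher.stop( task )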
I'll take care of it. Thanks for reminding me about the TODO!

On May 1, 2012, at 10:03 AM, Dannon Baker <dannonbaker@me.com> wrote:
On May 1, 2012, at 9:51 AM, Peter Cock wrote:
I'm a little confused about tasks.py vs drmaa.py but that TODO comment looks pertinent. Is that the problem here?
The runner in tasks.py is what executes the primary job, splitting and creating the tasks. The tasks themselves are actually injected back into the regular job queue and run as normal jobs with the usual runners (in your case drmaa).
And, yes, it should be fairly straightforward to add, but this just hasn't been implemented yet.
-Dannon
On Tue, May 1, 2012 at 3:46 PM, Dannon Baker <dannonbaker@me.com> wrote:
I'll take care of it. Thanks for reminding me about the TODO!
On a related point, I've noticed sometimes one child job from a split task can fail, yet the rest of the child jobs continue to run on the cluster wasting CPU time. As soon as one child job dies (assuming there are no plans for attempting a retry), I would like the parent task to kill all the other children, and fail itself. I suppose you could merge the output of any children which did finish... but it would be simpler not to bother.

Regards,

Peter
On a related point, I've noticed sometimes one child job from a split task can fail, yet the rest of the child jobs continue to run on the cluster wasting CPU time. As soon as one child job dies (assuming there are no plans for attempting a retry), I would like the parent task to kill all the other children, and fail itself. I suppose you could merge the output of any children which did finish... but it would be simpler not to bother.
Right now, yes, this would make sense - I'll see about adding it. Ultimately we want to build in a mechanism for retrying child tasks that fail due to cluster errors, etc., so it isn't necessary to rerun the entire job.

-Dannon
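(As an illustration of the behaviour being discussed, rather than Galaxy's actual implementation: while monitoring a split job, the tasks runner could notice that one child task has errored, cancel the remaining siblings and fail the parent job. A minimal sketch, using hypothetical cancel_task and fail_job helpers:)

def check_tasks( self, job, tasks ):
    # Illustrative only: if any child task has failed, cancel the rest
    # so they stop wasting CPU time on the cluster, and fail the parent.
    if any( task.state == 'error' for task in tasks ):
        for task in tasks:
            if task.state not in ( 'ok', 'error', 'deleted' ):
                self.cancel_task( task )  # hypothetical helper that removes the task's DRM job
        self.fail_job( job, "one or more child tasks failed" )  # hypothetical helper
        return False  # parent job has failed; stop monitoring it
    return True  # keep monitoring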
On Thu, May 3, 2012 at 3:54 PM, Dannon Baker <dannonbaker@me.com> wrote:
On a related point, I've noticed sometimes one child job from a split task can fail, yet the rest of the child jobs continue to run on the cluster wasting CPU time. As soon as one child job dies (assuming there are no plans for attempting a retry), I would like the parent task to kill all the other children, and fail itself. I suppose you could merge the output of any children which did finish... but it would be simpler not to bother.
Right now, yes, this would make sense- I'll see about adding it.
Great.
Ultimately we want to build in a mechanism for retrying child tasks that fail due to cluster errors, etc, so it isn't necessary to rerun the entire job.
That could be helpful - but also rather fiddly in terms of detecting when it is appropriate to retry a job or not. For the split tasks, right now I'm finding some child jobs fail when the OS kills them for running out of RAM - in which case a neat idea would be to further sub-divide the jobs and resubmit. This is probably over-engineering though... KISS principle.

Peter
On Tue, May 1, 2012 at 3:46 PM, Dannon Baker <dannonbaker@me.com> wrote:
On Tue, May 1, 2012 at 3:10 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On May 1, 2012, at 10:03 AM, Dannon Baker <dannonbaker@me.com> wrote:
On May 1, 2012, at 9:51 AM, Peter Cock wrote:
I'm a little confused about tasks.py vs drmaa.py but that TODO comment looks pertinent. Is that the problem here?
The runner in tasks.py is what executes the primary job, splitting and creating the tasks. The tasks themselves are actually injected back into the regular job queue and run as normal jobs with the usual runners (in your case drmaa).
And, yes, it should be fairly straightforward to add, but this just hasn't been implemented yet.
-Dannon
So the stop_job method for the runner in task.py needs to call the stop_job method of each of the child tasks it created for that job (which in this case are drmaa jobs - but could be pbs etc jobs). I'm not really clear how all that works.
Should I open an issue on this?
Peter
I'll take care of it. Thanks for reminding me about the TODO!
Hi Dannon,

Is this any nearer the top of your TODO list? I was reminded by having to log onto our cluster today and issue a bunch of SGE qdel commands to manually kill a job which was hogging the queue, but had been deleted in Galaxy.

Thanks,

Peter
On Tue, May 1, 2012 at 3:46 PM, Dannon Baker <dannonbaker@me.com> wrote:
I'll take care of it. Thanks for reminding me about the TODO!
This seems to have reached galaxy-central now: https://bitbucket.org/galaxy/galaxy-central/changeset/dc20a7b5b6ce

i.e. When Galaxy creates sub-jobs from tools using the <parallelism> tag to split tasks over the cluster, if the user kills the parent job, the child jobs should get killed too.

That will be appreciated next time our cluster is heavily loaded :)

Thanks,

Peter
A suggested change will be coming down the pipe shortly, but it's good to hear that it will be useful!

-Scott

----- Original Message -----
On Tue, May 1, 2012 at 3:46 PM, Dannon Baker <dannonbaker@me.com> wrote:
I'll take care of it. Thanks for reminding me about the TODO!
This seems to have reached galaxy-central now: https://bitbucket.org/galaxy/galaxy-central/changeset/dc20a7b5b6ce
i.e. When Galaxy creates sub-jobs from tools using the <parallelism> tag to split tasks over the cluster, if the user kills the parent job, the child jobs should get killed too.
That will be appreciated next time our cluster is heavily loaded :)
Thanks,
Peter
participants (3)
- Dannon Baker
- Peter Cock
- Scott McManus