Galaxy not killing split cluster jobs
Hi all,

We're running our Galaxy with an SGE cluster, using the DRMAA support in Galaxy, and job splitting. I've noticed that if the user cancels a job (that was running or queued on the cluster), the job shows as deleted in Galaxy, but looking at the queue on the cluster with qstat shows it persists. I've not seen anything similar reported except for this PBS issue: http://lists.bx.psu.edu/pipermail/galaxy-dev/2010-October/003633.html

When I don't use job splitting, cancelling jobs seems to work:

galaxy.jobs.handler DEBUG 2012-05-01 14:46:47,755 stopping job 57 in drmaa runner
galaxy.jobs.runners.drmaa DEBUG 2012-05-01 14:46:47,756 (57/26504) Being killed...
galaxy.jobs.runners.drmaa DEBUG 2012-05-01 14:46:47,757 (57/26504) Removed from DRM queue at user's request
galaxy.jobs.runners.drmaa DEBUG 2012-05-01 14:46:48,441 (57/26504) state change: job finished, but failed
galaxy.jobs.runners.drmaa DEBUG 2012-05-01 14:46:48,441 Job output not returned from cluster

When I am using job splitting, cancelling jobs fails:

galaxy.jobs.handler DEBUG 2012-05-01 14:28:30,364 stopping job 56 in tasks runner
galaxy.jobs.runners.tasks WARNING 2012-05-01 14:28:30,386 stop_job(): 56: no PID in database for job, unable to stop

That warning comes from lib/galaxy/jobs/runners/tasks.py, which starts:

def stop_job( self, job ):
    # DBTODO Call stop on all of the tasks.
    #if our local job has JobExternalOutputMetadata associated, then our primary job has to have already finished
    if job.external_output_metadata:
        pid = job.external_output_metadata[0].job_runner_external_pid #every JobExternalOutputMetadata has a pid set, we just need to take from one of them
    else:
        pid = job.job_runner_external_id
    if pid in [ None, '' ]:
        log.warning( "stop_job(): %s: no PID in database for job, unable to stop" % job.id )
        return
    pid = int( pid )
    ...

I'm a little confused about tasks.py vs drmaa.py, but that TODO comment looks pertinent. Is that the problem here?

Regards,

Peter
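(For reference, the "Removed from DRM queue at user's request" step in the working non-split case corresponds to a DRMAA terminate call. A minimal, illustrative sketch using the Python drmaa bindings, with the external SGE job id 26504 from the log above used purely as an example:)

import drmaa

# Open a session with the DRM (SGE here, via its DRMAA library)
session = drmaa.Session()
session.initialize()
try:
    # Equivalent to "qdel 26504" on SGE: ask the DRM to terminate the job
    session.control( "26504", drmaa.JobControlAction.TERMINATE )
finally:
    session.exit()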
On May 1, 2012, at 9:51 AM, Peter Cock wrote:
I'm a little confused about tasks.py vs drmaa.py but that TODO comment looks pertinent. Is that the problem here?
The runner in tasks.py is what executes the primary job, splitting and creating the tasks. The tasks themselves are actually injected back into the regular job queue and run as normal jobs with the usual runners (in your case drmaa).

And, yes, it should be fairly straightforward to add, but this just hasn't been implemented yet.

-Dannon
On Tue, May 1, 2012 at 3:03 PM, Dannon Baker <dannonbaker@me.com> wrote:
On May 1, 2012, at 9:51 AM, Peter Cock wrote:
I'm a little confused about tasks.py vs drmaa.py but that TODO comment looks pertinent. Is that the problem here?
The runner in tasks.py is what executes the primary job, splitting and creating the tasks. The tasks themselves are actually injected back into the regular job queue and run as normal jobs with the usual runners (in your case drmaa).
And, yes, it should be fairly straightforward to add, but this just hasn't been implemented yet.
So the stop_job method for the runner in tasks.py needs to call the stop_job method of each of the child tasks it created for that job (which in this case are drmaa jobs, but could be PBS etc. jobs). I'm not really clear how all that works.

Should I open an issue on this?

Peter
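(A rough, untested sketch of the idea described above, not the actual Galaxy implementation. It assumes each child Task records which runner dispatched it and its external DRM id as task_runner_name / task_runner_external_id, and that the runner can hand tasks back to the dispatcher to be stopped; the real attribute and method names in Galaxy may differ:)

def stop_job( self, job ):
    # Sketch: instead of looking for a local PID, stop each child task
    # via whichever runner (drmaa, pbs, ...) actually dispatched it.
    for task in job.tasks:
        # Skip tasks that have already reached a terminal state
        if task.state in ( 'ok', 'error', 'deleted' ):
            continue
        if not task.task_runner_external_id:  # assumed attribute: the external DRM job id
            log.warning( "stop_job(): %s: task %s has no external id, unable to stop" % ( job.id, task.id ) )
            continue
        # Assumed hook: delegate to the per-runner stop (e.g. the drmaa runner's terminate)
        self.app.job_manager.dispatcher.stop( task )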
I'll take care of it. Thanks for reminding me about the TODO!

On May 1, 2012, at 10:03 AM, Dannon Baker <dannonbaker@me.com> wrote:
On May 1, 2012, at 9:51 AM, Peter Cock wrote:
I'm a little confused about tasks.py vs drmaa.py but that TODO comment looks pertinent. Is that the problem here?
The runner in tasks.py is what executes the primary job, splitting and creating the tasks. The tasks themselves are actually injected back into the regular job queue and run as normal jobs with the usual runners (in your case drmaa).
And, yes, it should be fairly straightforward to add, but this just hasn't been implemented yet.
-Dannon
On Tue, May 1, 2012 at 3:46 PM, Dannon Baker <dannonbaker@me.com> wrote:
I'll take care of it. Thanks for reminding me about the TODO!
On a related point, I've noticed sometimes one child job from a split task can fail, yet the rest of the child jobs continue to run on the cluster wasting CPU time. As soon as one child job dies (assuming there are no plans for attempting a retry), I would like the parent task to kill all the other children, and fail itself. I suppose you could merge the output of any children which did finish... but it would be simpler not to bother.

Regards,

Peter
On a related point, I've noticed sometimes one child job from a split task can fail, yet the rest of the child jobs continue to run on the cluster wasting CPU time. As soon as one child job dies (assuming there are no plans for attempting a retry), I would like the parent task to kill all the other children, and fail itself. I suppose you could merge the output of any children which did finish... but it would be simpler not to bother.
Right now, yes, this would make sense - I'll see about adding it. Ultimately we want to build in a mechanism for retrying child tasks that fail due to cluster errors, etc., so it isn't necessary to rerun the entire job.

-Dannon
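(As an illustration of the behaviour being discussed, rather than Galaxy's actual implementation: while monitoring a split job, the tasks runner could notice that one child task has errored, cancel the remaining siblings and fail the parent job. A minimal sketch, using hypothetical cancel_task and fail_job helpers:)

def check_tasks( self, job, tasks ):
    # Illustrative only: if any child task has failed, cancel the rest
    # so they stop wasting CPU time on the cluster, and fail the parent.
    if any( task.state == 'error' for task in tasks ):
        for task in tasks:
            if task.state not in ( 'ok', 'error', 'deleted' ):
                self.cancel_task( task )  # hypothetical helper that removes the task's DRM job
        self.fail_job( job, "one or more child tasks failed" )  # hypothetical helper
        return False  # parent job has failed; stop monitoring it
    return True  # keep monitoring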
On Thu, May 3, 2012 at 3:54 PM, Dannon Baker <dannonbaker@me.com> wrote:
On a related point, I've noticed sometimes one child job from a split task can fail, yet the rest of the child jobs continue to run on the cluster wasting CPU time. As soon as one child job dies (assuming there are no plans for attempting a retry), I would like the parent task to kill all the other children, and fail itself. I suppose you could merge the output of any children which did finish... but it would be simpler not to bother.
Right now, yes, this would make sense- I'll see about adding it.
Great.
Ultimately we want to build in a mechanism for retrying child tasks that fail due to cluster errors, etc, so it isn't necessary to rerun the entire job.
That could be helpful - but also rather fiddly in terms of detecting when it is appropriate to retry a job or not. For the split tasks, right now I'm finding some child jobs fail when the OS kills them for running out of RAM - in which case a neat idea would be to further sub-divide the jobs and resubmit. This is probably over-engineering though... KISS principle.

Peter
On Tue, May 1, 2012 at 3:46 PM, Dannon Baker <dannonbaker@me.com> wrote:
On Tue, May 1, 2012 at 3:10 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On May 1, 2012, at 10:03 AM, Dannon Baker <dannonbaker@me.com> wrote:
On May 1, 2012, at 9:51 AM, Peter Cock wrote:
I'm a little confused about tasks.py vs drmaa.py but that TODO comment looks pertinent. Is that the problem here?
The runner in tasks.py is what executes the primary job, splitting and creating the tasks. The tasks themselves are actually injected back into the regular job queue and run as normal jobs with the usual runners (in your case drmaa).
And, yes, it should be fairly straightforward to add, but this just hasn't been implemented yet.
-Dannon
So the stop_job method for the runner in task.py needs to call the stop_job method of each of the child tasks it created for that job (which in this case are drmaa jobs - but could be pbs etc jobs). I'm not really clear how all that works.
Should I open an issue on this?
Peter
I'll take care of it. Thanks for reminding me about the TODO!
Hi Dannon,

Is this any nearer the top of your TODO list? I was reminded by having to log onto our cluster today and issue a bunch of SGE qdel commands to manually kill a job which was hogging the queue, but had been deleted in Galaxy.

Thanks,

Peter
On Tue, May 1, 2012 at 3:46 PM, Dannon Baker <dannonbaker@me.com> wrote:
I'll take care of it. Thanks for reminding me about the TODO!
This seems to have reached galaxy-central now: https://bitbucket.org/galaxy/galaxy-central/changeset/dc20a7b5b6ce

i.e. When Galaxy creates sub-jobs from tools using the <parallelism> tag to split tasks over the cluster, if the user kills the parent job, the child jobs should get killed too.

That will be appreciated next time our cluster is heavily loaded :)

Thanks,

Peter
A suggested change will be coming down the pipe shortly, but it's good to hear that it will be useful!

-Scott

----- Original Message -----
On Tue, May 1, 2012 at 3:46 PM, Dannon Baker <dannonbaker@me.com> wrote:
I'll take care of it. Thanks for reminding me about the TODO!
This seems to have reached galaxy-central now: https://bitbucket.org/galaxy/galaxy-central/changeset/dc20a7b5b6ce
i.e. When Galaxy creates sub-jobs from tools using the <parallelism> tag to split tasks over the cluster, if the user kills the parent job, the child jobs should get killed too.
That will be appreciated next time our cluster is heavily loaded :)
Thanks,
Peter
participants (3)
- Dannon Baker
- Peter Cock
- Scott McManus