Galaxy not canceling cluster jobs
If we cancel a job within Galaxy, it does not seem to cancel the cluster job on the back end. For example, if I run the "Map with BWA" tool, immediately decide I made a mistake, and want to run it with different options, I cancel the job in Galaxy. But if I run "qstat" on our cluster, I see that the job is actually still running, and the pbs_server log shows no attempt by the galaxy user to qdel the job. Is this expected behavior?

I noticed there is a Python function called "stop_job" in lib/galaxy/jobs/runners/pbs.py, so it seems Galaxy should be calling this if a user cancels a job in Galaxy.

I've also noticed that there is a potential problem in "stop_job". After doing a little testing, I've found that this call to pbs_deljob requires the galaxy user to have "manager" privileges in TORQUE:

pbs.pbs_deljob( c, str( job.job_runner_external_id ), 'NULL' )

If you call pbs_deljob this way as a non-manager, you'll see a message to the effect of "the -m option requires manager privileges". It seems the intention of passing 'NULL' to pbs_deljob as the message parameter is to not set a message, but the string 'NULL' is itself treated as a message. Manager privileges should not be needed, since Galaxy is only deleting its own cluster jobs. Changing the code to this allows it to run successfully as a non-manager:

pbs.pbs_deljob( c, str( job.job_runner_external_id ), "" )

-- Glen L. Beane Software Engineer The Jackson Laboratory Phone (207) 288-6153
Glen Beane wrote:
If we cancel a job within Galaxy, it does not seem to cancel the cluster job on the back end. For example, if I run the "Map with BWA" tool, immediately decide I made a mistake, and want to run it with different options, I cancel the job in Galaxy. But if I run "qstat" on our cluster, I see that the job is actually still running, and the pbs_server log shows no attempt by the galaxy user to qdel the job. Is this expected behavior?
I noticed there is a python function called "stop_job" in lib/galaxy/jobs/runners/pbs.py, so it seems Galaxy should be calling this if a user cancels a job in Galaxy.
I've also noticed that there is a potential problem in "stop_job". After doing a little testing, I've found that this call to pbs_deljob requires the galaxy user to have "manager" privileges in TORQUE:
pbs.pbs_deljob( c, str( job.job_runner_external_id ), 'NULL' )
If you call pbs_deljob this way as a non-manager, you'll see a message to the effect of "the -m option requires manager privileges". It seems the intention of passing 'NULL' to pbs_deljob as the message parameter is to not set a message, but the string 'NULL' is itself treated as a message. Manager privileges should not be needed, since Galaxy is only deleting its own cluster jobs. Changing the code to this allows it to run successfully as a non-manager:
pbs.pbs_deljob( c, str( job.job_runner_external_id ), "")
Hi Glen,

We've seen cases where the job is not terminated. Of course, in my own testing it has always worked, but I am a PBS manager, so this finally explains why it hasn't been reproducible for me. Thanks much for looking into this, I'll commit it ASAP.

--nate
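
For readers following along, here is a minimal sketch of the difference between the two calls discussed above. It does not use the real pbs_python bindings: FakePBS is a hypothetical stand-in that only mimics the behavior reported in this thread (a non-empty message is refused for non-managers), and stop_cluster_job is an illustrative helper name, not Galaxy's actual stop_job. The zero-on-success return value is likewise an assumption of the sketch.

```python
class FakePBS:
    """Hypothetical stand-in for the pbs module, for illustration only."""

    def __init__(self, is_manager):
        self.is_manager = is_manager
        self.deleted = []  # job ids that were accepted for deletion

    def pbs_deljob(self, conn, job_id, extend):
        # Mimics the behavior reported in this thread: any non-empty
        # message string is refused unless the caller is a manager.
        if extend and not self.is_manager:
            return 1  # assumed convention: nonzero means failure
        self.deleted.append(job_id)
        return 0


def stop_cluster_job(pbs, conn, external_id):
    """Delete a cluster job, passing an empty string so no message is set."""
    rc = pbs.pbs_deljob(conn, str(external_id), "")
    return rc == 0


if __name__ == "__main__":
    pbs = FakePBS(is_manager=False)
    # The literal string 'NULL' counts as a message and is refused.
    print(pbs.pbs_deljob(None, "12345", "NULL") == 0)
    # The empty string sets no message, so a non-manager can delete.
    print(stop_cluster_job(pbs, None, 12345))
```

With the real bindings, the same idea would apply to the connection handle and external job id that stop_job already has; only the third argument changes.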
_______________________________________________ galaxy-dev mailing list galaxy-dev@lists.bx.psu.edu http://lists.bx.psu.edu/listinfo/galaxy-dev
Participants (2): Glen Beane, Nate Coraor