galaxy does not get kill signal from drmaa (sge6)
Hi,

This looks like a bug. When a job is terminated for exceeding its wallclock limit, Galaxy assumes the job finished normally and shows it with a green status in the history. This confuses users, because the results are missing or broken.

Report from SGE:

05/22/2016 20:16:26| main|o3|W|job 1643894.1 exceeded hard wallclock time - initiate terminate method
05/22/2016 20:16:28|worker|ooo|W|job 1643894.1 failed on host o3 qmaster enforced h_rt, h_cpu, or h_vmem limit because: job 1643894.1 died through signal KILL (9)

Logs from Galaxy:

galaxy.jobs.runners.drmaa DEBUG 2016-05-22 20:16:29,425 (502/1643894) state change: job finished normally

The job was submitted with the following native parameters:

galaxy.jobs.runners.drmaa DEBUG 2016-05-22 20:11:19,837 (502) native specification is: -cwd -l h_rt=300

Tested on Galaxy 15.11, 16.01, 16.04

-- Nick
It seems likely you are correct and a catch for this hasn't been implemented in Galaxy, or more likely this is a limitation of the DRMAA library for SGE. Ideally the job state would be coming back as FAILED or something like that:

https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/jobs/runners/drm...

If you can debug the drmaa runner and verify that the SGE drmaa library is indeed sending back some signal that the job failed with a memory timeout, and what that status code is, then we can catch it and update the drmaa runner. If not, probably the right place to address this is in the drmaa library (I doubt that is still maintained, but I honestly have no clue).

We have added special logic to the SLURM runner, which extends the drmaa runner and gives more informative errors for various error conditions. An SGE runner that does something similar might be possible:

https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/jobs/runners/slu...

-John

On Sun, May 22, 2016 at 3:24 PM, Nick <mykp@rzg.mpg.de> wrote:
Hi,

This looks like a bug. When a job is terminated for exceeding its wallclock limit, Galaxy assumes the job finished normally and shows it with a green status in the history. This confuses users, because the results are missing or broken.

Report from SGE:

05/22/2016 20:16:26| main|o3|W|job 1643894.1 exceeded hard wallclock time - initiate terminate method
05/22/2016 20:16:28|worker|ooo|W|job 1643894.1 failed on host o3 qmaster enforced h_rt, h_cpu, or h_vmem limit because: job 1643894.1 died through signal KILL (9)

Logs from Galaxy:

galaxy.jobs.runners.drmaa DEBUG 2016-05-22 20:16:29,425 (502/1643894) state change: job finished normally

The job was submitted with the following native parameters:

galaxy.jobs.runners.drmaa DEBUG 2016-05-22 20:11:19,837 (502) native specification is: -cwd -l h_rt=300

Tested on Galaxy 15.11, 16.01, 16.04
-- Nick
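The check John describes, looking at what the DRMAA library reports when a job ends, could be sketched roughly as follows. This is only an illustration and not Galaxy's drmaa runner code: `classify` is a hypothetical helper, and the `JobInfo` tuple is a local stand-in that assumes the same field names as the tuple drmaa-python returns from `Session.wait()`, so the example runs without a cluster. The sample values are invented.

```python
from collections import namedtuple

# Local stand-in (assumption): mirrors the field names of the JobInfo tuple
# that drmaa-python returns from Session.wait(), so no cluster is needed here.
JobInfo = namedtuple(
    "JobInfo",
    "jobId hasExited hasSignal terminatedSignal hasCoreDump "
    "wasAborted exitStatus resourceUsage",
)


def classify(info):
    """Hypothetical helper: map a JobInfo-like tuple to (state, reason)."""
    if info.wasAborted:
        # The job was aborted and never ran.
        return ("failed", "job was aborted and never ran")
    if info.hasSignal:
        # The job was terminated by a signal, e.g. a scheduler-enforced kill.
        return ("failed", "killed by signal %s" % info.terminatedSignal)
    if info.hasExited and info.exitStatus != 0:
        return ("failed", "exit status %d" % info.exitStatus)
    if info.hasExited:
        return ("ok", "exit status 0")
    return ("unknown", "no exit information reported")


# Illustrative values only, not taken from a real SGE session.
killed = JobInfo("1643894", False, True, "SIGKILL", False, False, 0, {})
clean = JobInfo("1643895", True, False, "", False, False, 0, {})
print(classify(killed))
print(classify(clean))
```

On a real cluster the tuple would come from something like `drmaa.Session().wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)`. Whether the SGE DRMAA library actually sets `hasSignal` for a job killed by the `h_rt` limit is exactly the open question in this thread, so the sketch only shows what a runner could do if that information is available.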