I am also using an SGE cluster and the DRMAA runner for my Galaxy install, and I am
seeing the same issue for jobs that were killed.
How did you define the run-time and memory limits in your DRMAA URLs?
I had to add "-w n" to my DRMAA URLs in order for my jobs to be dispatched to
the cluster. However, someone mentioned on another thread that doing so might hide
errors. I am not sure whether this is the cause, since my jobs won't be dispatched at
all without "-w n" in the DRMAA URLs.
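For reference, this is roughly what such a DRMAA URL looks like in my tool runner
configuration. The "-w n" option tells SGE to skip its submission-time validation of
whether the request can ever be scheduled; the tool name and resource values below are
just placeholders, not my actual settings:

```
[galaxy:tool_runners]
# Hypothetical example: "-w n" disables SGE's consistency check at submit time.
# Without it, jobs requesting these resources were rejected outright on my setup.
some_tool = drmaa://-w n -l h_rt=04:00:00,h_vmem=4G/
```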
From: galaxy-dev-bounces(a)lists.bx.psu.edu [galaxy-dev-bounces(a)lists.bx.psu.edu] On Behalf
Of Shantanu Pavgi [pavgi(a)uab.edu]
Sent: July 29, 2011 1:56 PM
To: galaxydev psu
Subject: [galaxy-dev] job status when SGE kills/aborts job
We are using an SGE cluster with our Galaxy install. We have specified resource and
run-time limits for certain tools using tool-specific DRMAA URL configuration, e.g.:
- run-time (h_rt, s_rt)
- memory (vf, h_vmem).
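As a concrete illustration of the kind of configuration we mean — a sketch only, with
a hypothetical tool name and limit values, not our production settings:

```
[galaxy:tool_runners]
# Hypothetical example: cap this tool at 8 hours of wall time and 8G of
# virtual memory per job via the SGE native specification in the DRMAA URL.
bowtie_wrapper = drmaa://-l h_rt=08:00:00,h_vmem=8G/
```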
This helps the scheduler dispatch jobs to an appropriate node and also prevents a node
from crashing because of excessive memory consumption. However, sometimes a job needs
more resources and/or run-time than specified in the DRMAA URL configuration. In such
cases SGE kills the job and we get an email notification with an appropriate job
summary. However, the Galaxy web interface doesn't show any error for such failures,
and the job table doesn't contain any related state/info either. The jobs are shown in
green boxes, meaning they completed without any failure, when in reality they have been
killed/aborted by the scheduler. This is really confusing, as the job status indicated
by Galaxy is inconsistent with the one reported by SGE/DRMAA. Has anyone else
experienced and/or addressed this issue? Any comments or suggestions would be really
helpful.
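One way to confirm on the SGE side that a job was killed by the scheduler rather than
finishing normally is to inspect its accounting record. A sketch, assuming you know the
SGE job ID (12345 below is a placeholder):

```
# Show the accounting record for a finished job (job ID is hypothetical).
# A job killed for exceeding h_vmem or h_rt typically shows a non-zero
# "failed" field and/or an exit_status of 137 (128 + SIGKILL).
qacct -j 12345
```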