On 9/12/13 10:35 AM, "Peter Cock" firstname.lastname@example.org wrote:
On Thu, Sep 12, 2013 at 2:01 PM, Mathieu Bahin email@example.com wrote:
We have been developing our own Galaxy instance for a while now. Jobs are sent to a cluster managed through SGE for execution. Usually, communication between SGE and DRMAA is fine and we don't have any problems with it.
When a user deletes a job, most of the time the job disappears, but sometimes, for reasons we don't understand, the job stays with the status 'dr' in SGE. If we don't kill it 'manually', it stays forever. It is not always the same tool that produces this error. Do you have any idea why this happens or how to manage it?
I have noticed a problem with our DRMAA/SGE setup where a user can cancel a large job (using the job splitter in at least some cases), but Galaxy does not seem to cancel the jobs on the cluster. I've not tried to diagnose this yet, but it could be a similar issue.
Also, in our DRMAA/LSF setup (using a fork of the latest galaxy-dist), jobs generated by the current workflow step continue running on the cluster after the history is deleted.
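For the jobs stuck in the 'dr' state described above, one way to at least spot them automatically is to scan the qstat listing for a state containing 'd' (deletion pending). The following is only a rough sketch, not something taken from Galaxy: the function name and the sample listing are made up for illustration, and it assumes the default plain-text qstat layout with the state in the fifth column and job names without spaces.

```python
def find_stuck_deleting_jobs(qstat_output):
    """Return job IDs whose state column contains 'd' (deletion pending).

    Assumes the default plain-text `qstat` layout: job-ID, prior, name,
    user, state, ... with no whitespace inside the job name.
    """
    stuck = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric job ID; skip header/separator lines.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        job_id, state = fields[0], fields[4]
        if "d" in state:
            stuck.append(job_id)
    return stuck


# Hypothetical listing, shaped like default `qstat` output.
SAMPLE = """\
job-ID  prior   name       user     state submit/start at     queue         slots ja-task-ID
--------------------------------------------------------------------------------------------
   1201 0.55500 g_42_tool  galaxy   r     09/12/2013 10:01:12 all.q@node01      1
   1202 0.55500 g_43_tool  galaxy   dr    09/12/2013 10:02:40 all.q@node02      1
   1203 0.55500 g_44_tool  galaxy   qw    09/12/2013 10:05:03                   1
"""

if __name__ == "__main__":
    print(find_stuck_deleting_jobs(SAMPLE))
```

In practice the listing would come from running qstat itself (for example through subprocess), and the IDs returned could then be passed to a forced delete such as `qdel -f`, which may need manager rights or a scheduler setting depending on the site. This only cleans up the symptom; it does not explain why the jobs get stuck.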