job status when SGE kills/aborts job

29 Jul 2011

      We are using SGE cluster with our galaxy install. We have specified resource and run-time limits for certain tools using tool specific drmaa URL configuration, e.g.:  
- run-time (h_rt, s_rt) 
- memory (vf, h_vmem).

This helps scheduler in submitting jobs to an appropriate node and also prevent node from crashing because of excessive memory consumption. However, sometimes a job needs more resources and/or run-time than specified in the drmaa URL configuration. In such cases SGE kills particular job and we get email notification with appropriate job summary. However, the galaxy web interface doesn't show any error for such failures. The job table doesn't contain any related state/info as well. The jobs are shown in green-boxes meaning they completed without any failure. In reality these jobs have been killed/aborted by the scheduler. This is really confusing as there is inconsistency between job status indicated by the galaxy and SGE/drmaa. Has anyone else experienced and/or addressed this issue? Any comments or suggestions will be really helpful. 

Thanks,
Shantanu.

Shantanu Pavgi

Ka Ming Nip

ambarish biswas

Shantanu Pavgi

Peter Cock

Chris Fields

Shantanu Pavgi

tags

participants (5)