Galaxy dropping jobs?

31 Oct 2013

Hi,

I have run into a strange problem with job handling: Galaxy is running a
long job on a cluster (expected to take more than 24 hours). About 16.5
hours after the job started, Galaxy lost its connection to SLURM on the
cluster and logged the following message:

[root@galaxy-prod01 galaxy-dist]# grep 3715200 paster.log
galaxy.jobs.runners.drmaa INFO 2013-10-30 10:51:54,149 (555) queued as 3715200
galaxy.jobs.runners.drmaa DEBUG 2013-10-30 10:51:55,149 (555/3715200) state change: job is queued and active
galaxy.jobs.runners.drmaa DEBUG 2013-10-30 10:52:13,516 (555/3715200) state change: job is running
galaxy.jobs.runners.drmaa INFO 2013-10-31 03:29:33,090 (555/3715200) job left DRM queue with following message: code 1: slurm_load_jobs error: Unable to contact slurm controller (connect failure),job_id: 3715200
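
For reference, the time between "job is running" and "job left DRM queue" can be read directly from the log timestamps. The snippet below is only a minimal sketch: the helper name parse_ts is hypothetical, and it assumes the timestamp is always the third and fourth whitespace-separated field of a paster.log line, as in the excerpt above.

```python
from datetime import datetime

def parse_ts(line):
    # Hypothetical helper: assumes the layout "<logger> <LEVEL> <date> <time,ms> ..."
    # seen in the paster.log excerpt above.
    parts = line.split()
    return datetime.strptime(parts[2] + " " + parts[3], "%Y-%m-%d %H:%M:%S,%f")

# The two relevant lines, copied from the log excerpt.
running = ("galaxy.jobs.runners.drmaa DEBUG 2013-10-30 10:52:13,516 "
           "(555/3715200) state change: job is running")
lost = ("galaxy.jobs.runners.drmaa INFO 2013-10-31 03:29:33,090 "
        "(555/3715200) job left DRM queue with following message: code 1: "
        "slurm_load_jobs error: Unable to contact slurm controller "
        "(connect failure),job_id: 3715200")

# Time the job had been running when Galaxy stopped tracking it.
elapsed = parse_ts(lost) - parse_ts(running)
print(elapsed)
```

This gives a little over 16.5 hours between the job starting and Galaxy dropping it.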

Is there a timeout in Galaxy for contacting SLURM? Galaxy reports that the
job has left the queue, yet the job is still running properly on the
cluster.

Thanks for any help; this is really urgent :)

Nikolay

-- 
Nikolay Vazov, PhD
Research Computing Centre - http://hpc.uio.no
USIT, University of Oslo
