SLURM timeouts

24 Oct 2014

Hi,

I'm running a fork of galaxy-central latest_2014.08.11. The instance is
configured to run jobs on a SLURM cluster. The problem is that the SLURM
controller sometimes becomes too busy, which results in errors like:

galaxy.jobs.runners.drmaa INFO 2014-10-23 21:10:47,768 (1813/22896754) job
left DRM queue with following message: code 1: slurm_load_jobs error:
Socket timed out on send/recv operation,job_id: 22896754

This causes Galaxy to assume that the job has failed:

galaxy.jobs.runners ERROR 2014-10-23 21:10:47,881 (1813/22896754) Job
output not returned from cluster: [Errno 2] No such file or directory:
'/n/regal/stemcellcommons/galaxy-stage/job_working_directory/001/1813/galaxy_1813.o'

This happens with both galaxy.jobs.runners.drmaa:DRMAAJobRunner and
galaxy.jobs.runners.slurm:SlurmJobRunner. Is there any way to handle this
condition in Galaxy?
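
To illustrate the kind of handling I have in mind, here is a minimal sketch.
It is not Galaxy code and does not use the real runner API: the names
poll_with_retry, is_transient, TransientStatusError and the query_status
callable are all hypothetical, and the only thing taken from the logs above
is the "Socket timed out" message text. The idea is to retry the status
query when the error looks like a controller timeout, instead of treating
the job as finished.

```python
import time

# Message fragment copied from the log above; treating it as a sign of a
# transient controller problem is an assumption.
TRANSIENT_MARKERS = ("Socket timed out on send/recv operation",)


class TransientStatusError(Exception):
    """Hypothetical: the status query kept failing with a transient error."""


def is_transient(message):
    """Return True if the error message matches a known transient marker."""
    return any(marker in message for marker in TRANSIENT_MARKERS)


def poll_with_retry(query_status, retries=3, delay=5.0, sleep=time.sleep):
    """Call query_status(), retrying only on transient errors.

    query_status is a hypothetical stand-in for whatever the runner uses to
    ask the DRM for a job state. It returns a state or raises an exception.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return query_status()
        except Exception as exc:
            if not is_transient(str(exc)):
                # Unknown failure: let the caller handle it as before.
                raise
            last_error = exc
            if attempt < retries:
                # Back off a little longer after each failed attempt.
                sleep(delay * (attempt + 1))
    # Still failing after all retries: report that the state is unknown
    # rather than claiming the job has failed.
    raise TransientStatusError(str(last_error))
```

With something like this, a caller could keep the job in its current state
when TransientStatusError is raised and check again on the next monitoring
cycle, rather than going on to collect output files that do not exist yet.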

Thanks,
Ilya

Sytchev, Ilya
