Hi,
I'm running a fork of galaxy-central latest_2014.08.11. The instance is configured to run jobs on a SLURM cluster. The problem is that the SLURM controller sometimes becomes too busy, which results in errors like:
galaxy.jobs.runners.drmaa INFO 2014-10-23 21:10:47,768 (1813/22896754) job left DRM queue with following message: code 1: slurm_load_jobs error: Socket timed out on send/recv operation,job_id: 22896754
This causes Galaxy to assume that the job has failed:
galaxy.jobs.runners ERROR 2014-10-23 21:10:47,881 (1813/22896754) Job output not returned from cluster: [Errno 2] No such file or directory: '/n/regal/stemcellcommons/galaxy-stage/job_working_directory/001/1813/galaxy_1813.o'
This happens with both galaxy.jobs.runners.drmaa:DRMAAJobRunner and galaxy.jobs.runners.slurm:SlurmJobRunner. Is there any way to handle this condition in Galaxy?
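To make the question concrete, the behaviour I'm hoping for is roughly this: treat the socket timeout as a transient condition and re-check the job state a few times before declaring the job failed. Below is a minimal sketch of that idea only. The names check_state, TRANSIENT_MARKERS, and check_with_retries, and the retry parameters, are hypothetical and are not part of Galaxy or the drmaa library.

```python
import time

# Hypothetical: substrings that mark a DRM error as transient rather than fatal.
TRANSIENT_MARKERS = ("Socket timed out on send/recv operation",)


def is_transient(error):
    """Return True if the error message looks like a temporary controller problem."""
    message = str(error)
    return any(marker in message for marker in TRANSIENT_MARKERS)


def check_with_retries(check_state, job_id, max_attempts=5, delay=30.0,
                       sleep=time.sleep):
    """Call check_state(job_id), retrying only on transient errors.

    check_state is a hypothetical callable standing in for whatever the
    runner uses to query the DRM; it is not a real Galaxy or drmaa function.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return check_state(job_id)
        except Exception as error:
            # Re-raise non-transient errors, and transient ones on the last attempt.
            if not is_transient(error) or attempt == max_attempts:
                raise
            # Wait before asking the controller again.
            sleep(delay)
```

With something like this, the job would only be marked as failed if the controller stays unreachable for every attempt, or if it reports an error that does not match a known transient message.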
Thanks, Ilya
galaxy-dev@lists.galaxyproject.org