We're having a slight issue with some of the jobs our Galaxy instance is submitting to the cluster via the DRMAA runner. Our cluster admin thinks it's not Galaxy's fault, but the problem results in some odd behavior in Galaxy, as follows.

From galaxy.log:

galaxy.jobs.runners.drmaa INFO 2014-03-25 12:17:46,950 (1667/5806401) job left DRM queue with following message: code 1: slurm_load_jobs error: Socket timed out on send/recv operation,job_id: 5806401
galaxy.jobs.runners ERROR 2014-03-25 12:17:50,272 (1667/5806401) Job output not returned from cluster: [Errno 2] No such file or directory: '/galaxy/database/job_working_directory/001/1667/galaxy_1667.o'
galaxy.jobs.output_checker INFO 2014-03-25 12:17:50,379 Job 1667: Log: tool progress
galaxy.jobs.output_checker INFO 2014-03-25 12:17:50,379 Job 1667: Log: tool progress
galaxy.jobs DEBUG 2014-03-25 12:17:51,123 job 1667 ended

1. As best we can tell, for some reason the socket Galaxy's using to keep track of the job times out (this is the part we don't think is Galaxy's fault). The job continues to run on the cluster, but:

2. When the socket times out, Galaxy checks to see if the redirected stdout file exists, but since the job's still running, it doesn't exist, and so it throws an error...

3. ... and eventually Galaxy just gives up on the job entirely.

This seems strange, considering Galaxy so gracefully recovers jobs when it's restarted; I admit I'm not familiar with the mechanics of tracking a job, but would a few retries spaced out over a few minutes be appropriate here, if that's possible?

Failing that, is there a way to manually recover it, or could such a way be added? This job will probably run for days, and it would be nice not to have to manually collect the outputs and send them back to the user!
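To make the retry idea a bit more concrete, here is a rough sketch of the kind of behavior I have in mind. It is not Galaxy's DRMAA runner code, and I don't know how the runner is actually structured: `poll_with_retries`, `check_state`, and `TransientDRMError` are hypothetical names invented for illustration, and the retry count and delay are guesses.

```python
import time


class TransientDRMError(Exception):
    """Hypothetical: a status query failed, but the job may still be running."""


def poll_with_retries(check_state, job_id, retries=5, delay=60.0, sleep=time.sleep):
    """Query a job's state, retrying on transient errors before giving up.

    check_state is a hypothetical callable that returns a state string, or
    raises TransientDRMError when the query itself fails (e.g. a socket
    timeout), as opposed to the job having actually finished or failed.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return check_state(job_id)
        except TransientDRMError as exc:
            last_error = exc
            # Wait before the next attempt, but not after the final one.
            if attempt < retries:
                sleep(delay)
    # Every attempt failed; only now treat the job as lost.
    raise last_error
```

The point is just that a failed status query and a finished job are treated differently: the job would only be marked as lost after several consecutive failed queries, rather than on the first timeout.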
What ends up happening in the Galaxy interface is that the history items turn green as if they completed successfully, but of course they're completely empty, which was confusing for the user who ran into this problem.

Thanks!

--
Brian Claywell, Systems Analyst/Programmer
Fred Hutchinson Cancer Research Center
bclaywel@fhcrc.org