We're having a slight issue with some of the jobs our Galaxy instance
is submitting to the cluster via the DRMAA runner. Our cluster admin
thinks it's not Galaxy's fault, but the problem results in some odd
behavior in Galaxy, as follows. From galaxy.log:

galaxy.jobs.runners.drmaa INFO 2014-03-25 12:17:46,950 (1667/5806401) job left DRM queue with following message: code 1: slurm_load_jobs error: Socket timed out on send/recv operation,job_id: 5806401
galaxy.jobs.runners ERROR 2014-03-25 12:17:50,272 (1667/5806401) Job output not returned from cluster: [Errno 2] No such file or directory: '/galaxy/database/job_working_directory/001/1667/galaxy_1667.o'
galaxy.jobs.output_checker INFO 2014-03-25 12:17:50,379 Job 1667: Log: tool progress
galaxy.jobs.output_checker INFO 2014-03-25 12:17:50,379 Job 1667: Log: tool progress
galaxy.jobs DEBUG 2014-03-25 12:17:51,123 job 1667 ended

1. As best we can tell, for some reason the socket Galaxy's using to
keep track of the job times out (this is the part we don't think is
Galaxy's fault). The job continues to run on the cluster, but:
2. When the socket times out, Galaxy checks to see if the redirected
stdout file exists, but since the job's still running, it doesn't
exist, and so it throws an error...
3. ... and eventually Galaxy just gives up on the job entirely.
This seems strange, considering how gracefully Galaxy recovers jobs
when it's restarted. I admit I'm not familiar with the mechanics of
tracking a job, but would a few retries spaced out over a few minutes
be appropriate here, if that's possible? Failing that, is there a way
to manually recover it, or could such a way be added? This job will
probably run for days, and it would be nice not to have to manually
collect the outputs and send them back to the user!
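
To make the retry idea concrete, here is a rough sketch of what I have in mind. It is not Galaxy's actual runner code: `StatusQueryError`, `check_with_retries`, and the `query_status` callable are hypothetical names I made up for illustration, and the retry count and delay are arbitrary. The point is only that a failed status query would be retried a few times, and if it still fails the job is reported as "unknown" rather than treated as finished.

```python
import time


class StatusQueryError(Exception):
    """Hypothetical error: the scheduler could not be reached for a status check."""


def check_with_retries(query_status, job_id, retries=5, delay=60.0, sleep=time.sleep):
    """Return the job state, retrying when the status query itself fails.

    query_status is a hypothetical callable that returns a state string
    or raises StatusQueryError (for example on a socket timeout).
    If every attempt fails, return "unknown" so the caller can keep
    watching the job instead of assuming it has left the queue.
    """
    for attempt in range(1, retries + 1):
        try:
            return query_status(job_id)
        except StatusQueryError:
            # Wait before the next attempt, but not after the last one.
            if attempt < retries:
                sleep(delay)
    return "unknown"
```

With the defaults above, a query that keeps timing out would be retried over roughly four minutes before the job is flagged as "unknown", and the output files would only be collected once the scheduler actually reports the job as done.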
What ends up happening in the Galaxy interface is that the history
items turn green as if they completed successfully, but of course
they're completely empty, which confused the user who ran into this
problem.
Thanks!
--
Brian Claywell, Systems Analyst/Programmer
Fred Hutchinson Cancer Research Center
bclaywel(a)fhcrc.org