On Jan 11, 2012, at 12:18 PM, Ann Black wrote:
Good morning, Galaxy group!
I was hoping someone might have ideas on a problem we have hit a handful of times running Galaxy on our local cluster.
Occasionally we see a communication timeout between our cluster head node and a compute node. The timeout itself self-heals, but it hangs Galaxy in the process. Below is output from our Galaxy log file. When the ERROR happens (which is not often) it consistently seems to hang Galaxy, and we have to kill it off and restart it. We are running Galaxy as a single PID at this time (we are still just testing it out), and it is running on our head node (which we plan to move off of in the future).
galaxy.jobs.runners.drmaa DEBUG 2012-01-10 19:19:58,800 (1654/698075) state change: job is running
galaxy.jobs.runners.drmaa ERROR 2012-01-10 20:57:47,021 (1654/698075) Unable to check job status
Traceback (most recent call last):
  File "/data/galaxy-dist/lib/galaxy/jobs/runners/drmaa.py", line 236, in check_watched_items
    state = self.ds.jobStatus( job_id )
  File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/__init__.py", line 522, in jobStatus
    _h.c(_w.drmaa_job_ps, jobName, _ct.byref(status))
  File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
DrmCommunicationException: code 2: failed receiving gdi request response for mid=24442 (got syncron message receive timeout error).
galaxy.jobs.runners.drmaa WARNING 2012-01-10 20:58:05,090 (1654/698075) job will now be errored
galaxy.jobs.runners.drmaa DEBUG 2012-01-10 20:59:06,396 (1654/698075) User killed running job, but error encountered removing from DRM queue: code 2: failed receiving gdi request response for mid=24444 (got syncron message receive timeout error).
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:06,896 Cleaning up external metadata files
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:06,947 Failed to cleanup MetadataTempFile temp files from database/tmp/metadata_out_HistoryDatasetAssociation_2913_ZUTgBy: No JSON object could be decoded: line 1 column 0 (char 0)
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:09,640 Cleaning up external metadata files
galaxy.jobs INFO 2012-01-10 20:59:09,697 job 1656 unable to run: one or more inputs in error state
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:10,121 Cleaning up external metadata files
galaxy.jobs INFO 2012-01-10 20:59:10,159 job 1655 unable to run: one or more inputs in error state
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:12,076 Cleaning up external metadata files
galaxy.jobs INFO 2012-01-10 20:59:12,126 job 1657 unable to run: one or more inputs in error state
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:13,601 Cleaning up external metadata files
galaxy.jobs INFO 2012-01-10 20:59:13,650 job 1658 unable to run: one or more inputs in error state
Has anyone else experienced this, or does anyone have ideas on how we can debug further to figure out why Galaxy hangs?
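In case it helps anyone reproduce this outside of Galaxy: judging from the traceback, the call that times out boils down to the drmaa-python jobStatus poll. A stripped-down sketch of that poll (assuming the same SGE DRMAA environment and the drmaa 0.4b3 egg; the job ID is just a placeholder) would look roughly like this:

    # Minimal sketch of the status poll that raised the timeout above.
    # '698075' is a placeholder job ID; assumes the DRMAA library is configured.
    import drmaa
    from drmaa.errors import DrmCommunicationException

    s = drmaa.Session()
    s.initialize()
    try:
        print 'job state:', s.jobStatus( '698075' )
    except DrmCommunicationException, e:
        # the transient qmaster/gdi timeout seen in the log above
        print 'communication error:', e
    finally:
        s.exit()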
Hi Ann,
The cause of the exception aside, this should be caught by the except block below it in
drmaa.py (in check_watched_items()):
    except Exception, e:
        # so we don't kill the monitor thread
        log.exception( "(%s/%s) Unable to check job status" % ( galaxy_job_id, job_id ) )
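Purely as an illustration (this is not the code that is in Galaxy): if that block is being reached but you would rather have the transient DRM communication error retried than the job errored, a small helper along these lines could wrap the poll. It assumes DrmCommunicationException is importable from the egg's errors module, as the traceback suggests.

    # Rough sketch only -- retry the DRMAA status poll a few times on
    # transient communication errors before giving up.
    import time
    from drmaa.errors import DrmCommunicationException

    def safe_job_status( ds, job_id, retries=3, delay=5 ):
        # Poll the DRMAA session for job state, tolerating transient qmaster
        # timeouts like "code 2: failed receiving gdi request response".
        for attempt in range( retries ):
            try:
                return ds.jobStatus( job_id )
            except DrmCommunicationException, e:
                if attempt == retries - 1:
                    raise
                time.sleep( delay )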
What changeset are you running?
--nate
Thanks much!
Ann Black-Ziegelbein