[galaxy-dev] Galaxy Hang after DrmCommunicationException

11 Jan 2012

Good morning, Galaxy group!

I was hoping that someone might have some ideas about a problem we have
experienced a handful of times while running Galaxy on our local cluster.

Occasionally we see communication timeouts between our cluster head node
and a compute node. The timeouts clear up on their own, but they leave
Galaxy hung. Below is output from our Galaxy log file. The ERROR does not
happen often, but when it does it consistently seems to hang Galaxy, and we
have to kill the process and restart it. We are running Galaxy as a single
process at this time (we are still just testing it out) and it is running
on our head node (which we plan to move it off of in the future).

galaxy.jobs.runners.drmaa DEBUG 2012-01-10 19:19:58,800 (1654/698075) state
change: job is running
galaxy.jobs.runners.drmaa ERROR 2012-01-10 20:57:47,021 (1654/698075) Unable
to check job status
Traceback (most recent call last):
  File "/data/galaxy-dist/lib/galaxy/jobs/runners/drmaa.py", line 236, in
check_watched_items
    state = self.ds.jobStatus( job_id )
  File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/__init__.py",
line 522, in jobStatus
    _h.c(_w.drmaa_job_ps, jobName, _ct.byref(status))
  File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/helpers.py", line
213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/errors.py", line
90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
DrmCommunicationException: code 2: failed receiving gdi request response for
mid=24442 (got syncron message receive timeout error).
galaxy.jobs.runners.drmaa WARNING 2012-01-10 20:58:05,090 (1654/698075) job
will now be errored
galaxy.jobs.runners.drmaa DEBUG 2012-01-10 20:59:06,396 (1654/698075) User
killed running job, but error encountered removing from DRM queue: code 2:
failed receiving gdi request response for mid=24444 (got syncron message
receive timeout error).
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:06,896 Cleaning up external
metadata files
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:06,947 Failed to cleanup
MetadataTempFile temp files from
database/tmp/metadata_out_HistoryDatasetAssociation_2913_ZUTgBy: No JSON
object could be decoded: line 1 column 0 (char 0)
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:09,640 Cleaning up external
metadata files
galaxy.jobs INFO 2012-01-10 20:59:09,697 job 1656 unable to run: one or more
inputs in error state
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:10,121 Cleaning up external
metadata files
galaxy.jobs INFO 2012-01-10 20:59:10,159 job 1655 unable to run: one or more
inputs in error state
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:12,076 Cleaning up external
metadata files
galaxy.jobs INFO 2012-01-10 20:59:12,126 job 1657 unable to run: one or more
inputs in error state
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:13,601 Cleaning up external
metadata files
galaxy.jobs INFO 2012-01-10 20:59:13,650 job 1658 unable to run: one or more
inputs in error state
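
To make the failure mode in the log concrete: a single communication timeout during the status check is enough for the job to be errored. A minimal sketch of the alternative we are wondering about, retrying the status check a few times before giving up, might look like the following. This is not Galaxy's code. `TransientCommError`, `check_status_with_retry`, and the `retries`/`delay` parameters are hypothetical names for illustration; in the real runner the exception would be drmaa's DrmCommunicationException and the call would be self.ds.jobStatus( job_id ).

```python
import time


class TransientCommError(Exception):
    """Hypothetical stand-in for drmaa's DrmCommunicationException."""


def check_status_with_retry(get_status, job_id, retries=3, delay=0.0):
    """Call get_status(job_id), retrying on TransientCommError.

    Returns the status on success, or None if every attempt failed,
    so the caller can decide whether to error the job.
    """
    for attempt in range(retries):
        try:
            return get_status(job_id)
        except TransientCommError:
            if attempt < retries - 1:
                time.sleep(delay)  # give the DRM a chance to recover
    return None
```

With something like this, a timeout that heals within a few polling attempts would not by itself put the job (and the jobs depending on its outputs) into an error state. Whether that would also avoid the hang is something we do not know.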

Has anyone else experienced this, or does anyone have ideas on how we can
debug further to figure out why Galaxy hangs?
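
One debugging idea we have considered is dumping the stack of every Python thread while the process is hung, to see where each one is blocked. The sketch below uses only the standard library (`sys._current_frames`, `threading.enumerate`, `traceback.format_stack`). The function name `format_thread_stacks` is our own, and how to trigger it inside a running Galaxy process (for example from a signal handler) is an assumption we have not tested.

```python
import sys
import threading
import traceback


def format_thread_stacks():
    """Return a string with the current stack of every Python thread."""
    # Map thread ids to names so the dump is easier to read.
    names = dict((t.ident, t.name) for t in threading.enumerate())
    lines = []
    for thread_id, frame in sys._current_frames().items():
        name = names.get(thread_id, "unknown")
        lines.append("Thread %s (%s):" % (thread_id, name))
        lines.extend(traceback.format_stack(frame))
    return "\n".join(lines)
```

If the main thread is itself blocked inside a C library call, a Python signal handler may not get a chance to run until that call returns, so this may not help in every case.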

Thanks much!

Ann Black-Ziegelbein