Thanks so much! I have applied the fix to our env.
In looking over the logic, there was an existing exception block that
would have caught the communication exception generically, but it moved
the job (in this scenario erroneously) into a failure workflow. I would
like to understand what ended up hanging galaxy; it must be related to a
valid job being moved into the failure state? Did you follow it down the
rabbit hole by any chance to see what caused the hang in your env?
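For reference, the generic catch-all we seem to be hitting looks roughly
like this (a sketch pieced together from the "Unable to check job status"
and "job will now be errored" lines in our log, so the exact wording and
variable names are guesses, not our actual code):

# Inside check_watched_items()'s loop over self.watched: a transient
# DrmCommunicationException falls past the specific handlers into this
# catch-all, so a job that is still running fine on the cluster gets
# pushed into the 'fail' workflow.
for drm_job_state in self.watched:
    job_id = drm_job_state.job_id
    galaxy_job_id = drm_job_state.job_wrapper.job_id
    try:
        state = self.ds.jobStatus( job_id )
    except Exception, e:
        log.exception( "(%s/%s) Unable to check job status" % ( galaxy_job_id, job_id ) )
        log.warning( "(%s/%s) job will now be errored" % ( galaxy_job_id, job_id ) )
        drm_job_state.fail_message = "Cluster could not complete job"
        self.work_queue.put( ( 'fail', drm_job_state ) )
        continue
    # ... normal state-change handling continues here ...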
Thanks again,
Ann
From: Edward Kirton <eskirton(a)lbl.gov>
Date: Thu, 12 Jan 2012 13:00:27 -0800
To: Ann Black <annblack(a)eng.uiowa.edu>
Cc: "galaxy-dev(a)lists.bx.psu.edu" <galaxy-dev(a)lists.bx.psu.edu>
Subject: Re: [galaxy-dev] Galaxy Hang after DrmCommunicationException
sometimes the scheduler can't keep up with all the work in its 15sec cycle,
so it doesn't respond to some messages. here's a fix i've been trying that
seems to work.
in lib/galaxy/jobs/runners/drmaa.py:
def check_watched_items( self ):
    """
    Called by the monitor thread to look at each watched job and deal
    with state changes.
    """
    new_watched = []
    for drm_job_state in self.watched:
        job_id = drm_job_state.job_id
        galaxy_job_id = drm_job_state.job_wrapper.job_id
        old_state = drm_job_state.old_state
        try:
            state = self.ds.jobStatus( job_id )
        # InternalException was reported to be necessary on some DRMs, but
        # this could cause failures to be detected as completion!  Please
        # report if you experience problems with this.
        except ( drmaa.InvalidJobException, drmaa.InternalException ), e:
            # we should only get here if an orphaned job was put into
            # the queue at app startup
            log.debug( "(%s/%s) job left DRM queue with following message: %s" % ( galaxy_job_id, job_id, e ) )
            self.work_queue.put( ( 'finish', drm_job_state ) )
            continue
        # BEGIN DRM TIMEOUT: Don't fail on scheduler communication error (probably just too busy)
        except ( drmaa.DrmCommunicationException ), e:
            log.warning( "(%s/%s) DRM Communication Exception" % ( galaxy_job_id, job_id ) )
            continue
        # END DRM TIMEOUT
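one more thing worth checking in your copy: depending on what the rest of
the loop does, you may want the job to stay on the watch list when the
communication error happens, so it gets polled again on the next cycle
instead of being dropped. something along these lines (an untested
variation on the block above, adjust to your revision of the file):

        # BEGIN DRM TIMEOUT: Don't fail on scheduler communication error (probably just too busy)
        except ( drmaa.DrmCommunicationException ), e:
            log.warning( "(%s/%s) DRM Communication Exception, will retry next cycle" % ( galaxy_job_id, job_id ) )
            # keep watching this job so it is re-polled on the next monitor pass
            new_watched.append( drm_job_state )
            continue
        # END DRM TIMEOUT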
On Wed, Jan 11, 2012 at 9:18 AM, Ann Black <annblack(a)eng.uiowa.edu> wrote:
Good Morning galaxy group!
I was hoping that someone might have some ideas on a problem we have
experienced a handful of times running galaxy on our local cluster.
Occasionally we hit communication timeouts between our cluster head node
and a compute node; these self-heal, but they in turn hang galaxy. Below
is output from our galaxy log file. When the ERROR happens (which is not
often) it consistently seems to hang galaxy, and we have to kill it off
and restart it. We are running galaxy as a single PID at this time (we
are still just testing it out, etc.) and it is running on our head node
(which we plan to move off of in the future).
galaxy.jobs.runners.drmaa DEBUG 2012-01-10 19:19:58,800 (1654/698075) state change: job is running
galaxy.jobs.runners.drmaa ERROR 2012-01-10 20:57:47,021 (1654/698075) Unable to check job status
Traceback (most recent call last):
  File "/data/galaxy-dist/lib/galaxy/jobs/runners/drmaa.py", line 236, in check_watched_items
    state = self.ds.jobStatus( job_id )
  File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/__init__.py", line 522, in jobStatus
    _h.c(_w.drmaa_job_ps, jobName, _ct.byref(status))
  File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
DrmCommunicationException: code 2: failed receiving gdi request response for mid=24442 (got syncron message receive timeout error).
galaxy.jobs.runners.drmaa WARNING 2012-01-10 20:58:05,090 (1654/698075) job will now be errored
galaxy.jobs.runners.drmaa DEBUG 2012-01-10 20:59:06,396 (1654/698075) User killed running job, but error encountered removing from DRM queue: code 2: failed receiving gdi request response for mid=24444 (got syncron message receive timeout error).
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:06,896 Cleaning up external metadata files
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:06,947 Failed to cleanup MetadataTempFile temp files from database/tmp/metadata_out_HistoryDatasetAssociation_2913_ZUTgBy: No JSON object could be decoded: line 1 column 0 (char 0)
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:09,640 Cleaning up external metadata files
galaxy.jobs INFO 2012-01-10 20:59:09,697 job 1656 unable to run: one or more inputs in error state
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:10,121 Cleaning up external metadata files
galaxy.jobs INFO 2012-01-10 20:59:10,159 job 1655 unable to run: one or more inputs in error state
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:12,076 Cleaning up external metadata files
galaxy.jobs INFO 2012-01-10 20:59:12,126 job 1657 unable to run: one or more inputs in error state
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:13,601 Cleaning up external metadata files
galaxy.jobs INFO 2012-01-10 20:59:13,650 job 1658 unable to run: one or more inputs in error state
Has anyone else experienced this or have some ideas on how we can further
debug to figure out why galaxy hangs?
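One idea we are considering on our side is registering a signal handler at
startup so we can dump every thread's stack the next time it hangs and see
what the DRMAA monitor thread is blocked on. A rough, untested sketch
(where exactly to hook it into Galaxy's startup is a guess on our part):

import logging
import signal
import sys
import threading
import traceback

log = logging.getLogger( __name__ )

def dump_threads( signum, frame ):
    # map thread idents to names so the dump is readable
    id2name = dict( ( t.ident, t.name ) for t in threading.enumerate() )
    lines = []
    for thread_id, stack in sys._current_frames().items():
        lines.append( "Thread %s (%s):\n" % ( id2name.get( thread_id, "?" ), thread_id ) )
        lines.extend( traceback.format_stack( stack ) )
    log.warning( "SIGUSR1 thread dump:\n%s", "".join( lines ) )

# then `kill -USR1 <galaxy pid>` writes the dump to the log
signal.signal( signal.SIGUSR1, dump_threads )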
Thanks much!
Ann Black-Ziegelbein