Re: [galaxy-dev] Galaxy Hang after DrmCommunicationException
excellent, thanks for the correction!

On Wed, Jan 18, 2012 at 8:59 AM, Shantanu Pavgi <pavgi@uab.edu> wrote:
Ed,
I think you may want to add the job with the DrmCommunicationException error back into the watched jobs list.

{{{
except ( drmaa.DrmCommunicationException ), e:
    # Catch drmaa communication exception, log a warning message and
    # continue to monitor jobs.
    log.warning("(%s/%s) Couldn't communicate with the cluster scheduler to check job status." %( galaxy_job_id, job_id ))
    # keep old job state
    new_watched.append( drm_job_state )
    continue
}}}
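The reason the re-append matters: check_watched_items builds new_watched on each pass and, in this version of the runner, the watched list is replaced with it afterwards, so a job that is not re-appended silently stops being monitored. A self-contained sketch of that pattern, with simplified names standing in for the Galaxy runner (not the actual code):

{{{
# Sketch only: jobs that are not re-appended to new_watched stop being
# monitored after the cycle, because the old list is replaced wholesale.

class FakeCommError(Exception):
    """Stand-in for drmaa.DrmCommunicationException."""

def check_watched_items(watched, get_status):
    new_watched = []
    for job in watched:
        try:
            state = get_status(job)  # may raise when the scheduler is too busy
        except FakeCommError:
            # keep the old job state and re-append so it is checked again next cycle
            new_watched.append(job)
            continue
        print("job %s is %s" % (job, state))
        new_watched.append(job)
    # the caller replaces its watched list with this; anything missing is dropped
    return new_watched

# usage: the second job hits a transient communication error but stays watched
watched = ["1654/698075", "1655/698076"]

def get_status(job):
    if job.endswith("6"):
        raise FakeCommError("scheduler timed out")
    return "running"

watched = check_watched_items(watched, get_status)
print(watched)  # both jobs still present
}}}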
Here is a patch and related discussion on it: https://bitbucket.org/galaxy/galaxy-central/pull-request/23/catch-drmcommuni...
-- Shantanu
On Jan 13, 2012, at 2:11 PM, Edward Kirton wrote:
i had seen the job process die with this error:
    if state != old_state:
UnboundLocalError: local variable 'state' referenced before assignment
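That traceback is the symptom of catching the status-check failure but then falling through to the comparison that still needs state; a minimal reproduction, independent of drmaa and with made-up names:

{{{
# Illustrative only, not the Galaxy code path: 'state' is never assigned when
# the status call raises, so the later comparison fails unless the handler
# skips the rest of the iteration (Ed's patch does this with 'continue').

def job_status(job_id):
    # stand-in for self.ds.jobStatus(job_id) timing out against a busy scheduler
    raise RuntimeError("failed receiving gdi request response")

def check_one(job_id, old_state):
    try:
        state = job_status(job_id)
    except RuntimeError:
        pass  # error swallowed, but execution falls through instead of skipping the job
    if state != old_state:  # UnboundLocalError: local variable 'state' referenced before assignment
        print("state changed")

check_one("1654/698075", "running")  # running this raises the UnboundLocalError above
}}}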
since the DRM timeout is an intermittent error, i'm not absolutely positive i have nailed it (a bit more time will tell -- i'm keeping an eye on it with debug log messages), but it seems so. i intended to share this as a patched clone when i became sure but when i read your email i just sent what i had immediately. let us know if that seems to solve the problem for you, so we'll have some confirmation.
glad to help, ed
p.s. i have another patch for "output not returned from cluster" errors that i'm also still validating, due to NFS race since we don't have inode metadata caching turned off as the galaxy developers suggest.
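For anyone hitting the same NFS symptom: one common workaround is a bounded wait-and-retry before declaring the output missing. The sketch below shows only that general pattern with made-up paths; it is not the patch Ed mentions:

{{{
import os
import time

def wait_for_output(path, timeout=60, interval=2):
    """Poll for an output file for a bounded time before treating it as missing."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(interval)
    return False

# usage (hypothetical path)
if not wait_for_output("/tmp/job_output.dat", timeout=6, interval=1):
    print("output not returned from cluster")
}}}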
On Fri, Jan 13, 2012 at 8:06 AM, Ann Black <annblack@eng.uiowa.edu> wrote:
Thanks so much! I have applied the fix to our env.
In looking over the logic, there was an existing exception block that would have caught the communication exception generically, but it (in this scenario erroneously) moved the job into a failure workflow. I would like to understand what ended up hanging galaxy; it must be related to a valid job being moved into a failure state? Did you follow it down the rabbit hole by any chance to see what caused the hang in your env?
Thanks again,
Ann
From: Edward Kirton <eskirton@lbl.gov>
Date: Thu, 12 Jan 2012 13:00:27 -0800
To: Ann Black <annblack@eng.uiowa.edu>
Cc: "galaxy-dev@lists.bx.psu.edu" <galaxy-dev@lists.bx.psu.edu>
Subject: Re: [galaxy-dev] Galaxy Hang after DrmCommunicationException
sometimes the scheduler can't keep up with all the work in its 15sec cycle, so it doesn't respond to some messages. here's a fix i've been trying that seems to work.
in lib/galaxy/jobs/runners/drmaa.py:
def check_watched_items( self ):
    """
    Called by the monitor thread to look at each watched job and deal
    with state changes.
    """
    new_watched = []
    for drm_job_state in self.watched:
        job_id = drm_job_state.job_id
        galaxy_job_id = drm_job_state.job_wrapper.job_id
        old_state = drm_job_state.old_state
        try:
            state = self.ds.jobStatus( job_id )
        # InternalException was reported to be necessary on some DRMs, but
        # this could cause failures to be detected as completion! Please
        # report if you experience problems with this.
        except ( drmaa.InvalidJobException, drmaa.InternalException ), e:
            # we should only get here if an orphaned job was put into the queue at app startup
            log.debug("(%s/%s) job left DRM queue with following message: %s" % ( galaxy_job_id, job_id, e ) )
            self.work_queue.put( ( 'finish', drm_job_state ) )
            continue
        # BEGIN DRM TIMEOUT: Don't fail on scheduler communication error (probably just too busy)
        except ( drmaa.DrmCommunicationException ), e:
            log.warning("(%s/%s) DRM Communication Exception" % ( galaxy_job_id, job_id ))
            continue
        # END DRM TIMEOUT
On Wed, Jan 11, 2012 at 9:18 AM, Ann Black <annblack@eng.uiowa.edu> wrote:
Good Morning galaxy group!
I was hoping that someone might have some ideas on a problem we have experienced a handful of times running galaxy on our local cluster.
Occasionally we experience communication timeouts between our cluster head node and a compute node, which will self heal. However, this in turn will hang galaxy. Below you will see output from our galaxy log file. When the ERROR happens (which is not often) it consistently seems to hang galaxy. We have to kill it off and restart it. We are running galaxy as a single PID at this time (we are still just testing it out, etc.) and it is running on our head node (which we plan to move off of in the future).
galaxy.jobs.runners.drmaa DEBUG 2012-01-10 19:19:58,800 (1654/698075) state change: job is running
galaxy.jobs.runners.drmaa ERROR 2012-01-10 20:57:47,021 (1654/698075) Unable to check job status
Traceback (most recent call last):
  File "/data/galaxy-dist/lib/galaxy/jobs/runners/drmaa.py", line 236, in check_watched_items
    state = self.ds.jobStatus( job_id )
  File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/__init__.py", line 522, in jobStatus
    _h.c(_w.drmaa_job_ps, jobName, _ct.byref(status))
  File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
DrmCommunicationException: code 2: failed receiving gdi request response for mid=24442 (got syncron message receive timeout error).
galaxy.jobs.runners.drmaa WARNING 2012-01-10 20:58:05,090 (1654/698075) job will now be errored
galaxy.jobs.runners.drmaa DEBUG 2012-01-10 20:59:06,396 (1654/698075) User killed running job, but error encountered removing from DRM queue: code 2: failed receiving gdi request response for mid=24444 (got syncron message receive timeout error).
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:06,896 Cleaning up external metadata files
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:06,947 Failed to cleanup MetadataTempFile temp files from database/tmp/metadata_out_HistoryDatasetAssociation_2913_ZUTgBy: No JSON object could be decoded: line 1 column 0 (char 0)
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:09,640 Cleaning up external metadata files
galaxy.jobs INFO 2012-01-10 20:59:09,697 job 1656 unable to run: one or more inputs in error state
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:10,121 Cleaning up external metadata files
galaxy.jobs INFO 2012-01-10 20:59:10,159 job 1655 unable to run: one or more inputs in error state
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:12,076 Cleaning up external metadata files
galaxy.jobs INFO 2012-01-10 20:59:12,126 job 1657 unable to run: one or more inputs in error state
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:13,601 Cleaning up external metadata files
galaxy.jobs INFO 2012-01-10 20:59:13,650 job 1658 unable to run: one or more inputs in error state
Has anyone else experienced this or have some ideas on how we can further debug to figure out why galaxy hangs?
Thanks much!
Ann Black-Ziegelbein
perhaps a 15sec sleep should also be added, since the scheduler is overwhelmed

{{{
except ( drmaa.DrmCommunicationException ), e:
    # Catch drmaa communication exception, log a warning message and
    # continue to monitor jobs.
    log.warning("(%s/%s) Couldn't communicate with the cluster scheduler to check job status." %( galaxy_job_id, job_id ))
    # give scheduler time to catch up
    time.sleep( 15 )
    # keep old job state
    new_watched.append( drm_job_state )
    continue
}}}
I haven't added any such delay after DrmCommunicationException, however I have thought about increasing it in the job monitor function itself (site-specific change, not suggesting it in the blessed galaxy repo). https://bitbucket.org/galaxy/galaxy-central/src/cb8d6a29ea22/lib/galaxy/jobs...

-- Shantanu
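To make the contrast concrete: Ed's snippet sleeps inside the per-job exception handler (so the rest of that pass waits too), while the alternative described here is to lengthen the delay between passes of the monitor loop itself. The names below are made up for illustration and are not the actual Galaxy monitor code:

{{{
import time

CYCLE_INTERVAL = 15  # seconds between passes over the watched jobs (illustrative)

class TinyRunner(object):
    """Toy stand-in for the DRMAA runner; just counts monitor passes."""
    def __init__(self):
        self.running = True
        self.passes = 0
    def check_watched_items(self):
        self.passes += 1          # stand-in for the real per-job status checks
        if self.passes >= 3:
            self.running = False  # stop the demo after a few cycles

def monitor(runner, interval=CYCLE_INTERVAL):
    # one pass over all watched jobs per cycle, then sleep for the whole interval
    while runner.running:
        runner.check_watched_items()
        time.sleep(interval)

monitor(TinyRunner(), interval=0.1)  # short interval so the demo finishes quickly
print("monitor loop exited after 3 passes")
}}}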
participants (2)
- Edward Kirton
- Shantanu Pavgi