On Tue, Oct 8, 2013 at 5:03 PM, Adhemar <azneto@gmail.com> wrote:
Hi,

After the last update I'm getting the following error. The job is submitted to SGE and executed, but Galaxy doesn't get the result and keeps showing the job as executing (yellow box).

Any clues?

Thanks,
Adhemar

galaxy.jobs.runners ERROR 2013-10-08 13:01:18,488 Unhandled exception checking active jobs
Traceback (most recent call last):
  File "/opt/bioinformatics/share/galaxy20130410/lib/galaxy/jobs/runners/__init__.py", line 362, in monitor
    self.check_watched_items()
  File "/opt/bioinformatics/share/galaxy20130410/lib/galaxy/jobs/runners/drmaa.py", line 217, in check_watched_items
    log.warning( "(%s/%s) job check resulted in %s: %s", galaxy_id_tag, external_job_id, e.__class__.name, e )
AttributeError: type object 'InvalidJobException' has no attribute 'name'

Same here, running galaxy-central with an SGE cluster (actually UGE, but with the same DRMAA wrapper etc.), when cancelling several jobs via qdel at the command line:

galaxy.jobs.runners ERROR 2013-10-10 15:16:35,731 Unhandled exception checking active jobs
Traceback (most recent call last):
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/__init__.py", line 362, in monitor
    self.check_watched_items()
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 217, in check_watched_items
    log.warning( "(%s/%s) job check resulted in %s: %s", galaxy_id_tag, external_job_id, e.__class__.name, e )
AttributeError: type object 'InvalidJobException' has no attribute 'name'

$ hg branch
default
[galaxy@ppserver galaxy-central]$ hg heads | more
changeset:   11871:c8b55344e779
tag:         tip
user:        Ross Lazarus <ross.lazarus@gmail.com>
date:        Tue Oct 08 16:30:54 2013 +1100
summary:     Proper removal of rgenetics deprecated tool wrappers

changeset:   11818:1f0e7ae9e324
branch:      stable
parent:      11761:a477486bf18e
user:        Daniel Blankenberg <dan@bx.psu.edu>
date:        Sun Sep 29 16:04:31 2013 +1000
summary:     Add additional check and slice to _sniffnfix_pg9_hex(). Fixes issue seen when attempting to view saved visualizations. Further investigation may be needed.

...

Killing Galaxy and restarting didn't fix this; the errors persist.

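For what it's worth, the AttributeError in the traceback above looks like a typo in the logging call: Python classes carry their name in __name__, and have no plain name attribute unless one is defined explicitly. A minimal sketch of the difference, assuming a stand-in exception class rather than the real drmaa one (both InvalidJobException as defined here and the describe() helper are hypothetical, for illustration only):

```python
class InvalidJobException(Exception):
    """Stand-in for drmaa.InvalidJobException (assumption: not the real class)."""


def describe(exc):
    # A class exposes its name as __name__; e.__class__.name raises
    # AttributeError unless the class happens to define a `name` attribute.
    return "job check resulted in %s: %s" % (exc.__class__.__name__, exc)


try:
    raise InvalidJobException("code 18: The job specified by the 'jobid' does not exist.")
except InvalidJobException as e:
    message = describe(e)
    print(message)
```

Swapping e.__class__.name for e.__class__.__name__ in the log.warning call should keep the exception type in the log line without needing a hasattr check.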
I tried this fix to solve the attribute error in the logging call:

$ hg diff /mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py
diff -r c8b55344e779 lib/galaxy/jobs/runners/drmaa.py
--- a/lib/galaxy/jobs/runners/drmaa.py  Tue Oct 08 16:30:54 2013 +1100
+++ b/lib/galaxy/jobs/runners/drmaa.py  Thu Oct 10 15:21:56 2013 +0100
@@ -214,7 +214,10 @@
                 state = self.ds.jobStatus( external_job_id )
             # TODO: probably need to keep track of InvalidJobException count and remove after it exceeds some configurable
             except ( drmaa.DrmCommunicationException, drmaa.InternalException, drmaa.InvalidJobException ), e:
-                log.warning( "(%s/%s) job check resulted in %s: %s", galaxy_id_tag, external_job_id, e.__class__.name, e )
+                if hasattr(e.__class__, "name"):
+                    log.warning( "(%s/%s) job check resulted in %s: %s", galaxy_id_tag, external_job_id, e.__class__.name, e )
+                else:
+                    log.warning( "(%s/%s) job check resulted in: %s", galaxy_id_tag, external_job_id, e )
                 new_watched.append( ajs )
                 continue
             except Exception, e:

Now I get lots of these lines instead:

galaxy.jobs.runners.drmaa WARNING 2013-10-10 15:22:16,489 (251/11372) job check resulted in: code 18: The job specified by the 'jobid' does not exist.
galaxy.jobs.runners.drmaa WARNING 2013-10-10 15:22:16,533 (252/11373) job check resulted in: code 18: The job specified by the 'jobid' does not exist.
galaxy.jobs.runners.drmaa WARNING 2013-10-10 15:22:17,580 (253/11374) job check resulted in: code 18: The job specified by the 'jobid' does not exist.
galaxy.jobs.runners.drmaa WARNING 2013-10-10 15:22:17,624 (254/11375) job check resulted in: code 18: The job specified by the 'jobid' does not exist.
galaxy.jobs.runners.drmaa WARNING 2013-10-10 15:22:17,668 (255/11376) job check resulted in: code 18: The job specified by the 'jobid' does not exist.
galaxy.jobs.runners.drmaa WARNING 2013-10-10 15:22:17,712 (256/11377) job check resulted in: code 18: The job specified by the 'jobid' does not exist.
(this seems to repeat, endlessly)

I manually killed the jobs from the Galaxy history, and restarted Galaxy again. That seemed to fix this.

If the DRMAA layer says the job was invalid (which is what I am assuming InvalidJobException means) then surely it failed? Perhaps something like this (untested)?

$ hg diff /mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py
diff -r c8b55344e779 lib/galaxy/jobs/runners/drmaa.py
--- a/lib/galaxy/jobs/runners/drmaa.py  Tue Oct 08 16:30:54 2013 +1100
+++ b/lib/galaxy/jobs/runners/drmaa.py  Thu Oct 10 15:27:28 2013 +0100
@@ -213,10 +213,15 @@
                 assert external_job_id not in ( None, 'None' ), '(%s/%s) Invalid job id' % ( galaxy_id_tag, external_job_id )
                 state = self.ds.jobStatus( external_job_id )
             # TODO: probably need to keep track of InvalidJobException count and remove after it exceeds some configurable
-            except ( drmaa.DrmCommunicationException, drmaa.InternalException, drmaa.InvalidJobException ), e:
+            except ( drmaa.DrmCommunicationException, drmaa.InternalException ), e:
                 log.warning( "(%s/%s) job check resulted in %s: %s", galaxy_id_tag, external_job_id, e.__class__.name, e )
                 new_watched.append( ajs )
                 continue
+            except drmaa.InvalidJobException, e:
+                log.warning( "(%s/%s) job check resulted in: %s", galaxy_id_tag, external_job_id, e )
+                ajs.fail_message = str(e)
+                self.work_queue.put( ( self.fail_job, ajs ) )
+                continue
             except Exception, e:
                 # so we don't kill the monitor thread
                 log.exception( "(%s/%s) Unable to check job status: %s" % ( galaxy_id_tag, external_job_id, str( e ) ) )

Peter
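
P.S. The TODO comment in that block hints at a middle ground: count InvalidJobException results per job and only give up once the count passes some configurable limit, so that a single transient error does not fail a job. A rough sketch of that idea, written outside Galaxy. Everything here is a hypothetical stand-in (InvalidJobException, MAX_INVALID_CHECKS, poll_once and its arguments); none of it is Galaxy or drmaa API, and I have not tried it against a real cluster:

```python
from collections import defaultdict


class InvalidJobException(Exception):
    """Stand-in for drmaa.InvalidJobException (assumption: not the real class)."""


MAX_INVALID_CHECKS = 3  # hypothetical limit; would come from configuration


def poll_once(watched, job_status, invalid_counts, max_invalid=MAX_INVALID_CHECKS):
    """Run one polling pass; return (still_watched, failed).

    watched:        list of external job ids
    job_status:     callable that raises InvalidJobException for unknown jobs
    invalid_counts: dict-like of consecutive invalid results per job id
    """
    still_watched = []
    failed = []
    for job_id in watched:
        try:
            job_status(job_id)
        except InvalidJobException as e:
            invalid_counts[job_id] += 1
            if invalid_counts[job_id] >= max_invalid:
                # Give up: report the job as failed instead of re-watching it.
                failed.append((job_id, str(e)))
            else:
                still_watched.append(job_id)
        else:
            # A successful check resets the consecutive-failure count.
            invalid_counts.pop(job_id, None)
            still_watched.append(job_id)
    return still_watched, failed
```

In the runner, the failed branch would correspond to setting ajs.fail_message and queueing self.fail_job as in the diff above, and invalid_counts would be something like a defaultdict(int) kept across calls to check_watched_items.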