AttributeError: type object 'InvalidJobException' has no attribute 'name'
Hi,

After the last update I'm getting the following error. The job is submitted to SGE and executed, but Galaxy doesn't get the result and keeps showing the job as executing (yellow box). Any clues?

Thanks,
Adhemar

galaxy.jobs.runners ERROR 2013-10-08 13:01:18,488 Unhandled exception checking active jobs
Traceback (most recent call last):
  File "/opt/bioinformatics/share/galaxy20130410/lib/galaxy/jobs/runners/__init__.py", line 362, in monitor
    self.check_watched_items()
  File "/opt/bioinformatics/share/galaxy20130410/lib/galaxy/jobs/runners/drmaa.py", line 217, in check_watched_items
    log.warning( "(%s/%s) job check resulted in %s: %s", galaxy_id_tag, external_job_id, e.__class__.name, e )
AttributeError: type object 'InvalidJobException' has no attribute 'name'
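The immediate cause is the last frame of that traceback: a Python class exposes its name through the dunder attribute __name__, not a plain name attribute, so the log.warning() call itself raises. A minimal standalone sketch (not the Galaxy source; the exception class here is a stand-in for drmaa's) that reproduces the error:

# Minimal standalone sketch, not the Galaxy source: a class's name lives
# in the dunder attribute __name__, so looking up a plain "name" attribute
# on an exception class raises the AttributeError seen in the traceback.
class InvalidJobException(Exception):
    pass

e = InvalidJobException("code 18: The job specified by the 'jobid' does not exist.")

print e.__class__.__name__      # prints: InvalidJobException

try:
    print e.__class__.name      # the attribute lookup that crashes at drmaa.py line 217
except AttributeError, err:
    print err                   # type object 'InvalidJobException' has no attribute 'name'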
In order to test that, I've just downloaded a new galaxy-central and configured it to submit jobs to our SGE cluster. Same problem. The job starts and finishes, but Galaxy keeps reporting that it's still running. I've also attached the job_conf.xml file. Need help, please!

-Adhemar

galaxy.jobs.runners ERROR 2013-10-08 15:29:16,721 Unhandled exception checking active jobs
Traceback (most recent call last):
  File "/opt/bioinformatics/share/galaxy20131008/lib/galaxy/jobs/runners/__init__.py", line 362, in monitor
    self.check_watched_items()
  File "/opt/bioinformatics/share/galaxy20131008/lib/galaxy/jobs/runners/drmaa.py", line 217, in check_watched_items
    log.warning( "(%s/%s) job check resulted in %s: %s", galaxy_id_tag, external_job_id, e.__class__.name, e )
AttributeError: type object 'InvalidJobException' has no attribute 'name'

<?xml version="1.0"?>
<job_conf>
    <plugins workers="10">
        <!-- <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner" workers="10"/> -->
        <plugin id="sge" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner" workers="10"/>
    </plugins>
    <handlers default="handlers">
        <handler id="main"/>
        <handler id="handler0" tags="handlers"/>
        <handler id="handler1" tags="handlers"/>
        <handler id="handler2" tags="handlers"/>
        <handler id="handler3" tags="handlers"/>
        <handler id="handler4" tags="handlers"/>
        <handler id="handler5" tags="handlers"/>
        <handler id="handler6" tags="handlers"/>
        <handler id="handler7" tags="handlers"/>
        <handler id="handler8" tags="handlers"/>
        <handler id="handler9" tags="handlers"/>
    </handlers>
    <destinations default="sge_cluster">
        <!-- <destination id="local" runner="local"/> -->
        <destination id="sge_cluster" runner="sge" tags="longjobs">
            <param id="nativeSpecification">-V -q galaxy.q</param>
        </destination>
    </destinations>
</job_conf>

2013/10/8 Adhemar <azneto@gmail.com>
Hi, After the last update I'm getting the following error. The job is submitted to SGE and executed, but Galaxy doesn't get the result and keeps showing the job as executing (yellow box). Any clues? Thanks, Adhemar
galaxy.jobs.runners ERROR 2013-10-08 13:01:18,488 Unhandled exception checking active jobs
Traceback (most recent call last):
  File "/opt/bioinformatics/share/galaxy20130410/lib/galaxy/jobs/runners/__init__.py", line 362, in monitor
    self.check_watched_items()
  File "/opt/bioinformatics/share/galaxy20130410/lib/galaxy/jobs/runners/drmaa.py", line 217, in check_watched_items
    log.warning( "(%s/%s) job check resulted in %s: %s", galaxy_id_tag, external_job_id, e.__class__.name, e )
AttributeError: type object 'InvalidJobException' has no attribute 'name'
On Tue, Oct 8, 2013 at 5:03 PM, Adhemar <azneto@gmail.com> wrote:
Hi, After the last update I'm getting the following error. The job is submitted to SGE and executed, but Galaxy doesn't get the result and keeps showing the job as executing (yellow box). Any clues? Thanks, Adhemar
galaxy.jobs.runners ERROR 2013-10-08 13:01:18,488 Unhandled exception checking active jobs
Traceback (most recent call last):
  File "/opt/bioinformatics/share/galaxy20130410/lib/galaxy/jobs/runners/__init__.py", line 362, in monitor
    self.check_watched_items()
  File "/opt/bioinformatics/share/galaxy20130410/lib/galaxy/jobs/runners/drmaa.py", line 217, in check_watched_items
    log.warning( "(%s/%s) job check resulted in %s: %s", galaxy_id_tag, external_job_id, e.__class__.name, e )
AttributeError: type object 'InvalidJobException' has no attribute 'name'
Same here, running galaxy-central with an SGE cluster (actually UGE, but the same DRMAA wrapper etc.) when cancelling several jobs via qdel at the command line:

galaxy.jobs.runners ERROR 2013-10-10 15:16:35,731 Unhandled exception checking active jobs
Traceback (most recent call last):
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/__init__.py", line 362, in monitor
    self.check_watched_items()
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 217, in check_watched_items
    log.warning( "(%s/%s) job check resulted in %s: %s", galaxy_id_tag, external_job_id, e.__class__.name, e )
AttributeError: type object 'InvalidJobException' has no attribute 'name'

$ hg branch
default
[galaxy@ppserver galaxy-central]$ hg heads | more
changeset:   11871:c8b55344e779
tag:         tip
user:        Ross Lazarus <ross.lazarus@gmail.com>
date:        Tue Oct 08 16:30:54 2013 +1100
summary:     Proper removal of rgenetics deprecated tool wrappers

changeset:   11818:1f0e7ae9e324
branch:      stable
parent:      11761:a477486bf18e
user:        Daniel Blankenberg <dan@bx.psu.edu>
date:        Sun Sep 29 16:04:31 2013 +1000
summary:     Add additional check and slice to _sniffnfix_pg9_hex(). Fixes issue seen when attempting to view saved visualizations. Further investigation may be needed.
...

Killing Galaxy and restarting didn't fix this; the errors persist. I tried this fix to solve the attribute error in the logging call:

$ hg diff /mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py
diff -r c8b55344e779 lib/galaxy/jobs/runners/drmaa.py
--- a/lib/galaxy/jobs/runners/drmaa.py	Tue Oct 08 16:30:54 2013 +1100
+++ b/lib/galaxy/jobs/runners/drmaa.py	Thu Oct 10 15:21:56 2013 +0100
@@ -214,7 +214,10 @@
                 state = self.ds.jobStatus( external_job_id )
             # TODO: probably need to keep track of InvalidJobException count and remove after it exceeds some configurable
             except ( drmaa.DrmCommunicationException, drmaa.InternalException, drmaa.InvalidJobException ), e:
-                log.warning( "(%s/%s) job check resulted in %s: %s", galaxy_id_tag, external_job_id, e.__class__.name, e )
+                if hasattr(e.__class__, "name"):
+                    log.warning( "(%s/%s) job check resulted in %s: %s", galaxy_id_tag, external_job_id, e.__class__.name, e )
+                else:
+                    log.warning( "(%s/%s) job check resulted in: %s", galaxy_id_tag, external_job_id, e )
                 new_watched.append( ajs )
                 continue
             except Exception, e:

Now I get lots of these lines instead:

galaxy.jobs.runners.drmaa WARNING 2013-10-10 15:22:16,489 (251/11372) job check resulted in: code 18: The job specified by the 'jobid' does not exist.
galaxy.jobs.runners.drmaa WARNING 2013-10-10 15:22:16,533 (252/11373) job check resulted in: code 18: The job specified by the 'jobid' does not exist.
galaxy.jobs.runners.drmaa WARNING 2013-10-10 15:22:17,580 (253/11374) job check resulted in: code 18: The job specified by the 'jobid' does not exist.
galaxy.jobs.runners.drmaa WARNING 2013-10-10 15:22:17,624 (254/11375) job check resulted in: code 18: The job specified by the 'jobid' does not exist.
galaxy.jobs.runners.drmaa WARNING 2013-10-10 15:22:17,668 (255/11376) job check resulted in: code 18: The job specified by the 'jobid' does not exist.
galaxy.jobs.runners.drmaa WARNING 2013-10-10 15:22:17,712 (256/11377) job check resulted in: code 18: The job specified by the 'jobid' does not exist.

(this seems to repeat, endlessly)

I manually killed the jobs from the Galaxy history, and restarted Galaxy again. That seemed to fix this.

If the DRMAA layer says the job was invalid (which is what I am assuming InvalidJobException means) then surely it failed?
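For context on those code 18 warnings: under SGE/UGE the qmaster forgets a job once it has finished or been qdel'd, so polling its status through the DRMAA bindings raises InvalidJobException rather than returning a terminal state. A quick interactive check along these lines (the job id is illustrative) should show the same behavior:

# Quick interactive check with the drmaa Python bindings; the job id is
# illustrative. Polling a job the qmaster has already forgotten raises
# InvalidJobException (DRMAA error code 18) instead of returning a
# terminal state such as done or failed.
import drmaa

s = drmaa.Session()
s.initialize()
try:
    print s.jobStatus('11372')           # a job id SGE no longer knows about
except drmaa.InvalidJobException, e:
    print 'qmaster has forgotten it:', e
finally:
    s.exit()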
Perhaps something like this (untested)?

$ hg diff /mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py
diff -r c8b55344e779 lib/galaxy/jobs/runners/drmaa.py
--- a/lib/galaxy/jobs/runners/drmaa.py	Tue Oct 08 16:30:54 2013 +1100
+++ b/lib/galaxy/jobs/runners/drmaa.py	Thu Oct 10 15:27:28 2013 +0100
@@ -213,10 +213,15 @@
                 assert external_job_id not in ( None, 'None' ), '(%s/%s) Invalid job id' % ( galaxy_id_tag, external_job_id )
                 state = self.ds.jobStatus( external_job_id )
             # TODO: probably need to keep track of InvalidJobException count and remove after it exceeds some configurable
-            except ( drmaa.DrmCommunicationException, drmaa.InternalException, drmaa.InvalidJobException ), e:
+            except ( drmaa.DrmCommunicationException, drmaa.InternalException ), e:
                 log.warning( "(%s/%s) job check resulted in %s: %s", galaxy_id_tag, external_job_id, e.__class__.name, e )
                 new_watched.append( ajs )
                 continue
+            except drmaa.InvalidJobException, e:
+                log.warning( "(%s/%s) job check resulted in: %s", galaxy_id_tag, external_job_id, e )
+                ajs.fail_message = str(e)
+                self.work_queue.put( ( self.fail_job, ajs ) )
+                continue
             except Exception, e:
                 # so we don't kill the monitor thread
                 log.exception( "(%s/%s) Unable to check job status: %s" % ( galaxy_id_tag, external_job_id, str( e ) ) )

Peter
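The TODO comment preserved in the context lines above hints at a third option: tolerate a few consecutive InvalidJobException hits per job (to ride out transient qmaster glitches) before giving up on it. A standalone sketch of that bookkeeping, with the threshold and all names invented for illustration:

# Standalone sketch of the retry-count idea from the TODO comment above;
# the threshold and all names are invented for illustration.
from collections import defaultdict

MAX_INVALID_CHECKS = 5            # give up after this many consecutive misses

invalid_counts = defaultdict(int)

def job_vanished(job_id):
    """Record one InvalidJobException for job_id; return True once the
    job has been missing long enough to be treated as finished/failed."""
    invalid_counts[job_id] += 1
    return invalid_counts[job_id] >= MAX_INVALID_CHECKS

def job_seen(job_id):
    """Reset the counter whenever jobStatus() succeeds for job_id."""
    invalid_counts.pop(job_id, None)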
Hi all,

The recent changes to the DRMAA runner are for better handling of job-ending conditions under slurm, but it looks like SGE has different behavior when a job finishes. I'll provide a fix for this shortly; in the meantime, it's fine to use a slightly older version of drmaa.py.

--nate

On Oct 10, 2013, at 10:31 AM, Peter Cock wrote:
On Tue, Oct 8, 2013 at 5:03 PM, Adhemar <azneto@gmail.com> wrote:
Hi, After the last update I'm getting the following error. The job is submitted to SGE and executed, but Galaxy doesn't get the result and keeps showing the job as executing (yellow box). Any clues? Thanks, Adhemar
galaxy.jobs.runners ERROR 2013-10-08 13:01:18,488 Unhandled exception checking active jobs
Traceback (most recent call last):
  File "/opt/bioinformatics/share/galaxy20130410/lib/galaxy/jobs/runners/__init__.py", line 362, in monitor
    self.check_watched_items()
  File "/opt/bioinformatics/share/galaxy20130410/lib/galaxy/jobs/runners/drmaa.py", line 217, in check_watched_items
    log.warning( "(%s/%s) job check resulted in %s: %s", galaxy_id_tag, external_job_id, e.__class__.name, e )
AttributeError: type object 'InvalidJobException' has no attribute 'name'
Same here, running galaxy-central with an SGE cluster (actually UGE but the same DRMAA wrapper etc) when cancelling several jobs via qdel at the command line:
galaxy.jobs.runners ERROR 2013-10-10 15:16:35,731 Unhandled exception checking active jobs
Traceback (most recent call last):
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/__init__.py", line 362, in monitor
    self.check_watched_items()
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 217, in check_watched_items
    log.warning( "(%s/%s) job check resulted in %s: %s", galaxy_id_tag, external_job_id, e.__class__.name, e )
AttributeError: type object 'InvalidJobException' has no attribute 'name'
$ hg branch
default
[galaxy@ppserver galaxy-central]$ hg heads | more
changeset:   11871:c8b55344e779
tag:         tip
user:        Ross Lazarus <ross.lazarus@gmail.com>
date:        Tue Oct 08 16:30:54 2013 +1100
summary:     Proper removal of rgenetics deprecated tool wrappers

changeset:   11818:1f0e7ae9e324
branch:      stable
parent:      11761:a477486bf18e
user:        Daniel Blankenberg <dan@bx.psu.edu>
date:        Sun Sep 29 16:04:31 2013 +1000
summary:     Add additional check and slice to _sniffnfix_pg9_hex(). Fixes issue seen when attempting to view saved visualizations. Further investigation may be needed.
...
Killing Galaxy and restarting didn't fix this; the errors persist. I tried this fix to solve the attribute error in the logging call:
$ hg diff /mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py
diff -r c8b55344e779 lib/galaxy/jobs/runners/drmaa.py
--- a/lib/galaxy/jobs/runners/drmaa.py	Tue Oct 08 16:30:54 2013 +1100
+++ b/lib/galaxy/jobs/runners/drmaa.py	Thu Oct 10 15:21:56 2013 +0100
@@ -214,7 +214,10 @@
                 state = self.ds.jobStatus( external_job_id )
             # TODO: probably need to keep track of InvalidJobException count and remove after it exceeds some configurable
             except ( drmaa.DrmCommunicationException, drmaa.InternalException, drmaa.InvalidJobException ), e:
-                log.warning( "(%s/%s) job check resulted in %s: %s", galaxy_id_tag, external_job_id, e.__class__.name, e )
+                if hasattr(e.__class__, "name"):
+                    log.warning( "(%s/%s) job check resulted in %s: %s", galaxy_id_tag, external_job_id, e.__class__.name, e )
+                else:
+                    log.warning( "(%s/%s) job check resulted in: %s", galaxy_id_tag, external_job_id, e )
                 new_watched.append( ajs )
                 continue
             except Exception, e:
Now I get lots of these lines instead:
galaxy.jobs.runners.drmaa WARNING 2013-10-10 15:22:16,489 (251/11372) job check resulted in: code 18: The job specified by the 'jobid' does not exist.
galaxy.jobs.runners.drmaa WARNING 2013-10-10 15:22:16,533 (252/11373) job check resulted in: code 18: The job specified by the 'jobid' does not exist.
galaxy.jobs.runners.drmaa WARNING 2013-10-10 15:22:17,580 (253/11374) job check resulted in: code 18: The job specified by the 'jobid' does not exist.
galaxy.jobs.runners.drmaa WARNING 2013-10-10 15:22:17,624 (254/11375) job check resulted in: code 18: The job specified by the 'jobid' does not exist.
galaxy.jobs.runners.drmaa WARNING 2013-10-10 15:22:17,668 (255/11376) job check resulted in: code 18: The job specified by the 'jobid' does not exist.
galaxy.jobs.runners.drmaa WARNING 2013-10-10 15:22:17,712 (256/11377) job check resulted in: code 18: The job specified by the 'jobid' does not exist.

(this seems to repeat, endlessly)
I manually killed the jobs from the Galaxy history, and restarted Galaxy again. That seemed to fix this.
If the DRMAA layer says the job was invalid (which is what I am assuming InvalidJobException means) then surely it failed? Perhaps something like this (untested)?
$ hg diff /mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py
diff -r c8b55344e779 lib/galaxy/jobs/runners/drmaa.py
--- a/lib/galaxy/jobs/runners/drmaa.py	Tue Oct 08 16:30:54 2013 +1100
+++ b/lib/galaxy/jobs/runners/drmaa.py	Thu Oct 10 15:27:28 2013 +0100
@@ -213,10 +213,15 @@
                 assert external_job_id not in ( None, 'None' ), '(%s/%s) Invalid job id' % ( galaxy_id_tag, external_job_id )
                 state = self.ds.jobStatus( external_job_id )
             # TODO: probably need to keep track of InvalidJobException count and remove after it exceeds some configurable
-            except ( drmaa.DrmCommunicationException, drmaa.InternalException, drmaa.InvalidJobException ), e:
+            except ( drmaa.DrmCommunicationException, drmaa.InternalException ), e:
                 log.warning( "(%s/%s) job check resulted in %s: %s", galaxy_id_tag, external_job_id, e.__class__.name, e )
                 new_watched.append( ajs )
                 continue
+            except drmaa.InvalidJobException, e:
+                log.warning( "(%s/%s) job check resulted in: %s", galaxy_id_tag, external_job_id, e )
+                ajs.fail_message = str(e)
+                self.work_queue.put( ( self.fail_job, ajs ) )
+                continue
             except Exception, e:
                 # so we don't kill the monitor thread
                 log.exception( "(%s/%s) Unable to check job status: %s" % ( galaxy_id_tag, external_job_id, str( e ) ) )
Peter
On Thu, Oct 10, 2013 at 8:20 PM, Nate Coraor <nate@bx.psu.edu> wrote:
Hi all,
The recent changes to the DRMAA runner are for better handling of job-ending conditions under slurm, but it looks like SGE has different behavior when a job finishes. I'll provide a fix for this shortly; in the meantime, it's fine to use a slightly older version of drmaa.py.
--nate
Thanks Nate,

So far I've only seen this once so it isn't urgent for me.

Peter
On Fri, Oct 11, 2013 at 9:12 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Thu, Oct 10, 2013 at 8:20 PM, Nate Coraor <nate@bx.psu.edu> wrote:
Hi all,
The recent changes to the DRMAA runner are for better handling of job-ending conditions under slurm, but it looks like SGE has different behavior when a job finishes. I'll provide a fix for this shortly; in the meantime, it's fine to use a slightly older version of drmaa.py.
--nate
Thanks Nate,
So far I've only seen this once so it isn't urgent for me.
Peter
Hi Nate,

I see you've fixed the name attribute error:
https://bitbucket.org/galaxy/galaxy-central/commits/ff76fd33b81cdde1fb270de6...

However it seems the underlying problem (job check resulted in: code 18: The job specified by the 'jobid' does not exist.) is affecting other people now:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2013-October/017002.html

Peter
On Oct 14, 2013, at 6:07 AM, Peter Cock wrote:
On Fri, Oct 11, 2013 at 9:12 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Thu, Oct 10, 2013 at 8:20 PM, Nate Coraor <nate@bx.psu.edu> wrote:
Hi all,
The recent changes to the DRMAA runner are for better handling of job-ending conditions under slurm, but it looks like SGE has different behavior when a job finishes. I'll provide a fix for this shortly; in the meantime, it's fine to use a slightly older version of drmaa.py.
--nate
Thanks Nate,
So far I've only seen this once so it isn't urgent for me.
Peter
Hi Nate,
I see you've fixed the name attribute error: https://bitbucket.org/galaxy/galaxy-central/commits/ff76fd33b81cdde1fb270de6...
However it seems the underlying problem (job check resulted in: code 18: The job specified by the 'jobid' does not exist.) is affecting other people now: http://lists.bx.psu.edu/pipermail/galaxy-dev/2013-October/017002.html
Peter
Hi Peter,

I'm working on refactoring the DRMAA runner to allow for these different DRM behaviors without duplicating the code. In the interim, I've reverted the change to the DRMAA runner that resulted in the observed behavior, in changeset d46b64f12c52.

Thanks,
--nate
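One shape such a refactoring could take, sketched hypothetically here (this is not Nate's actual code): keep a single polling loop in a base runner and let a per-DRM subclass decide what an InvalidJobException means. All names below are invented:

# Hypothetical, self-contained sketch of the idea, not Nate's actual
# refactoring. A single polling path lives in the base class; each DRM
# subclass decides what an InvalidJobException means for it.

class InvalidJobException(Exception):
    pass

class BaseDRMAARunner(object):
    def check_job(self, job_id):
        try:
            return self.job_status(job_id)
        except InvalidJobException, e:
            # Delegate the policy decision to the DRM-specific subclass.
            return self.on_invalid_job(job_id, e)

    def job_status(self, job_id):
        raise NotImplementedError

    def on_invalid_job(self, job_id, e):
        # Default policy: an unknown job id is treated as a failure.
        return 'failed'

class SGERunner(BaseDRMAARunner):
    def job_status(self, job_id):
        # SGE's qmaster forgets finished jobs, so polling one raises.
        raise InvalidJobException("code 18: The job specified by the "
                                  "'jobid' does not exist.")

    def on_invalid_job(self, job_id, e):
        # Under SGE/UGE, "unknown job" after submission usually means "done".
        return 'finished'

print SGERunner().check_job(11372)      # prints: finished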
participants (3)

- Adhemar
- Nate Coraor
- Peter Cock