Hi List,
I've sprouted some grays in the last week after my Galaxy instances all simultaneously ceased to submit jobs to our main cluster.
Some Galaxy instances are running the PBS job runner, and others use DRMAA. For the DRMAA runner I was getting:

galaxy.jobs.runners ERROR 2013-10-15 08:40:14,942 (1024) Unhandled exception calling queue_job
Traceback (most recent call last):
  File "galaxy-dist/lib/galaxy/jobs/runners/__init__.py", line 60, in run_next
    method(arg)
  File "galaxy-dist/lib/galaxy/jobs/runners/drmaa.py", line 188, in queue_job
    external_job_id = self.ds.runJob(jt)
  File "build/bdist.linux-x86_64/egg/drmaa/__init__.py", line 331, in runJob
    _h.c(_w.drmaa_run_job, jid, _ct.sizeof(jid), jobTemplate)
  File "build/bdist.linux-x86_64/egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "build/bdist.linux-x86_64/egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
InternalException: code 1: (qsub) cannot access script file: Unauthorized Request MSG=can not authorize request (0-Success)

And in my PBS runner:

galaxy.jobs.runners.pbs WARNING 2013-10-14 17:13:07,319 (550) pbs_submit failed (try 1/5), PBS error 15044: Resources temporarily unavailable
To give some background, I had recently requested a new virtual machine to put my test/dev Galaxy on. I copied our production Galaxy to this new VM. I secured a new domain name for it and set it running. Everything was going well until I tried to hook it up to the cluster; at first I got an error saying that I didn't have permission to submit jobs. Makes sense, the new VM was not a qualified submit host for the cluster. I asked the sysadmins to add the VM as a submit host to the cluster using qmgr. As soon as this was done, not only could I still not submit jobs from the test Galaxy, but no Galaxy was able to submit jobs to the cluster.
The issue isn't with Galaxy itself but with the underlying calls it makes - for DRMAA, I tracked it back to pbs-drmaa/bin/drmaa-run, and for PBS I'm sure it's somewhere in libtorque. In every case, I could call qsub from the command line and it would submit jobs correctly, which made it all the more perplexing.
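To take Galaxy out of the picture entirely, it can help to drive the same code path by hand with the drmaa egg. Something along these lines exercises the same runJob() call that the traceback above goes through (just a rough sketch: /bin/hostname and /tmp are placeholder values, and it assumes the same libdrmaa/pbs-drmaa that Galaxy loads is visible to Python):

import drmaa

s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = '/bin/hostname'   # placeholder test command
jt.workingDirectory = '/tmp'         # placeholder working directory
job_id = s.runJob(jt)                # same call as Galaxy's drmaa runner (self.ds.runJob(jt))
print('submitted %s' % job_id)
s.deleteJobTemplate(jt)
s.exit()

If that little script fails with the same InternalException while plain qsub from the same host works, the problem is below Galaxy, in the DRMAA/pbs-drmaa layer.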
I re-installed python, drmaa.egg, and pbs-drmaa, and rebooted the VM. I of course restarted Galaxy with each step, to no avail. I worked with the admins to see what was happening in the server logs, but the same cryptic error showed up - cannot authorize request. I've run into this issue before, more or less, but usually just gave up on it. It seemed to come and go sporadically, and rebooting the clusters seemed to help.
This time, with our production server no longer functioning, I begged for help and the admins looked through the pbs_server config, but they couldn't find any typos or problems. Reloading the config by sending hangup signals to pbs_server didn't help. Then we tried pausing the scheduler and restarting pbs_server completely - and eureka, all problems went away. The PBS and DRMAA runners are back up and working fine. This really seems to be a bug in Torque 4.1.5.1.
I hope this saves someone a lot of headache! Newer versions of Torque may be the answer. I would also advise against making changes to the pbs_server configuration while in production - we have monthly maintenance, and I don't think I'll ever request changes when there won't be an immediate reboot to flush the server!
Cheers,
Carrie
Hi Carrie,
It is a bug in the Torque 4.x series. It can be fixed for a time by restarting the Torque pbs_server process, but it's going to come back. It's not Galaxy-specific, as any python-drmaa request will fail once Torque starts experiencing the issue.
Regards,
Alex
Hi Alex,
I should say that for the most part, our setup using Torque 4.x has worked. It has failed under the following circumstances:
1.) When the Torque configuration is changed but pbs_server is not restarted.
2.) When trying to route jobs to different clusters - the new cluster may work for just a little while or not at all, though this could be related to #1?
3.) When starting a new Galaxy instance, there are almost always roadblocks.
Once it starts working, though, it seems pretty stable. I can make an effort to keep track of the circumstances surrounding failures. Is there a bug ticket with Adaptive Computing over this issue?
For anyone starting out, make sure that Torque is compiled --with-drmaa, and that the gperf package is installed on the machine.
I also found this: http://www.supercluster.org/pipermail/torqueusers/2013-October/016293.html
Sincerely,
Carrie Ganote
Carrie,
This turned out to be an issue with Torque 4 changing from the pbs_submit() procedure to submit_pbs_hash() while both pbs-drmaa and pbs-python were still using pbs_submit(). The maintainer of the pbs-drmaa library (http://sourceforge.net/p/pbspro-drmaa/wiki/Home/), Mariusz Mamonski, provided us with a fix for pbs-drmaa-1.0.15 today. If you're experiencing the issue frequently, I'd be happy to share the fixed library. Otherwise, I think Mariusz will probably provide a new pbs-drmaa release that incorporates the fix soon.
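For anyone who swaps in the patched library, one quick sanity check (a small sketch using the same drmaa egg, nothing Galaxy-specific) is to ask python-drmaa which DRMAA implementation and DRM it has actually picked up:

import drmaa

s = drmaa.Session()
s.initialize()
print(s.drmaaImplementation)  # implementation string, which should reflect the pbs-drmaa build in use
print(s.drmsInfo)             # the DRM it is talking to (Torque/PBS)
s.exit()

That makes it easier to confirm the rebuilt pbs-drmaa is the one being loaded before restarting Galaxy.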
Regards,
Alex