So I’ve been happily using Pulsar to send all my Galaxy server jobs to our cluster here at UCL for several months now (I love it!). I am now exploring the ‘run-as-real-user’ option for DRMAA submissions and have run into a problem. The files are correctly staged, correctly chowned, and successfully submitted to the queue, and the job runs. However, at job end (during collection?) Pulsar fails with the following error message:
Exception happened during processing of request from ('*.*.*.*', 54321)
Traceback (most recent call last):
  File "/opt/rocks/lib/python2.6/site-packages/Paste-2.0.1-py2.6.egg/paste/httpserver.py", line 1072, in process_request_in_thread
    self.finish_request(request, client_address)
  File "/opt/rocks/lib/python2.6/SocketServer.py", line 322, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/opt/rocks/lib/python2.6/SocketServer.py", line 617, in __init__
    self.handle()
  File "/opt/rocks/lib/python2.6/site-packages/Paste-2.0.1-py2.6.egg/paste/httpserver.py", line 446, in handle
    BaseHTTPRequestHandler.handle(self)
  File "/opt/rocks/lib/python2.6/BaseHTTPServer.py", line 329, in handle
    self.handle_one_request()
  File "/opt/rocks/lib/python2.6/site-packages/Paste-2.0.1-py2.6.egg/paste/httpserver.py", line 441, in handle_one_request
    self.wsgi_execute()
  File "/opt/rocks/lib/python2.6/site-packages/Paste-2.0.1-py2.6.egg/paste/httpserver.py", line 291, in wsgi_execute
    self.wsgi_start_response)
  File "/cluster/galaxy/pulsar/pulsar/web/framework.py", line 39, in __call__
    return controller(environ, start_response, **request_args)
  File "/cluster/galaxy/pulsar/pulsar/web/framework.py", line 144, in controller_replacement
    result = self.__execute_request(func, args, req, environ)
  File "/cluster/galaxy/pulsar/pulsar/web/framework.py", line 124, in __execute_request
    result = func(**args)
  File "/cluster/galaxy/pulsar/pulsar/web/routes.py", line 82, in status
    return status_dict(manager, job_id)
  File "/cluster/galaxy/pulsar/pulsar/manager_endpoint_util.py", line 12, in status_dict
    job_status = manager.get_status(job_id)
  File "/cluster/galaxy/pulsar/pulsar/managers/stateful.py", line 95, in get_status
    proxy_status, state_change = self.__proxy_status(job_directory, job_id)
  File "/cluster/galaxy/pulsar/pulsar/managers/stateful.py", line 115, in __proxy_status
    proxy_status = self._proxied_manager.get_status(job_id)
  File "/cluster/galaxy/pulsar/pulsar/managers/queued_external_drmaa_original.py", line 62, in get_status
    external_status = super(ExternalDrmaaQueueManager, self)._get_status_external(external_id)
  File "/cluster/galaxy/pulsar/pulsar/managers/base/base_drmaa.py", line 31, in _get_status_external
    drmaa_state = self.drmaa_session.job_status(external_id)
  File "/cluster/galaxy/pulsar/pulsar/managers/util/drmaa/__init__.py", line 50, in job_status
    return self.session.jobStatus(str(external_job_id))
  File "build/bdist.linux-x86_64/egg/drmaa/session.py", line 518, in jobStatus
    c(drmaa_job_ps, jobId, byref(status))
  File "build/bdist.linux-x86_64/egg/drmaa/helpers.py", line 299, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "build/bdist.linux-x86_64/egg/drmaa/errors.py", line 151, in error_check
    raise _ERRORS[code - 1](error_string)
InvalidJobException: code 18: The job specified by the 'jobid' does not exist.
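For what it’s worth, the final exception is easy to trigger outside Pulsar. This sketch (with a made-up job id) shows the failing drmaa-python call in isolation:

import drmaa
from drmaa.errors import InvalidJobException

session = drmaa.Session()
session.initialize()
try:
    # jobStatus() raises InvalidJobException ("code 18") when the id is
    # unknown to this session -- e.g. the job was submitted by another
    # process, or has already finished and been reaped by the DRM.
    print(session.jobStatus("1234567"))  # hypothetical job id
except InvalidJobException as exc:
    print("job unknown to this session: %s" % exc)
finally:
    session.exit()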
I am running Galaxy 15.10 and Python 2.7.10 on my iMac for the server; the cluster submission node is running Pulsar 0.5.0 and Python 2.7.12.
For these tests I run Pulsar in an interactive window, so I have not set up the sudoers file; instead I enter the sudo password when Pulsar asks for it (at the first step, chowning the staging directory; see the sketch after this paragraph). I also have rewrites set up in Galaxy’s pulsar_actions.yml, and I am using remote_scp for the file transfers rather than http. I have also tried switching back to http (as I noticed that caching, which I am also testing, does not work with scp transfers), but I get an identical set of error messages.
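To be concrete about that chown step, the interactive flow amounts to something like the following (just a sketch of my understanding, not Pulsar’s actual code; the staging path and username are made up):

import subprocess

# Hand ownership of the staged files to the real user before submission.
# Run interactively, sudo prompts for a password here; a sudoers entry
# would let it run unattended. Path and user are hypothetical.
staging_dir = "/cluster/galaxy/pulsar/files/staging/42"
real_user = "jdoe"
subprocess.check_call(["sudo", "chown", "-R", real_user, staging_dir])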
As I say, I have no trouble using a regular queued_drmaa manager in Pulsar. Any ideas what the problem might be?