New subject: LWR runner configuration for shared folder in cluster

6 May 2013

Hi all,

I am trying to set up a Galaxy cluster using the LWR runner. The nodes have
a shared filesystem, and in universe_wsgi.ini these parameters are set:

job_working_directory = /mnt/shared
...
clustalw = lwr://http://192.168.33.12:8913
....

This folder has been chown-ed to the galaxy user and is also "a+w". I have
verified that it can be read and written by ssh-ing to each node of the
cluster. The sticky bit is set.
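
A minimal sketch of such a read/write check, assuming it is run as the galaxy user on each node. The helper name can_read_write is hypothetical and not part of Galaxy or LWR:

```python
import os
import uuid


def can_read_write(directory):
    """Return True if a probe file can be written to and read back from directory."""
    # A unique name keeps concurrent checks from different nodes from colliding.
    probe = os.path.join(directory, "probe-" + uuid.uuid4().hex)
    try:
        with open(probe, "w") as handle:
            handle.write("ok")
        with open(probe) as handle:
            return handle.read() == "ok"
    except (IOError, OSError):
        # Missing directory, permission denied, read-only mount, and so on.
        return False
    finally:
        # Clean up the probe file whether or not the check succeeded.
        if os.path.exists(probe):
            os.remove(probe)
```

Calling can_read_write("/mnt/shared") on every node should return True if the shared folder is usable from that node.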

When I try to run jobs (I used clustalw as example) there seems to be 
confusion between where Galaxy puts files and where LWR tries to read
them from. Here are two setups that error out:

1) In server.ini for LWR, the following is set:
staging_directory = /mnt/shared/000

galaxy error:

galaxy.jobs DEBUG 2013-05-06 10:21:22,320 (128) Working directory for job is: /mnt/shared/000/128
galaxy.jobs.handler DEBUG 2013-05-06 10:21:22,320 dispatching job 128 to lwr runner
galaxy.jobs.handler INFO 2013-05-06 10:21:22,427 (128) Job dispatched
galaxy.datatypes.metadata DEBUG 2013-05-06 10:21:22,875 Cleaning up external metadata files
galaxy.jobs.runners.lwr ERROR 2013-05-06 10:21:22,902 failure running job 128

lwr error (on the cluster node):

  File "/home/vagrant/jmchilton-lwr-5213f6dce32d/lwr/app.py", line 81, in setup
    manager.setup_job_directory(job_id)
  File "/home/vagrant/jmchilton-lwr-5213f6dce32d/lwr/manager.py", line 101, in setup_job_directory
    os.mkdir(job_directory)
OSError: [Errno 17] File exists: '/mnt/shared/000/128'
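
The OSError above is what os.mkdir raises when the target directory already exists. A minimal sketch that reproduces it, assuming Galaxy has already created the job working directory before LWR tries to create the same path (a temporary directory stands in for /mnt/shared, and the function name is hypothetical):

```python
import errno
import os
import tempfile


def reproduce_collision():
    """Create the same job directory twice and return the resulting errno."""
    base = tempfile.mkdtemp()  # stands in for /mnt/shared
    job_directory = os.path.join(base, "000", "128")
    # First creator: assumed to be Galaxy, per the "Working directory" log line.
    os.makedirs(job_directory)
    try:
        # Second creator: the os.mkdir call shown in the LWR traceback.
        os.mkdir(job_directory)
    except OSError as exc:
        return exc.errno
    return None


print(reproduce_collision() == errno.EEXIST)
```

This only illustrates the failure mode. It does not show which component ought to own the directory.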

2) In server.ini for LWR, the following is set:
staging_directory = /mnt/shared

galaxy error:

galaxy.jobs DEBUG 2013-05-06 10:28:46,872 (129) Working directory for job is: /mnt/shared/000/129
galaxy.jobs.handler DEBUG 2013-05-06 10:28:46,872 dispatching job 129 to lwr runner
galaxy.jobs.handler INFO 2013-05-06 10:28:46,967 (129) Job dispatched
192.168.33.1 - - [06/May/2013:10:28:48 -0200] "GET /api/histories/2a56795cad3c7db3 HTTP/1.1" 200 - "http://192.168.33.11:8080/history" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31"
galaxy.jobs.runners.lwr DEBUG 2013-05-06 10:28:50,653 run_results {'status': 'status', 'returncode': 0, 'complete': 'true', 'stderr': '', 'stdout': ''}
galaxy.datatypes.metadata DEBUG 2013-05-06 10:28:50,970 Cleaning up external metadata files
galaxy.jobs.runners.lwr ERROR 2013-05-06 10:28:51,050 failure running job 129

lwr error (on the cluster node):

    resp.app_iter = FileIterator(result)
  File "/home/vagrant/jmchilton-lwr-5213f6dce32d/lwr/framework.py", line 111, in __init__
    self.input = open(path, 'rb')
IOError: [Errno 2] No such file or directory: u'/mnt/shared/129/outputs/dataset_170.dat'
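
Both setups can be summarized by comparing the job directory each side appears to use. This is a sketch based only on the paths in the logs above. The two helper functions are hypothetical, and the layouts they encode (a "000" subfolder on the Galaxy side, staging_directory/<job_id> on the LWR side) are assumptions inferred from those paths, not taken from the Galaxy or LWR source:

```python
import posixpath


def galaxy_job_dir(job_working_directory, job_id):
    # Assumption: Galaxy nests job directories under "000", as in
    # "Working directory for job is: /mnt/shared/000/128".
    return posixpath.join(job_working_directory, "000", str(job_id))


def lwr_job_dir(staging_directory, job_id):
    # Assumption: LWR uses staging_directory/<job_id>, as in the
    # IOError path /mnt/shared/129/outputs/dataset_170.dat.
    return posixpath.join(staging_directory, str(job_id))


# Setup 1: both sides resolve to the same path, so the second mkdir fails.
same_path = galaxy_job_dir("/mnt/shared", 128) == lwr_job_dir("/mnt/shared/000", 128)

# Setup 2: the paths differ, so each side looks for outputs in a different place.
different_path = galaxy_job_dir("/mnt/shared", 129) != lwr_job_dir("/mnt/shared", 129)

print(same_path, different_path)
```

If these assumptions hold, the first setup collides on one directory and the second splits the job across two, which matches the two errors above.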

The full stack traces for the first setup are at the end of this email. It might
be something very simple that I am missing, but any feedback would be greatly
appreciated. Thanks!

Ntino

-- 
Konstantinos (Ntino) Krampis, Ph.D.
Asst. Professor, Informatics
J.Craig Venter Institute

kkrampis@jcvi.org
agbiotec@gmail.com
+1-540-200-8277

Web:
http://bit.ly/cloud-research
http://cloudbiolinux.org/
http://twitter.com/agbiotec

---- GALAXY ERROR

galaxy.jobs DEBUG 2013-05-06 10:21:22,320 (128) Working directory for job is: /mnt/shared/000/128
galaxy.jobs.handler DEBUG 2013-05-06 10:21:22,320 dispatching job 128 to lwr runner
galaxy.jobs.handler INFO 2013-05-06 10:21:22,427 (128) Job dispatched
galaxy.datatypes.metadata DEBUG 2013-05-06 10:21:22,875 Cleaning up external metadata files
galaxy.jobs.runners.lwr ERROR 2013-05-06 10:21:22,902 failure running job 128
Traceback (most recent call last):
  File "/home/vagrant/galaxy-dist/lib/galaxy/jobs/runners/lwr.py", line 286, in run_job
    file_stager = FileStager(client, command_line, job_wrapper.extra_filenames, input_files, output_files, job_wrapper.tool.tool_dir)
  File "/home/vagrant/galaxy-dist/lib/galaxy/jobs/runners/lwr.py", line 40, in __init__
    job_config = client.setup()
  File "/home/vagrant/galaxy-dist/lib/galaxy/jobs/runners/lwr.py", line 212, in setup
    return self.__raw_execute_and_parse("setup", { "job_id" : self.job_id })
  File "/home/vagrant/galaxy-dist/lib/galaxy/jobs/runners/lwr.py", line 150, in __raw_execute_and_parse
    response = self.__raw_execute(command, args, data)
  File "/home/vagrant/galaxy-dist/lib/galaxy/jobs/runners/lwr.py", line 146, in __raw_execute
    response = self.url_open(request, data)
  File "/home/vagrant/galaxy-dist/lib/galaxy/jobs/runners/lwr.py", line 134, in url_open
    return urllib2.urlopen(request, data)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: Internal Server Error

---- LWR ERROR

Exception happened during processing of request from ('192.168.33.11', 44802)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/Paste-1.7.5.1-py2.7.egg/paste/httpserver.py", line 1068, in process_request_in_thread
    self.finish_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 323, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.7/SocketServer.py", line 638, in __init__
    self.handle()
  File "/usr/local/lib/python2.7/dist-packages/Paste-1.7.5.1-py2.7.egg/paste/httpserver.py", line 442, in handle
    BaseHTTPRequestHandler.handle(self)
  File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
    self.handle_one_request()
  File "/usr/local/lib/python2.7/dist-packages/Paste-1.7.5.1-py2.7.egg/paste/httpserver.py", line 437, in handle_one_request
    self.wsgi_execute()
  File "/usr/local/lib/python2.7/dist-packages/Paste-1.7.5.1-py2.7.egg/paste/httpserver.py", line 287, in wsgi_execute
    self.wsgi_start_response)
  File "/home/vagrant/jmchilton-lwr-5213f6dce32d/lwr/framework.py", line 35, in __call__
    return controller(environ, start_response, **request_args)
  File "/home/vagrant/jmchilton-lwr-5213f6dce32d/lwr/framework.py", line 90, in controller_replacement
    result = func(**args)
  File "/home/vagrant/jmchilton-lwr-5213f6dce32d/lwr/app.py", line 81, in setup
    manager.setup_job_directory(job_id)
  File "/home/vagrant/jmchilton-lwr-5213f6dce32d/lwr/manager.py", line 101, in setup_job_directory
    os.mkdir(job_directory)
OSError: [Errno 17] File exists: '/mnt/shared/000/128'
