Odd (possible latency/race?) problems post-upgrade

30 Jul 2012

I'm seeing a possible latency issue or race condition when starting Galaxy after the latest hg upgrade (July 20) from galaxy-dist; the prior upgrade didn't have this problem.  We have a small setup with one job manager/runner and two web front-ends for testing load balancing:

…from universe_wsgi.ini:
------------------------------
[server:web0]
use = egg:Paste#http
port = 8080
host = 127.0.0.1
use_threadpool = true
threadpool_workers = 7

[server:web1]
use = egg:Paste#http
port = 8081
host = 127.0.0.1
use_threadpool = true
threadpool_workers = 7

[server:manager]
use = egg:Paste#http
port = 8079
host = 127.0.0.1
use_threadpool = true
threadpool_workers = 5
------------------------------

If I run:

    GALAXY_RUN_ALL=1 sh run.sh --daemon

I intermittently see the following in the paster log for any of the above services (the example below is web1, but I have seen this for manager and web0 as well).  The traceback and error are the same in all cases ('File exists: /home/a-m/galaxy/dist-database/tmp/work_tmp'):

------------------------------
galaxy.tool_shed.tool_shed_registry DEBUG 2012-07-30 11:40:10,194 Loading references to tool sheds from tool_sheds_conf.xml
galaxy.tool_shed.tool_shed_registry DEBUG 2012-07-30 11:40:10,194 Loaded reference to tool shed: Galaxy main tool shed
galaxy.tool_shed.tool_shed_registry DEBUG 2012-07-30 11:40:10,194 Loaded reference to tool shed: Galaxy test tool shed
galaxy.model.migrate.check DEBUG 2012-07-30 11:40:10,650 psycopg2 egg successfully loaded for postgres dialect
galaxy.model.migrate.check INFO 2012-07-30 11:40:10,845 At database version 103
galaxy.tool_shed.migrate.check DEBUG 2012-07-30 11:40:10,940 psycopg2 egg successfully loaded for postgres dialect
galaxy.tool_shed.migrate.check INFO 2012-07-30 11:40:10,986 At migrate_tools version 3
galaxy.model.custom_types DEBUG 2012-07-30 11:40:10,994 psycopg2 egg successfully loaded for postgres dialect
Traceback (most recent call last):
  File "/home/a-m/galaxy/galaxy-dist/lib/galaxy/web/buildapp.py", line 82, in app_factory
    app = UniverseApplication( global_conf = global_conf, **kwargs )
  File "/home/a-m/galaxy/galaxy-dist/lib/galaxy/app.py", line 66, in __init__
    self.installed_repository_manager.load_proprietary_datatypes()
  File "/home/a-m/galaxy/galaxy-dist/lib/galaxy/tool_shed/__init__.py", line 47, in load_proprietary_datatypes
    installed_repository_dict = galaxy.util.shed_util.load_installed_datatypes( self.app, tool_shed_repository, relative_install_dir )
  File "/home/a-m/galaxy/galaxy-dist/lib/galaxy/util/shed_util.py", line 1269, in load_installed_datatypes
    work_dir = make_tmp_directory()
  File "/home/a-m/galaxy/galaxy-dist/lib/galaxy/util/shed_util.py", line 1305, in make_tmp_directory
    os.makedirs( work_dir )
  File "/usr/lib64/python2.6/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 17] File exists: '/home/a-m/galaxy/dist-database/tmp/work_tmp'
Removing PID file web1.pid
------------------------------

I was also seeing this using separate runner/webapp ini files and 'run_multiple_processes.sh --daemon', but we decided to go ahead and migrate over to a unified universe_wsgi.ini file.

Anyway, we found a workaround: rerunning 'GALAXY_RUN_ALL=1 sh run.sh --daemon' skips any services that are already running and starts the ones that failed. I'm still curious whether anyone else has seen this and whether there is a fix (or maybe an added config setting we are missing?).
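
For what it's worth, the traceback looks consistent with several processes starting at once and all trying to create the same work_tmp directory, so that whichever one loses gets Errno 17 from os.makedirs. I have not checked what make_tmp_directory does before that call, so the following is only a minimal sketch of a creation step that tolerates the directory already existing; the function name make_dir_tolerant is hypothetical and not part of Galaxy:

```python
import errno
import os


def make_dir_tolerant(path):
    """Create path (and any missing parents).

    Hypothetical helper, not Galaxy code: if another process has
    already created the directory, treat that as success.
    """
    try:
        os.makedirs(path)
    except OSError as e:
        # Ignore only the "already exists" case, and only when the
        # existing entry really is a directory; re-raise anything else
        # (permissions, a plain file in the way, and so on).
        if e.errno != errno.EEXIST or not os.path.isdir(path):
            raise
```

With something like this, a second call on the same path is a no-op rather than an error, so the order in which web0, web1 and manager reach that point would not matter. Whether that is the right fix depends on how the directory is used afterwards, e.g. if each process also removes it when done, a shared path could still collide.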

chris

Fields, Christopher J
