[galaxy-dev] Galaxy Server Processes Dying?

8 Jul 2013

Hi, Galaxy Developers,

I have what I'm hoping is a fairly simple inquiry for the Galaxy community: our production Galaxy server processes appear to be dying off over time.  Our production Galaxy instance uses Apache web scaling features, so I run a number of server processes. For example, my Apache configuration has:

BalancerMember http://127.0.0.1:8080
BalancerMember http://127.0.0.1:8081
BalancerMember http://127.0.0.1:8082
BalancerMember http://127.0.0.1:8083
BalancerMember http://127.0.0.1:8084
BalancerMember http://127.0.0.1:8085

Nothing unconventional, as I understand it.  Similarly, my Galaxy config has matching configuration blocks ([server:ws2], [server:ws3], and so on) for each of these processes.  When I restart Galaxy, everything is fine: I see a server listening on each of these ports (if I run something like lsof -i TCP -P, for example).  What appears to be happening is that, for whatever reason, these server processes die off over time (i.e., eventually nothing is listening on ports 8080-8085).  This can take days, and once no servers are available, Apache begins returning 503 Service Unavailable errors.  I am fairly confident the process is gradual; for example, I just checked and Galaxy was still available, but one server had died (the one on TCP port 8082).

I do have a single separate job manager and two job handlers; at this point I believe the problem is limited to the web servers (i.e., the job manager and job handlers do not appear to be crashing).
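
The manual check described above (is anything still listening on each balancer port?) could be scripted roughly as follows. This is only a sketch, assuming a Python 3 interpreter is available on the Galaxy host; the port range comes from the BalancerMember lines, while the function names is_listening and dead_ports are made up for illustration and are not part of Galaxy.

```python
import socket

# Ports taken from the BalancerMember lines in the Apache configuration.
GALAXY_PORTS = range(8080, 8086)


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def dead_ports(host="127.0.0.1", ports=GALAXY_PORTS):
    """Return the ports on which nothing accepted a connection."""
    return [port for port in ports if not is_listening(host, port)]


if __name__ == "__main__":
    print("not listening:", dead_ports())
```

Run from cron, something like this would at least show when a listener disappears, well before Apache starts returning 503s.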

Now, I believe that late last week I might have 'caught' the last server process dying, just by coincidence, although I am not 100% certain.  Here is the Traceback as it occurred:

galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:47:12,011 (6822/39485.sc01) PBS job state changed from Q to R
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:36,565 (6822/39485.sc01) PBS job state changed from R to C
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:36,566 (6822/39485.sc01) PBS job has completed successfully
galaxy.jobs DEBUG 2013-07-02 08:54:36,685 Tool did not define exit code or stdio handling; checking stderr for success
galaxy.datatypes.metadata DEBUG 2013-07-02 08:54:36,812 loading metadata from file for: HistoryDatasetAssociation 6046
galaxy.jobs DEBUG 2013-07-02 08:54:38,153 job 6822 ended
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:49,130 (6812/39473.sc01) PBS job state changed from R to E
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:52,267 (6812/39473.sc01) PBS job state changed from E to C
galaxy.jobs.runners.pbs ERROR 2013-07-02 08:54:52,267 (6812/39473.sc01) PBS job failed: Unknown error: -11
galaxy.jobs.runners ERROR 2013-07-02 08:54:52,267 (unknown) Unhandled exception calling fail_job
Traceback (most recent call last):
  File "/group/galaxy/galaxy-dist/lib/galaxy/jobs/runners/__init__.py", line 58, in run_next
    method(arg)
  File "/group/galaxy/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line 560, in fail_job
    if pbs_job_state.stop_job:
AttributeError: 'AsynchronousJobState' object has no attribute 'stop_job'

Now, I have some questions regarding this issue:

1) It appears to me that, although it is a sub-optimal solution, restarting Galaxy solves this problem (i.e., server processes will be listening again after a restart).  Is it possible, safe, or sane to restart just a single server on a single port?  Ideally I would like to fix whatever is causing my server processes to crash, but I figured it wouldn't hurt to ask this question regardless.
2) Similarly, is it possible to configure Galaxy so that server processes re-spawn automatically?  That is, is this a feature of Galaxy, for example because server processes dying regularly is either a known issue or expected and tolerable (but undesired) behavior?
3) To me, the error messages above aren't very meaningful, other than that the Traceback appears to be PBS-related.  Would anybody be able to comment on the problem above (i.e., have you seen something like this), or on Galaxy server processes dying in general?  I have done some brief searching of the Galaxy mailing list for server crashes and did not find anything suggesting this is a common problem.
4) I am not 100% confident at this point that the Traceback above is what killed the server process.  Does anybody know of a specific string (a literal) I can search for to identify when a server process actually dies?  I have done some basic review of the log data (our Galaxy server generates lots of logs), and "Traceback" does not appear to be a useful string for uniquely identifying a server crash (tracebacks occur too frequently).  I currently have logging configured at DEBUG.
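
On question 4, one way to narrow down the time of death without knowing a log string is to poll the ports and record a timestamp whenever a listener goes away, then look at the Galaxy logs around that time. The following is a rough sketch only, assuming Python 3; the names snapshot, transitions, and watch are hypothetical helpers, not part of Galaxy, and the 60-second interval is arbitrary.

```python
import datetime
import socket
import time

# Ports taken from the BalancerMember lines in the Apache configuration.
PORTS = range(8080, 8086)


def snapshot(host="127.0.0.1", ports=PORTS, timeout=1.0):
    """Map each port to True if something accepts TCP connections on it."""
    state = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                state[port] = True
        except OSError:
            state[port] = False
    return state


def transitions(previous, current, now=None):
    """Return one line per port whose up/down state changed between snapshots."""
    now = now or datetime.datetime.now()
    # Same date/time layout as the Galaxy log lines, minus milliseconds.
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    lines = []
    for port in sorted(current):
        was_up = previous.get(port)
        is_up = current[port]
        if was_up is not None and was_up != is_up:
            lines.append("%s port %d went %s" % (stamp, port, "UP" if is_up else "DOWN"))
    return lines


def watch(interval=60):
    """Poll forever, printing a line each time a port changes state."""
    previous = snapshot()
    while True:
        time.sleep(interval)
        current = snapshot()
        for line in transitions(previous, current):
            print(line, flush=True)
        previous = current


if __name__ == "__main__":
    watch()
```

The timestamps only bound the crash to within one polling interval, but that should be enough to find the relevant stretch of a DEBUG-level log.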

In case this is relevant, I am using the following change set for Galaxy:
...
hg parents
changeset:   9320:47ddf167c9f1
branch:      stable
tag:         tip
user:        Nate Coraor <nate@bx.psu.edu>
date:        Wed May 01 09:50:31 2013 -0400
summary:     Use Galaxy's ErrorMiddleware since Paste's doesn't return start_response.  Fixes downloading tarballs from the Tool Shed when use_debug = false.

I appreciate the time you took to read my email, and any expertise you can provide in helping me troubleshoot this issue.

Dan Sullivan