Hi, Galaxy Developers,
I have what I'm hoping is a fairly simple inquiry for the Galaxy community: our production Galaxy server processes appear to be dying off over time. Our production Galaxy instance uses the Apache web-scaling (load-balancing) configuration, so I have a number of server processes; for example, my Apache configuration has:
BalancerMember http://127.0.0.1:8080
BalancerMember http://127.0.0.1:8081
BalancerMember http://127.0.0.1:8082
BalancerMember http://127.0.0.1:8083
BalancerMember http://127.0.0.1:8084
BalancerMember http://127.0.0.1:8085
Nothing unconventional, as I understand it. Similarly, my Galaxy config has a matching [server:...] block (e.g. [server:ws2], [server:ws3]) for each of these processes; they look roughly like the example below. When I restart Galaxy, everything is fine and good: I'll see a server listening on each of these ports (if I do something like lsof -i TCP -P, for example). What appears to be happening is that, for whatever reason, these server processes die off over time (i.e. eventually nothing is listening on ports 8080-8085). This can take days, and once no servers are available Apache begins throwing 503 Service Unavailable errors. I am fairly confident the die-off is gradual; for example, I just checked now and Galaxy was still available, but one server had died (the one on TCP port 8082). I do have a single separate job manager and two job handlers; at this point I believe this problem is related to the web servers only (i.e. the job manager and job handlers do not appear to be crashing).
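For reference, each of the matching server blocks in my universe_wsgi.ini looks something like the following (the section name and threadpool settings shown here are just illustrative; the port matches one of the BalancerMember entries above):

[server:ws2]
use = egg:Paste#http
port = 8082
host = 127.0.0.1
use_threadpool = true
threadpool_workers = 10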
Now, I believe that late last week I might have 'caught' the last server process dying, just by coincidence, although I am not 100% certain. Here is the Traceback as it occurred:
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:47:12,011 (6822/39485.sc01) PBS job state changed from Q to R
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:36,565 (6822/39485.sc01) PBS job state changed from R to C
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:36,566 (6822/39485.sc01) PBS job has completed successfully
galaxy.jobs DEBUG 2013-07-02 08:54:36,685 Tool did not define exit code or stdio handling; checking stderr for success
galaxy.datatypes.metadata DEBUG 2013-07-02 08:54:36,812 loading metadata from file for: HistoryDatasetAssociation 6046
galaxy.jobs DEBUG 2013-07-02 08:54:38,153 job 6822 ended
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:49,130 (6812/39473.sc01) PBS job state changed from R to E
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:52,267 (6812/39473.sc01) PBS job state changed from E to C
galaxy.jobs.runners.pbs ERROR 2013-07-02 08:54:52,267 (6812/39473.sc01) PBS job failed: Unknown error: -11
galaxy.jobs.runners ERROR 2013-07-02 08:54:52,267 (unknown) Unhandled exception calling fail_job
Traceback (most recent call last):
  File "/group/galaxy/galaxy-dist/lib/galaxy/jobs/runners/__init__.py", line 58, in run_next
    method(arg)
  File "/group/galaxy/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line 560, in fail_job
    if pbs_job_state.stop_job:
AttributeError: 'AsynchronousJobState' object has no attribute 'stop_job'
Now, I have some questions regarding this issue:
1) It appears to me that, although this is a sub-optimal solution, restarting Galaxy solves the problem (i.e. server processes will be listening again after a restart). Is it possible, or safe, or sane, to just restart a single server process on a single port (I have sketched what I mean after these questions)? Ideally I would like to fix whatever is causing my server processes to crash, although I figured it wouldn't hurt to ask this question regardless.
2) Similar to the question above, is it possible to configure Galaxy so that server processes re-spawn automatically (i.e. is this a feature of Galaxy, for example because server processes dying regularly is either a known issue or expected and tolerable, but undesired, behavior)? If not, I have also sketched, after these questions, the kind of external supervision I was considering.
3) To me, the error messages above aren't very meaningful, other than that the traceback appears to be PBS-related. Would anybody be able to comment on the problem above (i.e. have you seen something like this), or comment on Galaxy server processes dying in general? I have done some brief searching of the Galaxy mailing list for server crashes and did not find anything suggesting this is a common problem.
4) I am not 100% confident at this point that the traceback above is what actually killed the server process. Does anybody know of a specific string (a literal) I can search for to identify when a server process actually dies? I have done some basic review of the log data (our Galaxy server generates a lot of logs), and "Traceback" does not uniquely identify a server crash; tracebacks occur far too frequently. I currently have logging configured at DEBUG. As a stopgap, I have also sketched below a simple external check of the listening ports that I was considering.
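To make questions 1, 2 and 4 a little more concrete: for question 1, what I had in mind was stopping and restarting just one [server:...] section with paster, along these lines (the server name, pid file and log file names are just examples; I have not yet tried this against the production instance):

cd /group/galaxy/galaxy-dist
python ./scripts/paster.py serve universe_wsgi.ini --server-name=ws2 --pid-file=ws2.pid --log-file=ws2.log --stop-daemon
python ./scripts/paster.py serve universe_wsgi.ini --server-name=ws2 --pid-file=ws2.pid --log-file=ws2.log --daemon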
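For question 2, if automatic re-spawning is not something Galaxy does itself, I assume I could wrap each server process in something like supervisord; a minimal sketch of what I was imagining (the program name and paths are illustrative, and the process is left in the foreground since supervisord manages it):

[program:galaxy_ws2]
directory = /group/galaxy/galaxy-dist
command = python ./scripts/paster.py serve universe_wsgi.ini --server-name=ws2 --pid-file=ws2.pid --log-file=ws2.log
autostart = true
autorestart = true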
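For question 4, the stopgap I was considering is simply checking the listening ports from cron and logging when one disappears, roughly like this (the log path is just an example):

#!/bin/sh
# Record a timestamp whenever one of the Galaxy web ports stops listening.
for port in 8080 8081 8082 8083 8084 8085; do
    if ! lsof -iTCP:$port -sTCP:LISTEN > /dev/null 2>&1; then
        echo "`date`: nothing listening on port $port" >> /tmp/galaxy_port_watch.log
    fi
done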
In case this is relevant, I am using the following change set for Galaxy:
> hg parents
changeset: 9320:47ddf167c9f1
branch: stable
tag: tip
user: Nate Coraor <nate@bx.psu.edu>
date: Wed May 01 09:50:31 2013 -0400
summary: Use Galaxy's ErrorMiddleware since Paste's doesn't return start_response. Fixes downloading tarballs from the Tool Shed when use_debug = false.
>
I appreciate the time you've taken to read my email, and any expertise you can offer in helping me troubleshoot this issue.
Dan Sullivan