Hi, Ross,

I appreciate you taking the time to answer my inquiry.  I did know about the options that technically should not be set on a production server (e.g. debug), although I have to admit I did not fully understand the implications (i.e. "your Galaxy process may run out of memory if it's serving large files.").  I think what probably happened is that this setting was turned on at some point to troubleshoot a problem and was never subsequently disabled.  I am going to start by disabling the developer settings as you specified, and then, depending on the result, consider upgrading to a newer version of Galaxy.  Again, thank you for taking the time to respond to my email.  I'll try to remember to post back to the mailing list to report my findings.

Dan


On Mon, Jul 8, 2013 at 3:33 AM, Ross <ross.lazarus@gmail.com> wrote:
Hi Dan, 
That's old code. Updating will probably help.
The logging level just takes disk space, but in case you haven't followed http://wiki.galaxyproject.org/Admin/Config/Performance/ProductionServer?action=show&redirect=Admin%2FConfig%2FPerformance: leaving debug = True uncommented used to eventually fill all available server process RAM, AFAIK.
If you haven't already done so, try:
# Debug enables access to various config options useful for development and
# debugging: use_lint, use_profile, use_printdebug and use_interactive.  It
# also causes the files used by PBS/SGE (submission script, output, and error)
# to remain on disk after the job is complete.  Debug mode is disabled if
# commented, but is uncommented by default in the sample config.
# debug = True
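
If you want to double-check what your config actually has, something like this quick sketch might help (purely illustrative; the file path and the [app:main] section name are assumptions based on the stock universe_wsgi.ini sample, so adjust for your install):

# check_debug.py -- minimal sketch (Python 2, as used by Galaxy at the time).
# The config path and section name are assumptions; adjust for your install.
import ConfigParser

CONFIG = "/group/galaxy/galaxy-dist/universe_wsgi.ini"   # assumed location

parser = ConfigParser.SafeConfigParser()
parser.read(CONFIG)

if parser.has_section("app:main") and parser.has_option("app:main", "debug"):
    print "debug is set to: %r" % parser.get("app:main", "debug")
else:
    print "debug is not set (commented out or absent), so debug mode is off"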

Hope this helps?


On Mon, Jul 8, 2013 at 4:26 PM, Dan Sullivan <dansullivan@gmail.com> wrote:
Hi, Galaxy Developers,

I have what I'm hoping is a fairly simple inquiry for the Galaxy community: basically, our production Galaxy server processes appear to be dying off over time.  Our production Galaxy instance uses Apache-based web scaling, so I have a number of server processes; for example, my Apache configuration has:

BalancerMember http://127.0.0.1:8080
BalancerMember http://127.0.0.1:8081
BalancerMember http://127.0.0.1:8082
BalancerMember http://127.0.0.1:8083
BalancerMember http://127.0.0.1:8084
BalancerMember http://127.0.0.1:8085

Nothing unconventional, as I understand it.  Similarly, my Galaxy config has matching [server:ws2], [server:ws3], etc. configuration blocks for each of these processes.  When I restart Galaxy, everything is fine: I'll see a server listening on each of these ports (if I do something like lsof -i TCP -P, for example).  What appears to be happening is that, for whatever reason, these server processes die off over time (i.e. eventually nothing is listening on ports 8080-8085).  This can take days, and once no servers are available, Apache begins throwing 503 Service Unavailable errors.  I am fairly confident the process is gradual; for example, I just checked now and Galaxy was still available, but one server had died (the one on TCP port 8082).  I do have a single separate job manager and two job handlers; at this point I believe this problem is related to the web servers only (i.e. the job manager and job handlers do not appear to be crashing).
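
For reference, here is a rough sketch of the check I've been doing by hand (just a plain socket probe of the balancer ports from the Apache config above; nothing Galaxy-specific, and the host/ports are obviously specific to my setup):

# probe_ports.py -- rough sketch of my manual lsof check: report which of the
# balancer ports still has something listening on it (Python 2).
import socket

PORTS = range(8080, 8086)   # the BalancerMember ports from the Apache config

for port in PORTS:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(2)
    try:
        s.connect(("127.0.0.1", port))
        print "port %d: listening" % port
    except socket.error:
        print "port %d: NOT listening" % port
    finally:
        s.close()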

Now, I believe that late last week I might have 'caught' the last server process dying, just by coincidence, although I am not 100% certain.  Here is the Traceback as it occurred:

galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:47:12,011 (6822/39485.sc01) PBS job state changed from Q to R
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:36,565 (6822/39485.sc01) PBS job state changed from R to C
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:36,566 (6822/39485.sc01) PBS job has completed successfully
galaxy.jobs DEBUG 2013-07-02 08:54:36,685 Tool did not define exit code or stdio handling; checking stderr for success
galaxy.datatypes.metadata DEBUG 2013-07-02 08:54:36,812 loading metadata from file for: HistoryDatasetAssociation 6046
galaxy.jobs DEBUG 2013-07-02 08:54:38,153 job 6822 ended
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:49,130 (6812/39473.sc01) PBS job state changed from R to E
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:52,267 (6812/39473.sc01) PBS job state changed from E to C
galaxy.jobs.runners.pbs ERROR 2013-07-02 08:54:52,267 (6812/39473.sc01) PBS job failed: Unknown error: -11
galaxy.jobs.runners ERROR 2013-07-02 08:54:52,267 (unknown) Unhandled exception calling fail_job
Traceback (most recent call last):
  File "/group/galaxy/galaxy-dist/lib/galaxy/jobs/runners/__init__.py", line 58, in run_next
    method(arg)
  File "/group/galaxy/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line 560, in fail_job
    if pbs_job_state.stop_job:
AttributeError: 'AsynchronousJobState' object has no attribute 'stop_job'

Now, I have some questions regarding this issue:

1) It appears that, although this is a sub-optimal solution, restarting Galaxy solves the problem (i.e. server processes will be listening again after a restart).  Is it possible, or safe, or sane to restart just a single server on a single port?  Ideally I would like to fix whatever is causing the server processes to crash, but I figured it wouldn't hurt to ask this question regardless.
2) Similar to the question above, is it possible to configure Galaxy so that server processes re-spawn automatically (i.e. is this a feature of Galaxy, perhaps because server processes dying periodically is a known issue or expected and tolerable, but undesired, behavior)?
3) To me, the error messages above aren't very meaningful, other than that the traceback appears to be PBS-related.  Would anybody be able to comment on the problem above (i.e. have you seen something like this), or on Galaxy server processes dying in general?  I have done some brief searching of the Galaxy mailing list for server crashes and did not find anything suggesting this is a common problem.
4) I am not 100% confident at this point that the traceback above is what killed the server process.  Does anybody know of a specific string I can search for (a literal) to identify when a server process actually dies?  I have done some basic review of the log data (our Galaxy server generates a lot of logs), and "Traceback" does not appear to be a string that uniquely identifies a server crash (tracebacks occur too frequently).  I currently have logging configured at DEBUG; a rough sketch of the kind of log scan I've been doing is included just below.
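
Here is that sketch, for what it's worth (the log path and the candidate strings are only my guesses, not confirmed crash markers):

# scan_log.py -- rough sketch: count a few candidate "something went badly
# wrong" strings in a server log (Python 2).  Path and markers are guesses.
LOG = "/group/galaxy/galaxy-dist/paster.log"   # assumed location

MARKERS = ["Traceback (most recent call last)", "Unhandled exception"]

counts = dict((m, 0) for m in MARKERS)
with open(LOG) as fh:
    for line in fh:
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1

for marker in MARKERS:
    print "%-40s %d" % (marker, counts[marker])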

In case this is relevant, I am using the following change set for Galaxy:
> hg parents
changeset:   9320:47ddf167c9f1
branch:      stable
tag:         tip
user:        Nate Coraor <nate@bx.psu.edu>
date:        Wed May 01 09:50:31 2013 -0400
summary:     Use Galaxy's ErrorMiddleware since Paste's doesn't return start_response.  Fixes downloading tarballs from the Tool Shed when use_debug = false.
>

I appreciate the time you took in reading my email, and any expertise you could provide in helping me troubleshoot this issue.

Dan Sullivan