Hi Dan,

That's old code; updating will probably help. The logging level mostly just costs disk space, but in case you haven't followed http://wiki.galaxyproject.org/Admin/Config/Performance/ProductionServer?action=show&redirect=Admin%2FConfig%2FPerformance yet: leaving debug = True uncommented used to eventually fill all available server process RAM, AFAIK. If not already done, try commenting it out:

# Debug enables access to various config options useful for development and
# debugging: use_lint, use_profile, use_printdebug and use_interactive. It
# also causes the files used by PBS/SGE (submission script, output, and error)
# to remain on disk after the job is complete. Debug mode is disabled if
# commented, but is uncommented by default in the sample config.
# debug = True

Hope this helps!

On Mon, Jul 8, 2013 at 4:26 PM, Dan Sullivan <dansullivan@gmail.com> wrote:
Hi, Galaxy Developers,
I have what I'm hoping is a fairly simple inquiry for the Galaxy community: basically, our production Galaxy server processes appear to be dying off over time. Our production Galaxy instance uses the Apache web scaling setup, so I have a number of server processes; for example, my Apache configuration has:
BalancerMember http://127.0.0.1:8080
BalancerMember http://127.0.0.1:8081
BalancerMember http://127.0.0.1:8082
BalancerMember http://127.0.0.1:8083
BalancerMember http://127.0.0.1:8084
BalancerMember http://127.0.0.1:8085
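(For completeness, these BalancerMember lines sit inside a standard mod_proxy_balancer block, roughly like the sketch below; the balancer name "galaxy" is just illustrative, not necessarily what I use:)

<Proxy balancer://galaxy>
    BalancerMember http://127.0.0.1:8080
    # ... remaining BalancerMember lines as above ...
</Proxy>
ProxyPass / balancer://galaxy/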
Nothing unconventional, as I understand it. Similarly, my Galaxy config has matching [server:ws3], [server:ws2] configuration blocks for each of these processes. When I restart Galaxy, everything is fine and good: I'll see a server listening on each one of these ports (if I do something like lsof -i TCP -P, for example). What appears to be happening is that, for whatever reason, these server processes seem to die off over time (i.e. eventually nothing is listening on ports 8080-8085). This process can take days, and by the time no servers are available, Apache will begin throwing 503 Service Unavailable errors. I am fairly confident this process is gradual; for example, I just checked now and Galaxy was still available, but one server had died (the one on TCP port 8082). I do have a single separate job manager and two job handlers; at this point I believe this problem is limited to the web servers only (i.e. the job manager and job handlers do not appear to be crashing).
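For reference, each of the [server:...] sections mentioned above follows the usual pattern from the Galaxy scaling documentation; a rough sketch (the section name, port, and thread settings here are illustrative, not my exact values):

[server:ws2]
use = egg:Paste#http
port = 8082
host = 127.0.0.1
use_threadpool = true
threadpool_workers = 10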
Now, I believe that late last week I might have 'caught' the last server process dying, just by coincidence, although I am not 100% certain. Here is the Traceback as it occurred:
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:47:12,011 (6822/39485.sc01) PBS job state changed from Q to R
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:36,565 (6822/39485.sc01) PBS job state changed from R to C
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:36,566 (6822/39485.sc01) PBS job has completed successfully
galaxy.jobs DEBUG 2013-07-02 08:54:36,685 Tool did not define exit code or stdio handling; checking stderr for success
galaxy.datatypes.metadata DEBUG 2013-07-02 08:54:36,812 loading metadata from file for: HistoryDatasetAssociation 6046
galaxy.jobs DEBUG 2013-07-02 08:54:38,153 job 6822 ended
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:49,130 (6812/39473.sc01) PBS job state changed from R to E
galaxy.jobs.runners.pbs DEBUG 2013-07-02 08:54:52,267 (6812/39473.sc01) PBS job state changed from E to C
galaxy.jobs.runners.pbs ERROR 2013-07-02 08:54:52,267 (6812/39473.sc01) PBS job failed: Unknown error: -11
galaxy.jobs.runners ERROR 2013-07-02 08:54:52,267 (unknown) Unhandled exception calling fail_job
Traceback (most recent call last):
  File "/group/galaxy/galaxy-dist/lib/galaxy/jobs/runners/__init__.py", line 58, in run_next
    method(arg)
  File "/group/galaxy/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line 560, in fail_job
    if pbs_job_state.stop_job:
AttributeError: 'AsynchronousJobState' object has no attribute 'stop_job'
Now, I have some questions regarding this issue:
1) It appears to me that, although this is a sub-optimal solution, restarting Galaxy solves this problem (i.e. server processes will be listening again after restarting Galaxy). Is it possible, or safe, or sane to just restart a single server on a single port (see the command sketch after this list)? Ideally I would like to fix the problem that is causing my server processes to crash, although I figured it wouldn't hurt to ask this question regardless.

2) Similar to the question above, is it possible to configure Galaxy so that server processes re-spawn automatically (i.e. is this a feature of Galaxy, for example because server processes dying regularly is either a known issue or expected and tolerable, but undesired, behavior)?

3) To me, the error messages above aren't very meaningful, other than that the traceback appears to be PBS-related. Would anybody be able to comment on the problem above (i.e. have you seen something like this), or on Galaxy server processes dying in general? I have done some brief searching of the Galaxy mailing list for server crashes and did not find anything suggesting this is a common problem.

4) I am not 100% confident at this point that the traceback above is what killed the server process. Does anybody know of a specific string (a literal) I can search for to identify when a server process actually dies? I have done some basic review of log data (our Galaxy server generates lots of logs), and "Traceback" does not appear to be a string that uniquely identifies a server crash (tracebacks occur too frequently). I currently have logging configured at DEBUG.
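(The command sketch referenced in question 1: I am guessing a single-section restart would look roughly like the following, calling paster directly rather than run.sh. The section name, pid file, and log file names are placeholders, and I have not verified this against our setup.)

# stop the ws2 section if anything is still running, then restart only that section
python ./scripts/paster.py serve universe_wsgi.ini --server-name=ws2 \
    --pid-file=ws2.pid --log-file=ws2.log --stop-daemon
python ./scripts/paster.py serve universe_wsgi.ini --server-name=ws2 \
    --pid-file=ws2.pid --log-file=ws2.log --daemon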
In case this is relevant, I am using the following changeset for Galaxy:

hg parents
changeset:   9320:47ddf167c9f1
branch:      stable
tag:         tip
user:        Nate Coraor <nate@bx.psu.edu>
date:        Wed May 01 09:50:31 2013 -0400
summary:     Use Galaxy's ErrorMiddleware since Paste's doesn't return start_response. Fixes downloading tarballs from the Tool Shed when use_debug = false.
I appreciate the time you took to read my email, and any expertise you can provide to help me troubleshoot this issue.
Dan Sullivan