We use track_jobs_in_database = True for all 7 processes, and then for one process:

enable_job_running = True
enable_job_recovery = True

with both options set to False for the other 6 processes. This creates one "job runner" application, which is not proxied and so does not serve web requests, and 6 processes that just serve web content (a concrete sketch of this split appears a bit further down). You'll probably want to do this as well, since you're running jobs locally: a single job runner gives you better control over the number of allowed running jobs.

enable_job_recovery, btw, does nothing when using the local runner - since locally run jobs lose their parent process when the application is restarted, they cannot be recovered.

--nate

Assaf Gordon wrote:
I run 13 workers, but only a single main process.
So you're using it with "track_jobs_in_db = false" and "job_recovery = false", and letting the seven processes handle the load?
That's very interesting, I didn't think of that option...
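For concreteness, the job-runner/web split Nate describes at the top of the thread might look roughly like this in per-process INI files. The three options and their True/False values are his; the file names and the [app:main] section header are illustrative guesses, not his actual files:

# runner.ini - the single "job runner" process, not proxied to the web
[app:main]
track_jobs_in_database = True
enable_job_running = True
enable_job_recovery = True

# web.ini - each of the 6 web-serving processes behind the proxy
[app:main]
track_jobs_in_database = True
enable_job_running = False
enable_job_recovery = False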
Is any other configuration required to run parallel galaxy processes? I guess I could try it with apache's mod_proxy_balancer...
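For what it's worth, a minimal mod_proxy_balancer front end for this kind of multi-process setup could be sketched like this (the balancer name and ports are illustrative, not an existing configuration):

# requires mod_proxy, mod_proxy_http and mod_proxy_balancer to be loaded
<Proxy balancer://galaxy>
    BalancerMember http://localhost:8080
    BalancerMember http://localhost:8081
    # ... one BalancerMember per Galaxy web process
</Proxy>
ProxyPass / balancer://galaxy/
ProxyPassReverse / balancer://galaxy/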
James Taylor wrote, On 01/20/2010 05:28 PM:
Sure,
Main runs 7 galaxy processes, each with 7 worker threads, nginx in front load balancing between them, database is postgres 8.3.3
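As a rough illustration of that kind of nginx front end (ports and the upstream name here are illustrative, not the actual Main configuration):

upstream galaxy_web {
    # one entry per Galaxy web process
    server localhost:8080;
    server localhost:8081;
    server localhost:8082;
}
server {
    listen 80;
    location / {
        proxy_pass http://galaxy_web;
        proxy_set_header X-Forwarded-Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}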
How many worker threads are you running?
Does the heartbeat log (as modified by you) suggest what each thread is currently doing when you are at 100% cpu?
-- jt
On Jan 20, 2010, at 4:51 PM, Assaf Gordon wrote:
I'm just hypothesizing here, because I have no clue what's going on.
The SQLAlchemy error is a minor one - it is not my main problem. My main problem is that under what I would consider reasonable load (several running jobs and users refreshing histories), my galaxy python process sits at 100% CPU and becomes very unresponsive: switching/listing histories, running jobs (and especially starting workflows) take a very long time, and sometimes the user gets a "PROXY ERROR" from apache, indicating that the proxied galaxy python process did not respond.
I suspect the SQLAlchemy problem is somehow triggered by those incidents: because of some disconnect between apache and the python process, the pooled objects are not returned to the pool - maybe some exception is thrown in the python process which bypasses the call to SQLAlchemy's remove() method, and somehow garbage collection doesn't catch it.
If your galaxy/database server is configured correctly (unlike mine), then maybe your python process never goes to 100%, so connections are never dropped and the SQLAlchemy error never happens.
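To illustrate the leak being hypothesized, here is a generic, self-contained sketch (not Galaxy's actual request handling; a throwaway SQLite database stands in for postgres):

from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine("sqlite:///:memory:")
Session = scoped_session(sessionmaker(bind=engine))

def handle_request(do_work):
    """Run one unit of work and always return its connection to the pool."""
    try:
        do_work(Session())   # any query here checks a connection out of the pool
    finally:
        # The hypothesis above: if an exception path skips this remove(),
        # the thread-local session keeps its connection checked out, and the
        # pool eventually hits its pool_size + overflow limit.
        Session.remove()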
But as I've said, it's minor. I don't mind increasing the pool to 1000 connections.
What I'm trying to do is to make my galaxy server responsive again - because it's getting unusable.
I'm almost certain it's a configuration problem, I just don't know where and what to look for.
Would you mind sharing your public configuration? Which servers you use? Postgres/MySQL versions? Configuration files? Anything would be very much appreciated.
-gordon
Kanwei Li wrote, On 01/20/2010 04:37 PM:
From your error it seems that some sqlalchemy sessions are not being flushed. It could be a bug in galaxy, but it makes me wonder why main/test haven't failed in this manner. Anything else standing out in your logs? What is the memory usage, and does it seem to leak?
Kanwei
On Wed, Jan 20, 2010 at 4:25 PM, Assaf Gordon <gordon@cshl.edu> wrote:
Hi Kanwei,
Kanwei Li wrote, On 01/20/2010 04:13 PM:
Python's threads do not span multiple cores/CPUs, so you are correct in only seeing 100%. Maybe you could try a cluster solution and keep a dedicated node for the webserver?

My server has 16 cores and the average load is 2, so the problem is not CPU overload. Running jobs on the cluster should not affect the python process, right? Especially if it is capped at 100%.
My jobs are executed locally (with the local runner), but they do not affect the main galaxy process (and neither do the set-metadata processes, which are now executed externally). It is the main python process that is stuck at 100%, and therefore it doesn't respond to users' HTTP requests - which makes everything very slow.
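As a small, generic illustration of the point quoted above about Python threads (CPython only): two CPU-bound threads in one process still add up to roughly one core, because the GIL lets only one thread run Python bytecode at a time.

import threading
import time

def burn(n=10**7):
    # pure-Python CPU work, so the GIL is never released for long
    x = 0
    for i in range(n):
        x += i * i
    return x

start = time.time()
threads = [threading.Thread(target=burn) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("two CPU-bound threads took %.1fs; 'top' shows the process near 100%% CPU"
      % (time.time() - start))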
BTW, I suspect this is somehow related to the SQLAlchemy issue (the "QueuePool limit" error - see the full backtrace below). It happens very quickly even with database_engine_option_pool_size = 30 in the INI. I think it happens when my galaxy is stuck: HTTP requests reach the galaxy process, but after 90 seconds the apache proxy returns a "proxy error" to the client and drops/cancels the connection to galaxy.
Could this be contributing to leaked SQLAlchemy QueuePool connections?
Thanks, -gordon
======= Full stack-trace of a "queuepool limit" error, which happens very often when my galaxy server gets slow:
Traceback (most recent call last):
  File "/home/gordon/projects/galaxy_prod/lib/galaxy/jobs/runners/local.py", line 46, in run_next
    self.run_job( job_wrapper )
  File "/home/gordon/projects/galaxy_prod/lib/galaxy/jobs/runners/local.py", line 107, in run_job
    if job_wrapper.get_state() not in [ model.Job.states.ERROR, model.Job.states.DELETED ] and self.app.config.set_metadata_externally and job_wrapper.output_paths:
  File "/home/gordon/projects/galaxy_prod/lib/galaxy/jobs/__init__.py", line 471, in get_state
    job = self.sa_session.query( model.Job ).get( self.job_id )
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 507, in get
    return self._get(key, ident)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1500, in _get
    return q.all()[0]
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1267, in all
    return list(self)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1361, in __iter__
    return self._execute_and_instances(context)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1364, in _execute_and_instances
    result = self.session.execute(querycontext.statement, params=self._params, mapper=self._mapper_zero_or_none())
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/session.py", line 754, in execute
    return self.__connection(engine, close_with_result=True).execute(
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/session.py", line 721, in __connection
    return engine.contextual_connect(**kwargs)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/engine/base.py", line 1229, in contextual_connect
    return self.Connection(self, self.pool.connect(), close_with_result=close_with_result, **kwargs)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 142, in connect
    return _ConnectionFairy(self).checkout()
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 304, in __init__
    rec = self._connection_record = pool.get()
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 161, in get
    return self.do_get()
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 628, in do_get
    raise exc.TimeoutError("QueuePool limit of size %d overflow %d reached, connection timed out, timeout %d" % (self.size(), self.overflow(), self._timeout))
TimeoutError: QueuePool limit of size 30 overflow 40 reached, connection timed out, timeout 30
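For reference, the size and overflow figures in that last line correspond to engine options set in the Galaxy INI. A sketch, using the numbers from the error above rather than recommended values (database_engine_option_pool_size is the option already mentioned in this thread; the other two lines are assumptions based on SQLAlchemy's create_engine keywords, which Galaxy forwards via the database_engine_option_ prefix):

database_engine_option_pool_size = 30
database_engine_option_max_overflow = 40
database_engine_option_pool_timeout = 30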
_______________________________________________ galaxy-dev mailing list galaxy-dev@lists.bx.psu.edu http://lists.bx.psu.edu/listinfo/galaxy-dev