From your error it seems that some SQLAlchemy sessions are not being flushed. It could be a bug in Galaxy, but it makes me wonder why main/test haven't failed in this manner. Does anything else stand out in your logs? What is the memory usage, and does it seem to leak?
Kanwei
On Wed, Jan 20, 2010 at 4:25 PM, Assaf Gordon <gordon@cshl.edu> wrote:
Hi Kanwei,
Kanwei Li wrote, On 01/20/2010 04:13 PM:
Python's threads do not span multiple cores/CPUs, so you are correct in only seeing 100%. Maybe you could try a cluster solution and keep a dedicated node for the webserver?
My server has 16 cores, and the average load is 2, so the problem is not CPU overload. Running jobs on the cluster should not affect the Python process, right? Especially if it is capped at 100%.
My jobs are executed locally (with the local runner), but they do not affect the main Galaxy process, and neither do the set-metadata processes, which are now executed externally. It is the main Python process which is stuck at 100%, and therefore it doesn't respond to users' HTTP requests, which makes everything very slow.
BTW, I suspect this is somehow related to SQLAlchemy (the "QueuePool limit" error, see full backtrace below). It happens very quickly even with "database_engine_option_pool_size=30" in the INI. I think it happens when my Galaxy is stuck: HTTP requests reach the Galaxy process, but after 90 seconds the Apache proxy returns "proxy error" to the client, and Apache drops/cancels the connection to Galaxy.
Could this be contributing to leaked SQLAlchemy QueuePool connections?
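To make the question concrete, here is a minimal sketch of how connections that are checked out and never returned would exhaust a bounded pool. It is a toy model built on the standard library, not SQLAlchemy's QueuePool and not Galaxy code: TinyPool, PoolTimeout, leaky_request, tidy_request and count_served are all hypothetical names, and the small size/overflow/timeout values are chosen only so it runs quickly.

```python
import threading


class PoolTimeout(Exception):
    """Raised when no connection slot frees up within the timeout."""


class TinyPool:
    """Toy bounded pool (hypothetical; only mimics size + overflow + timeout)."""

    def __init__(self, size, max_overflow, timeout):
        self.size = size
        self.max_overflow = max_overflow
        self.timeout = timeout
        # At most size + max_overflow connections may be checked out at once.
        self._slots = threading.Semaphore(size + max_overflow)

    def checkout(self):
        # Wait up to `timeout` seconds for a free slot, then give up.
        if not self._slots.acquire(timeout=self.timeout):
            raise PoolTimeout(
                "pool limit of size %d overflow %d reached, timeout %s"
                % (self.size, self.max_overflow, self.timeout)
            )
        return object()  # placeholder for a real database connection

    def checkin(self, conn):
        # Returning a connection frees its slot for the next caller.
        self._slots.release()


def leaky_request(pool):
    # Simulates a request abandoned midway: the connection is never returned.
    pool.checkout()


def tidy_request(pool):
    # Simulates a request that always returns its connection.
    conn = pool.checkout()
    try:
        pass  # the actual query work would go here
    finally:
        pool.checkin(conn)


def count_served(handler, n_requests):
    """Run n_requests against a fresh pool; stop at the first timeout."""
    pool = TinyPool(size=3, max_overflow=2, timeout=0.05)
    served = 0
    for _ in range(n_requests):
        try:
            handler(pool)
            served += 1
        except PoolTimeout:
            break
    return served


if __name__ == "__main__":
    print("leaky handler served:", count_served(leaky_request, 10))
    print("tidy handler served:", count_served(tidy_request, 10))
```

In this model the leaky handler stops being served once size + max_overflow slots are gone, while the tidy handler is never blocked. Whether dropped proxy connections actually leave sessions open in Galaxy is exactly the open question; the sketch only shows why such a leak, if present, would produce a timeout of this shape.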
Thanks, -gordon
======= Full stack-trace of a "queuepool limit" error, which happens very often when my galaxy server gets slow:
Traceback (most recent call last):
  File "/home/gordon/projects/galaxy_prod/lib/galaxy/jobs/runners/local.py", line 46, in run_next
    self.run_job( job_wrapper )
  File "/home/gordon/projects/galaxy_prod/lib/galaxy/jobs/runners/local.py", line 107, in run_job
    if job_wrapper.get_state() not in [ model.Job.states.ERROR, model.Job.states.DELETED ] and self.app.config.set_metadata_externally and job_wrapper.output_paths:
  File "/home/gordon/projects/galaxy_prod/lib/galaxy/jobs/__init__.py", line 471, in get_state
    job = self.sa_session.query( model.Job ).get( self.job_id )
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 507, in get
    return self._get(key, ident)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1500, in _get
    return q.all()[0]
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1267, in all
    return list(self)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1361, in __iter__
    return self._execute_and_instances(context)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1364, in _execute_and_instances
    result = self.session.execute(querycontext.statement, params=self._params, mapper=self._mapper_zero_or_none())
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/session.py", line 754, in execute
    return self.__connection(engine, close_with_result=True).execute(
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/session.py", line 721, in __connection
    return engine.contextual_connect(**kwargs)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/engine/base.py", line 1229, in contextual_connect
    return self.Connection(self, self.pool.connect(), close_with_result=close_with_result, **kwargs)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 142, in connect
    return _ConnectionFairy(self).checkout()
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 304, in __init__
    rec = self._connection_record = pool.get()
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 161, in get
    return self.do_get()
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 628, in do_get
    raise exc.TimeoutError("QueuePool limit of size %d overflow %d reached, connection timed out, timeout %d" % (self.size(), self.overflow(), self._timeout))
TimeoutError: QueuePool limit of size 30 overflow 40 reached, connection timed out, timeout 30