Hi Kanwei,

Kanwei Li wrote, On 01/20/2010 04:13 PM:
> Python's threads do not span over multiple cores/CPU's, so you are correct in only seeing 100%. Maybe you could try a cluster solution and keep a dedicated node for the webserver?

My server has 16 cores, and the avg. load is 2, so the problem is not CPU overload. Running jobs on the cluster should not affect the Python process, right? Especially since it is capped at 100%.

My jobs are executed locally (with the local runner), but they do not affect the main Galaxy process (and neither do the set-metadata processes, which are now executed externally). It is the main Python process which is stuck at 100%, and therefore it doesn't respond to users' HTTP requests, which makes everything very slow.

BTW, I suspect this is somehow related to SQLAlchemy (the "QueuePool limit" error, see full backtrace below). It happens very quickly, even with "database_engine_option_pool_size=30" in the INI.

I think it happens when my Galaxy is stuck: HTTP requests reach the Galaxy process, but after 90 seconds the Apache proxy returns "proxy error" to the client, and Apache drops/cancels the connection to Galaxy. Could this be contributing to leaked SQLAlchemy QueuePool connections?

Thanks,
 -gordon

=======
Full stack-trace of a "QueuePool limit" error, which happens very often when my Galaxy server gets slow:

Traceback (most recent call last):
  File "/home/gordon/projects/galaxy_prod/lib/galaxy/jobs/runners/local.py", line 46, in run_next
    self.run_job( job_wrapper )
  File "/home/gordon/projects/galaxy_prod/lib/galaxy/jobs/runners/local.py", line 107, in run_job
    if job_wrapper.get_state() not in [ model.Job.states.ERROR, model.Job.states.DELETED ] and self.app.config.set_metadata_externally and job_wrapper.output_paths:
  File "/home/gordon/projects/galaxy_prod/lib/galaxy/jobs/__init__.py", line 471, in get_state
    job = self.sa_session.query( model.Job ).get( self.job_id )
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 507, in get
    return self._get(key, ident)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1500, in _get
    return q.all()[0]
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1267, in all
    return list(self)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1361, in __iter__
    return self._execute_and_instances(context)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1364, in _execute_and_instances
    result = self.session.execute(querycontext.statement, params=self._params, mapper=self._mapper_zero_or_none())
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/session.py", line 754, in execute
    return self.__connection(engine, close_with_result=True).execute(
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/session.py", line 721, in __connection
    return engine.contextual_connect(**kwargs)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/engine/base.py", line 1229, in contextual_connect
    return self.Connection(self, self.pool.connect(), close_with_result=close_with_result, **kwargs)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 142, in connect
    return _ConnectionFairy(self).checkout()
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 304, in __init__
    rec = self._connection_record = pool.get()
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 161, in get
    return self.do_get()
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 628, in do_get
    raise exc.TimeoutError("QueuePool limit of size %d overflow %d reached, connection timed out, timeout %d" % (self.size(), self.overflow(), self._timeout))
TimeoutError: QueuePool limit of size 30 overflow 40 reached, connection timed out, timeout 30
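
=======
For illustration only, the failure mode suspected above (connections checked out of a bounded pool and never returned, until the next checkout times out) can be sketched with the standard library alone. This is not Galaxy or SQLAlchemy code: `TinyPool`, `PoolTimeout`, and `handle_request` are hypothetical names, the pool pre-creates all of its slots (a simplification), the sizes are scaled down, and it is written for modern Python rather than the py2.5 eggs in the traceback.

```python
import queue


class PoolTimeout(Exception):
    """Raised when no connection becomes free within the timeout."""


class TinyPool:
    """Hypothetical bounded pool allowing at most size + max_overflow checkouts."""

    def __init__(self, size, max_overflow, timeout):
        self._timeout = timeout
        self._slots = queue.Queue()
        # Simplification: create every slot up front instead of lazily.
        for n in range(size + max_overflow):
            self._slots.put("conn-%d" % n)

    def checkout(self):
        # Block until a slot is free, or give up after the timeout.
        try:
            return self._slots.get(timeout=self._timeout)
        except queue.Empty:
            raise PoolTimeout("pool exhausted after %.2fs" % self._timeout)

    def checkin(self, conn):
        self._slots.put(conn)


def handle_request(pool, leak):
    """Simulate one request; leak=True models a request whose connection is never returned."""
    conn = pool.checkout()
    if not leak:
        pool.checkin(conn)


pool = TinyPool(size=3, max_overflow=4, timeout=0.05)

# Well-behaved requests can run indefinitely: each one returns its slot.
for _ in range(100):
    handle_request(pool, leak=False)

# Seven abandoned requests consume all 3 + 4 slots.
for _ in range(7):
    handle_request(pool, leak=True)

# The next request, even a well-behaved one, now times out.
try:
    handle_request(pool, leak=False)
    exhausted = False
except PoolTimeout:
    exhausted = True
```

If something like this is what is happening, raising the pool size would only delay the error, which would be consistent with it still appearing at pool_size=30. Whether the dropped proxy connections are really what leaves connections checked out is still an open question.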