Me again - my Galaxy server is still slow, and I'm desperately looking for possible solutions. Any help, suggestion or idea (from anyone) will be highly appreciated.

The metadata is now set externally. The database is Postgres; switching to MySQL showed only minor improvement (especially right after a reboot, when all the cache was probably being used by MySQL).

The python process goes to 40% while a few jobs are running, and goes to 100% when a lot of jobs are running and a lot of users are accessing Galaxy (listing their histories, running tools, workflows, etc.).

One strange thing I'm seeing is that the python process never goes above 100% (actually 106%, but I suspect some 'top' summing inaccuracies). Enabling the threads view in top (pressing 'H') shows all the python threads, but their CPU usage *always* sums up to 100% - never higher. I would assume that a truly multi-threaded application can easily go above 100% - each thread should be able to reach 100% independently.

Could it be a problem in my python installation? Maybe it doesn't support threads correctly? Or is this a wild goose chase?

Thanks,
-gordon
Python's threads do not span multiple cores/CPUs, so you are correct in only seeing 100%. Maybe you could try a cluster solution and keep a dedicated node for the webserver? (http://bitbucket.org/galaxy/galaxy-central/wiki/Config/Cluster)

Kanwei
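For context, this 100% ceiling is what CPython's Global Interpreter Lock produces for CPU-bound threads: only one thread executes Python bytecode at a time, no matter how many cores the machine has. A minimal standalone sketch (a hypothetical demo script, not Galaxy code) that reproduces the behaviour described above:

```python
# Hypothetical demo, not Galaxy code: several CPU-bound threads in one
# CPython process. Because of the GIL only one thread runs Python bytecode
# at a time, so in `top` (and in `top -H`) the threads' CPU usage sums to
# roughly 100% of a single core, never more.
import threading

def burn_cpu(n=10 ** 7):
    # Pure-Python arithmetic holds the GIL the whole time.
    total = 0
    i = 0
    while i < n:
        total += i
        i += 1
    return total

threads = [threading.Thread(target=burn_cpu) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```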
Hi Kanwei,

Kanwei Li wrote, On 01/20/2010 04:13 PM:
Python's threads do not span over multiple cores/CPU's, so you are correct in only seeing 100%. Maybe you could try a cluster solution and keep a dedicated node for the webserver?
My server has 16 cores and the average load is 2, so the problem is not CPU overload. Running jobs on the cluster should not affect the python process, right? Especially if it is capped at 100%.

My jobs are executed locally (with the local runner), but they do not affect the main Galaxy process (and neither do the set-metadata processes, which are now executed externally). It is the main python process which is stuck at 100%, and therefore it doesn't respond to users' HTTP requests - which makes everything very slow.

BTW, I suspect this is somehow related to SQLAlchemy (the "QueuePool limit" error, see the full backtrace below); it happens very quickly even with "database_engine_option_pool_size=30" in the INI. I think it happens when my Galaxy is stuck: HTTP requests reach the Galaxy process, but after 90 seconds the apache proxy returns "proxy error" to the client, and apache drops/cancels the connection to Galaxy. Could this be contributing to leaked SQLAlchemy QueuePool connections?

Thanks,
-gordon

======= Full stack-trace of a "QueuePool limit" error, which happens very often when my Galaxy server gets slow:

Traceback (most recent call last):
  File "/home/gordon/projects/galaxy_prod/lib/galaxy/jobs/runners/local.py", line 46, in run_next
    self.run_job( job_wrapper )
  File "/home/gordon/projects/galaxy_prod/lib/galaxy/jobs/runners/local.py", line 107, in run_job
    if job_wrapper.get_state() not in [ model.Job.states.ERROR, model.Job.states.DELETED ] and self.app.config.set_metadata_externally and job_wrapper.output_paths:
  File "/home/gordon/projects/galaxy_prod/lib/galaxy/jobs/__init__.py", line 471, in get_state
    job = self.sa_session.query( model.Job ).get( self.job_id )
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 507, in get
    return self._get(key, ident)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1500, in _get
    return q.all()[0]
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1267, in all
    return list(self)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1361, in __iter__
    return self._execute_and_instances(context)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1364, in _execute_and_instances
    result = self.session.execute(querycontext.statement, params=self._params, mapper=self._mapper_zero_or_none())
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/session.py", line 754, in execute
    return self.__connection(engine, close_with_result=True).execute(
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/session.py", line 721, in __connection
    return engine.contextual_connect(**kwargs)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/engine/base.py", line 1229, in contextual_connect
    return self.Connection(self, self.pool.connect(), close_with_result=close_with_result, **kwargs)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 142, in connect
    return _ConnectionFairy(self).checkout()
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 304, in __init__
    rec = self._connection_record = pool.get()
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 161, in get
    return self.do_get()
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 628, in do_get
    raise exc.TimeoutError("QueuePool limit of size %d overflow %d reached, connection timed out, timeout %d" % (self.size(), self.overflow(), self._timeout))
TimeoutError: QueuePool limit of size 30 overflow 40 reached, connection timed out, timeout 30
From your error it seems that some SQLAlchemy sessions are not being flushed. It could be a bug in Galaxy, but it makes me wonder why main/test haven't failed in this manner. Anything else standing out in your logs? What is the memory usage, and does it seem to leak?

Kanwei
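For reference, the numbers in that TimeoutError correspond to SQLAlchemy's QueuePool parameters: once pool_size plus max_overflow connections are all checked out, the next checkout waits pool_timeout seconds and then fails. A minimal standalone sketch (not Galaxy code; the connection URL is a placeholder):

```python
# Standalone sketch (not Galaxy code) of the three QueuePool knobs named in
# the error above. With pool_size=30 and max_overflow=40 the engine hands
# out at most 70 simultaneous connections; a further request waits
# pool_timeout seconds and then raises the TimeoutError seen in the log.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql://galaxy:secret@localhost/galaxy",  # placeholder DSN
    pool_size=30,     # cf. database_engine_option_pool_size = 30 in the INI
    max_overflow=40,  # extra connections allowed beyond the base pool
    pool_timeout=30,  # seconds to wait for a free connection before failing
)

# A connection that is never returned stays checked out; enough of these and
# the pool is exhausted even though the database itself is idle.
conn = engine.connect()
try:
    conn.execute(text("SELECT 1"))
finally:
    conn.close()  # returns the connection to the pool
```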
I'm just hypothesizing here, because I have no clue what's going on.

The SQLAlchemy error is a minor one - it is not my main problem. My main problem is that under what I would consider to be reasonable load (several running jobs and users refreshing histories), my Galaxy python process is at 100% CPU and thus very unresponsive: switching/listing histories, running jobs (and especially starting workflows) takes a long, long time, and sometimes the user gets a "PROXY ERROR" from apache, indicating that the proxied galaxy python process did not respond.

I suspect the SQLAlchemy problem is somehow triggered by those incidents: because of some disconnect between apache and the python process, the pooled objects are not returned to the pool. Maybe some exception is thrown in the python process which bypasses the call to SQLAlchemy's remove() method, and somehow the garbage collection doesn't catch it.

If your galaxy/database server is configured correctly (unlike mine), then maybe your python process never goes to 100%, so connections are never dropped and the SQLAlchemy error never happens. But as I've said, it's minor. I don't mind increasing the pool to 1000 connections.

What I'm trying to do is to make my Galaxy server responsive again - because it's getting unusable. I'm almost certain it's a configuration problem, I just don't know where and what to look for.

Would you mind sharing your public server's configuration? Which servers do you use? Postgres/MySQL versions? Configuration files? Anything would be very much appreciated.

-gordon
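Purely as an illustration of the failure mode hypothesized above (this is not Galaxy's actual request handling, and the function names are made up): if remove() only runs on the normal exit path, an exception or a dropped proxy connection can leave the session's connection checked out, and enough of those exhaust the pool. A try/finally guarantees the connection is returned either way:

```python
# Illustration only - not Galaxy code. `handle_request` and `work` are
# hypothetical names. The point is that Session.remove() must run on every
# exit path; if an exception (or a client disconnect) skips it, the pooled
# connection stays checked out until the QueuePool limit is hit.
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine("postgresql://galaxy:secret@localhost/galaxy")  # placeholder DSN
Session = scoped_session(sessionmaker(bind=engine))

def handle_request(work):
    session = Session()
    try:
        return work(session)   # may raise, or the client may go away mid-request
    finally:
        Session.remove()       # always close the session and return its connection
```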
Sure,

Main runs 7 galaxy processes, each with 7 worker threads, with nginx in front load balancing between them; the database is postgres 8.3.3.

How many worker threads are you running? Does the heartbeat log (as modified by you) suggest what each thread is currently doing when you are at 100% CPU?

-- jt
I run 13 workers, but only a single main process.

So you're using it with "track_jobs_in_db = false" and "job_recovery=false", and letting the seven processes handle the load? That's very interesting, I didn't think of that option...

Any other configuration required to run parallel galaxy processes? I guess I could try it with apache's mod_proxy_balancer ...
We use track_jobs_in_database = True for all 6, and then for one process:

enable_job_running = True
enable_job_recovery = True

with those two options set to False for the other 6 processes. This creates one "job runner" application, which is not proxied so it does not serve web requests, and 6 processes just to serve web content.

You'll probably want to do this as well, so you can have a single job runner to keep better control over the number of allowed running jobs, since you're running locally.

enable_job_recovery, by the way, does nothing when using the local runner - since the processes lose their parent upon restarting the application, they cannot be recovered.

--nate
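To make the split concrete, the setup described above comes down to something like the following in the Galaxy INI configuration. The three job options are the ones named in this thread; the file names, server sections, ports and thread counts are only illustrative assumptions, so compare against universe_wsgi.ini.sample for your Galaxy version:

```ini
# Illustrative sketch only - the job-control options are quoted from this
# thread; the section layout, ports and threadpool settings are assumptions.

# --- hypothetical universe_wsgi.runner.ini (the single job runner, not proxied) ---
[server:main]
use = egg:Paste#http
port = 8079
threadpool_workers = 5

[app:main]
track_jobs_in_database = True
enable_job_running = True
enable_job_recovery = True

# --- hypothetical universe_wsgi.web.ini (each proxied, load-balanced web process) ---
[server:main]
use = egg:Paste#http
port = 8080
threadpool_workers = 7

[app:main]
track_jobs_in_database = True
enable_job_running = False
enable_job_recovery = False
```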
Hi all,

Nate Coraor wrote, On 01/20/2010 07:36 PM:
This creates one "job runner" application, which is not proxied so it does not serve web requests, and 6 processes just to serve web content.
Kudos for this nice solution! And many thanks for all your help today.

One job runner and two web processes seem to significantly improve responsiveness. My job-runner process is at 6% when no jobs are running, and between 30%-60% when there are running jobs - but that doesn't affect the web responsiveness any more, which is exactly what I was looking for. I'm just wondering whether your runner is also at high CPU usage when there are running jobs, or whether it is a problem in my postgresql database configuration, or maybe something related to the local runner (and doesn't affect servers which use SGE/PBS).

My small contribution will be to add a wiki page: http://bitbucket.org/galaxy/galaxy-central/wiki/Config/WebApplicationScaling

And as always, I have another question: how does reloading a tool work when I have multiple processes running? Only a single process will handle the HTTP request, so do they somehow communicate?

Thanks again,
-gordon
Assaf Gordon wrote:
My job-runner process is at 6% when no jobs are running, and between 30%-60% when there are running jobs - but that doesn't affect the web responsiveness any more - which is exactly what I was looking for. I'm just wondering if your runner is also at high CPU usage when there are running jobs, or is it a problem in my postgresql database configuration, or maybe it's something related to the local-runner (and doesn't affect servers which use SGE/PBS).
It does go up if the runner is tracking a lot of jobs, although I would expect that to be less with the local runner. The heartbeat log may tell you more about where those cycles are going.
My small contribution will be to add a wiki page: http://bitbucket.org/galaxy/galaxy-central/wiki/Config/WebApplicationScaling
And as always, I have another question: How does reloading a tool works when I have multiple processes running ? Only a single process will handle the HTTP request, so do they somehow communicate ?
Unfortunately, reloading a tool will only reload it in the app that the proxy happens to give you for that request. To reload the tool across the entire set of servers, you have to go to each proxied application and reload it. --nate
participants (4): Assaf Gordon, James Taylor, Kanwei Li, Nate Coraor