Sure. Main runs 7 galaxy processes, each with 7 worker threads, with nginx in front load balancing between them; the database is PostgreSQL 8.3.3.

How many worker threads are you running? Does the heartbeat log (as modified by you) suggest what each thread is currently doing when you are at 100% CPU?

-- jt

On Jan 20, 2010, at 4:51 PM, Assaf Gordon wrote:
I'm just hypothesizing here, because I have no clue what's going on.
The SQLAlchemy error is a minor one - it is not my main problem. My main problem is that under what I would consider reasonable load (several running jobs and users refreshing histories), my galaxy python process sits at 100% CPU and becomes very unresponsive: switching/listing histories and running jobs (and especially starting workflows) take a very long time, and sometimes the user gets a "PROXY ERROR" from apache, indicating that the proxied backend (the galaxy python process) did not respond.
I suspect the SQLAlchemy problem is somehow triggered by those incidents: because of some disconnect between apache and the python process, the pooled objects are not returned to the pool. Maybe some exception is thrown in the python process, which bypasses a call to SQLAlchemy's remove() method, and somehow garbage collection doesn't catch it.
If your galaxy/database server is configured correctly (unlike mine), then maybe your python process never reaches 100%, connections are never dropped, and the SQLAlchemy error never happens.
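To illustrate the hypothesis (this is my own sketch, not Galaxy's actual code): the per-request scoped_session has to be removed even on error paths, otherwise its checked-out connection is not returned to the QueuePool promptly. Something like:

    # Minimal sketch, not Galaxy's actual code: a thread-local scoped_session
    # whose remove() must run even when the request handler raises, otherwise
    # the checked-out connection lingers instead of going back to the QueuePool.
    from sqlalchemy import create_engine
    from sqlalchemy.orm import scoped_session, sessionmaker

    engine = create_engine("postgres://user:pass@localhost/galaxy_db",  # placeholder URL
                           pool_size=30)
    Session = scoped_session(sessionmaker(bind=engine))

    def handle_request(handler):
        try:
            return handler(Session)   # handler uses the thread-local session
        finally:
            Session.remove()          # close the session, return its connection to the pool

If anything lets a request die without ever reaching that remove(), the pool would drain over time - which would look exactly like the QueuePool timeouts below.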
But as I've said, it's minor. I don't mind increasing the pool size to 1000 connections.
What I'm trying to do is to make my galaxy server responsive again - because it's getting unusable.
I'm almost certain it's a configuration problem, I just don't know where and what to look for.
Would you mind sharing your public server's configuration? Which servers do you use? Postgres/MySQL versions? Configuration files? Anything would be very much appreciated.
-gordon
Kanwei Li wrote, On 01/20/2010 04:37 PM:
From your error it seems that some sqlalchemy sessions are not being flushed. It could be a bug in galaxy, but it makes me wonder why main/test haven't failed in this manner. Anything else standing out in your logs? What is the memory usage, and does it seem to leak?
Kanwei
On Wed, Jan 20, 2010 at 4:25 PM, Assaf Gordon <gordon@cshl.edu> wrote:
Hi Kanwei,
Kanwei Li wrote, On 01/20/2010 04:13 PM:
Python's threads do not span multiple cores/CPUs, so you are correct in only seeing 100%. Maybe you could try a cluster solution and keep a dedicated node for the webserver?

My server has 16 cores, and the avg. load is 2, so the problem is not CPU overload. Running jobs on the cluster should not affect the python process, right? Especially if it is capped at 100%.
My jobs are executed locally (with the local runner), but they do not affect the main galaxy process (and neither do the set-metadata processes, which are now executed externally). It is the main python process which is stuck at 100%, and therefore it doesn't respond to users' HTTP requests - which makes everything very slow.
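As a side note on the "one core" point above - this is just an illustrative sketch, not Galaxy code: CPU-bound threads in CPython are serialized by the GIL, so a multi-threaded process tops out around 100% of a single core no matter how many cores the machine has.

    # Illustrative sketch (not Galaxy code): CPython's GIL serializes
    # CPU-bound bytecode, so adding threads adds no parallelism and the
    # process stays at roughly 100% of one core.
    import threading
    import time

    def burn(n=10000000):
        total = 0
        for i in xrange(n):   # pure-Python busy loop, holds the GIL
            total += i

    def timed(num_threads):
        threads = [threading.Thread(target=burn) for _ in range(num_threads)]
        start = time.time()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return time.time() - start

    # On CPython, 4 threads take roughly 4x as long as 1 thread,
    # because only one thread executes Python bytecode at a time.
    print "1 thread:  %.1fs" % timed(1)
    print "4 threads: %.1fs" % timed(4)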
BTW, I suspect this is somehow related to the SQLAlchemy issue (the "QueuePool limit" error, see full backtrace below); it happens very quickly even with "database_engine_option_pool_size=30" in the INI. I think it happens when my galaxy is stuck: HTTP requests reach the galaxy process, but after 90 seconds the apache proxy returns "proxy error" to the client, and apache drops/cancels the connection to galaxy.
Could this be contributing to leaked SQLAlchemy QueuePool connections?
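For context, the pool-related lines in the INI look roughly like this (pool_size=30 is what I actually set; the other two names are my guess at the corresponding database_engine_option_* keys, based on SQLAlchemy's engine keywords - check the sample universe_wsgi.ini for the exact names):

    # Example values only; pool_size is the setting mentioned above, the other
    # option names assume the database_engine_option_* prefix maps straight
    # onto SQLAlchemy engine keywords (verify against the sample INI).
    database_connection = postgres://user:pass@localhost/galaxy_db
    database_engine_option_pool_size = 30
    database_engine_option_max_overflow = 40
    database_engine_option_pool_timeout = 30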
Thanks, -gordon
======= Full stack-trace of a "queuepool limit" error, which happens very often when my galaxy server gets slow:
Traceback (most recent call last):
  File "/home/gordon/projects/galaxy_prod/lib/galaxy/jobs/runners/local.py", line 46, in run_next
    self.run_job( job_wrapper )
  File "/home/gordon/projects/galaxy_prod/lib/galaxy/jobs/runners/local.py", line 107, in run_job
    if job_wrapper.get_state() not in [ model.Job.states.ERROR, model.Job.states.DELETED ] and self.app.config.set_metadata_externally and job_wrapper.output_paths:
  File "/home/gordon/projects/galaxy_prod/lib/galaxy/jobs/__init__.py", line 471, in get_state
    job = self.sa_session.query( model.Job ).get( self.job_id )
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 507, in get
    return self._get(key, ident)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1500, in _get
    return q.all()[0]
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1267, in all
    return list(self)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1361, in __iter__
    return self._execute_and_instances(context)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1364, in _execute_and_instances
    result = self.session.execute(querycontext.statement, params=self._params, mapper=self._mapper_zero_or_none())
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/session.py", line 754, in execute
    return self.__connection(engine, close_with_result=True).execute(
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/session.py", line 721, in __connection
    return engine.contextual_connect(**kwargs)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/engine/base.py", line 1229, in contextual_connect
    return self.Connection(self, self.pool.connect(), close_with_result=close_with_result, **kwargs)
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 142, in connect
    return _ConnectionFairy(self).checkout()
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 304, in __init__
    rec = self._connection_record = pool.get()
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 161, in get
    return self.do_get()
  File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 628, in do_get
    raise exc.TimeoutError("QueuePool limit of size %d overflow %d reached, connection timed out, timeout %d" % (self.size(), self.overflow(), self._timeout))
TimeoutError: QueuePool limit of size 30 overflow 40 reached, connection timed out, timeout 30
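(For what it's worth, the three numbers in that last line map directly onto SQLAlchemy's pool parameters; a minimal sketch of an engine configured with the same limits - the connection URL is a placeholder:)

    # Sketch only (placeholder URL): the figures in the error above correspond
    # to these create_engine() pool arguments.
    from sqlalchemy import create_engine

    engine = create_engine(
        "postgres://user:pass@localhost/galaxy_db",
        pool_size=30,      # baseline number of pooled connections
        max_overflow=40,   # extra connections allowed beyond pool_size
        pool_timeout=30,   # seconds to wait for a free connection before TimeoutError
    )

So once all 30 + 40 connections are checked out and none come back within 30 seconds, every further query fails exactly like this.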
_______________________________________________ galaxy-dev mailing list galaxy-dev@lists.bx.psu.edu http://lists.bx.psu.edu/listinfo/galaxy-dev