We use track_jobs_in_database = True for all 7, and then for one
process:
enable_job_running = True
enable_job_recovery = True
and False for those two options for the other 6 processes. This creates
one "job runner" application, which is not proxied so it does not serve
web requests, and 6 processes just to serve web content.
You'll probably want to do this as well, so you have a single job
runner and better control over the number of allowed running jobs,
since you're running jobs locally.
enable_job_recovery, btw, does nothing when using the local runner -
since the job processes lose their parent when the application is
restarted, they cannot be recovered.
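
In config terms, the split boils down to something like this (just a
sketch - the option names are the ones above, and it assumes each
process reads its own copy of the ini file; filenames and ports are up
to you):

    # the single job runner process (not proxied; runs and recovers jobs)
    track_jobs_in_database = True
    enable_job_running = True
    enable_job_recovery = True

    # each of the 6 web processes (proxied; serve web requests only)
    track_jobs_in_database = True
    enable_job_running = False
    enable_job_recovery = False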
--nate
Assaf Gordon wrote:
I run 13 workers, but only a single main process.
So you're using it with "track_jobs_in_db = false" and
"job_recovery=false",
and letting the seven processes handle the load?
That's very interesting, I didn't think of that option...
Any other configuration required to run parallel galaxy processes ?
I guess I could try it with apache's mod_proxy_balancer ...
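Something along these lines, I suppose (just a sketch of a
mod_proxy_balancer setup - the ports and the /galaxy prefix are
placeholders for however the web processes end up being configured,
and mod_proxy, mod_proxy_http and mod_proxy_balancer must be loaded):

    <Proxy balancer://galaxyweb>
        BalancerMember http://127.0.0.1:8080
        BalancerMember http://127.0.0.1:8081
        BalancerMember http://127.0.0.1:8082
    </Proxy>
    ProxyPass        /galaxy balancer://galaxyweb
    ProxyPassReverse /galaxy balancer://galaxyweb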
James Taylor wrote, On 01/20/2010 05:28 PM:
> Sure,
>
> Main runs 7 galaxy processes, each with 7 worker threads, nginx in front
> load balancing between them, database is postgres 8.3.3
>
> How many worker threads are you running?
>
> Does the heartbeat log (as modified by you) suggest what each thread is
> currently doing when you are at 100% cpu?
>
> -- jt
>
> On Jan 20, 2010, at 4:51 PM, Assaf Gordon wrote:
>
>> I'm just hypothesizing here, because I have no clue what's going on.
>>
>> The SqlAlchemy error is a minor one - it is not my main problem.
>> My main problem is that under what I would consider to be reasonable
>> load (several running jobs and users refreshing histories), my galaxy
>> python process is at 100% CPU, and thus is very unresponsive:
>> switching/listing histories, running jobs (and especially starting
>> workflows) take a very long time, and sometimes the user gets a
>> "PROXY ERROR" from apache, indicating that the proxied backend (= the
>> galaxy python process) did not respond.
>>
>> I suspect the SQLAlchemy problem is somehow triggered by those
>> incidents, where because of some disconnect between apache and the
>> python process, the pooled objects are not returned to the pool: maybe
>> some exception is thrown in the python process, which bypasses the call
>> to SQLAlchemy's remove() method, and somehow the garbage collection
>> doesn't catch it.
>>
>> If your galaxy/database server is configured correctly (unlike mine),
>> then maybe your python process never goes to 100%, and so
>> connections are never dropped, and the SQLAlchemy error never happens.
>>
>> But as I've said, it's minor. I don't mind increasing it to 1000
>> connections in the pool.
>>
>> What I'm trying to do is to make my galaxy server responsive again -
>> because it's getting unusable.
>>
>> I'm almost certain it's a configuration problem, I just don't know
>> where and what to look for.
>>
>> Would you mind sharing your public server's configuration?
>> Which servers do you use? postgres/mysql versions? configuration files?
>> Anything would be very much appreciated.
>>
>>
>> -gordon
>>
>>
>>
>>
>> Kanwei Li wrote, On 01/20/2010 04:37 PM:
>>> From your error it seems that some sqlalchemy sessions are not being
>>> flushed. It could be a bug in galaxy, but it makes me wonder why
>>> main/test haven't failed in this manner. Anything else standing out in
>>> your logs? What is the memory usage, and does it seem to leak?
>>>
>>> Kanwei
>>>
>>> On Wed, Jan 20, 2010 at 4:25 PM, Assaf Gordon <gordon(a)cshl.edu> wrote:
>>>> Hi Kanwei,
>>>>
>>>> Kanwei Li wrote, On 01/20/2010 04:13 PM:
>>>>> Python's threads do not span over multiple cores/CPU's, so you are
>>>>> correct in only seeing 100%. Maybe you could try a cluster solution
>>>>> and keep a dedicated node for the webserver?
>>>> My server has 16 cores, and the avg. load is 2, so the problem is
>>>> not CPU overload.
>>>> Running jobs on the cluster should not affect the python process,
>>>> right? Especially if it is capped at 100%.
>>>>
>>>> My jobs are executed locally (with the local runner), but they do not
>>>> affect the main galaxy process (and neither do the set-metadata
>>>> processes, which are now executed externally).
>>>> It is the main python process which is stuck at 100%, and therefore
>>>> it doesn't respond to users' HTTP requests - which makes everything
>>>> very slow.
>>>>
>>>> BTW,
>>>> I suspect this is somehow related to the SQLAlchemy "QueuePool
>>>> limit" error (see full backtrace below);
>>>> it happens very quickly even with
>>>> "database_engine_option_pool_size=30" in the INI.
>>>> I think it happens when my galaxy is stuck: HTTP requests reach the
>>>> galaxy process, but after 90 seconds the apache proxy returns "proxy
>>>> error" to the client, and apache drops/cancels the connection to
>>>> galaxy.
>>>>
>>>> Could this be contributing to leaked SQLAlchemy pool connections?
>>>>
>>>> Thanks,
>>>> -gordon
>>>>
>>>> =======
>>>> Full stack-trace of a "queuepool limit" error, which happens very
>>>> often when my galaxy server gets slow:
>>>>
>>>> Traceback (most recent call last):
>>>>   File "/home/gordon/projects/galaxy_prod/lib/galaxy/jobs/runners/local.py", line 46, in run_next
>>>>     self.run_job( job_wrapper )
>>>>   File "/home/gordon/projects/galaxy_prod/lib/galaxy/jobs/runners/local.py", line 107, in run_job
>>>>     if job_wrapper.get_state() not in [ model.Job.states.ERROR, model.Job.states.DELETED ] and self.app.config.set_metadata_externally and job_wrapper.output_paths:
>>>>   File "/home/gordon/projects/galaxy_prod/lib/galaxy/jobs/__init__.py", line 471, in get_state
>>>>     job = self.sa_session.query( model.Job ).get( self.job_id )
>>>>   File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 507, in get
>>>>     return self._get(key, ident)
>>>>   File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1500, in _get
>>>>     return q.all()[0]
>>>>   File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1267, in all
>>>>     return list(self)
>>>>   File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1361, in __iter__
>>>>     return self._execute_and_instances(context)
>>>>   File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/query.py", line 1364, in _execute_and_instances
>>>>     result = self.session.execute(querycontext.statement, params=self._params, mapper=self._mapper_zero_or_none())
>>>>   File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/session.py", line 754, in execute
>>>>     return self.__connection(engine, close_with_result=True).execute(
>>>>   File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/orm/session.py", line 721, in __connection
>>>>     return engine.contextual_connect(**kwargs)
>>>>   File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/engine/base.py", line 1229, in contextual_connect
>>>>     return self.Connection(self, self.pool.connect(), close_with_result=close_with_result, **kwargs)
>>>>   File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 142, in connect
>>>>     return _ConnectionFairy(self).checkout()
>>>>   File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 304, in __init__
>>>>     rec = self._connection_record = pool.get()
>>>>   File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 161, in get
>>>>     return self.do_get()
>>>>   File "/home/gordon/projects/galaxy_prod/eggs/py2.5-noplatform/SQLAlchemy-0.5.6_dev_r6498-py2.5.egg/sqlalchemy/pool.py", line 628, in do_get
>>>>     raise exc.TimeoutError("QueuePool limit of size %d overflow %d reached, connection timed out, timeout %d" % (self.size(), self.overflow(), self._timeout))
>>>> TimeoutError: QueuePool limit of size 30 overflow 40 reached, connection timed out, timeout 30
>>>>
>>>>
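
As an aside on that last traceback: the "size 30" in the error is the
database_engine_option_pool_size = 30 setting mentioned above, and the
error is only raised once all of the pooled connections plus all of the
allowed overflow connections are checked out and none is returned within
the timeout. A sketch of the two relevant ini options (the max_overflow
name is an assumption based on the database_engine_option_ prefix
pattern - check the sample config before relying on it):

    database_engine_option_pool_size = 30
    database_engine_option_max_overflow = 40

If connections are leaking as suspected above, raising these numbers
will only postpone the error rather than fix it.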
_______________________________________________
galaxy-dev mailing list
galaxy-dev(a)lists.bx.psu.edu
http://lists.bx.psu.edu/listinfo/galaxy-dev