I did some more investigation of this issue.

I notice that my 4-core, 8-slot VM has a load average of 32 with only my 4 handler processes (plus my web server) running, and none of them is getting more than 10% of the CPU.

There seems to be some process in my handlers that takes an incredible amount of resources, even though top is not showing it (output below).

Does anyone have any idea how to figure out where the bottleneck is?

Is there a way to turn on more detailed logging perhaps to see what each process is doing?
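One thing that might help here, short of more logging: on Linux the load average counts threads that are runnable or in uninterruptible sleep (state D), so a load of 32 with idle CPUs could mean threads are blocked rather than busy. That is an assumption to check, not a diagnosis. Below is a minimal sketch that tallies the state of every thread of a process from /proc; the function names are made up for illustration and it assumes a Linux /proc filesystem.

```python
import glob
from collections import Counter


def thread_state(stat_line):
    """Return the one-letter state from a /proc/<pid>/task/<tid>/stat line.

    The command name sits in parentheses and may itself contain spaces or
    ')' characters, so split after the LAST ')' rather than on whitespace.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(stat_lines):
    """Tally thread states (R = running, S = sleeping, D = uninterruptible)."""
    return Counter(thread_state(line) for line in stat_lines)


def read_stat_lines(pid):
    """Read one stat line per thread of `pid` (Linux only)."""
    lines = []
    for path in glob.glob("/proc/%d/task/*/stat" % pid):
        try:
            with open(path) as handle:
                lines.append(handle.read())
        except IOError:
            # The thread exited between listing and opening; skip it.
            continue
    return lines
```

For example, `count_states(read_stat_lines(7190))` would summarize the threads of handler3 from the top output below; many threads in state D would point at blocking I/O rather than CPU work.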

My IT guy suggested there may be some “context switching” going on due to the many threads that are running (I use a thread pool of 7 for each server), but I am not sure how to address that issue…
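To put a number on the context-switching idea, the kernel exposes per-thread counters in /proc/<pid>/task/<tid>/status (the `voluntary_ctxt_switches` and `nonvoluntary_ctxt_switches` lines). The following is a rough sketch, assuming Linux, that sums those counters over all threads of a process and turns two samples into a rate; the helper names are hypothetical.

```python
import glob

FIELDS = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")


def parse_ctxt_switches(status_text):
    """Extract the two context-switch counters from one status file's text."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in FIELDS:
            counts[key] = int(value.strip())
    return counts


def total_ctxt_switches(pid):
    """Sum the counters over every thread of `pid` (Linux only)."""
    totals = dict((name, 0) for name in FIELDS)
    for path in glob.glob("/proc/%d/task/*/status" % pid):
        try:
            with open(path) as handle:
                counts = parse_ctxt_switches(handle.read())
        except IOError:
            # The thread exited between listing and opening; skip it.
            continue
        for name in FIELDS:
            totals[name] += counts.get(name, 0)
    return totals


def switch_rate(before, after, seconds):
    """Context switches per second between two samples."""
    return dict((k, (after[k] - before[k]) / float(seconds)) for k in after)
```

Sampling `total_ctxt_switches(pid)` twice, some seconds apart, and passing both results to `switch_rate` gives a per-second figure for each handler. A high nonvoluntary rate would suggest threads are being preempted while competing for CPU; a high voluntary rate suggests they are mostly waiting on something.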

Anyone?

top - 10:00:53 up 37 days, 19:29, 8 users, load average: 32.10, 32.10, 32.09
Tasks: 181 total, 1 running, 180 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.8%us, 2.5%sy, 0.0%ni, 92.5%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 16334504k total, 16164084k used, 170420k free, 127720k buffers
Swap: 4194296k total, 15228k used, 4179068k free, 2460252k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7190 svcgalax 20 0 2721m 284m 5976 S 9.9 1.8 142:53.84 python ./scripts/paster.py serve universe_wsgi.ini --server-name=handler3 --pid-file=handler3.pid --log-file=handler3.log --daemon
7183 svcgalax 20 0 2720m 286m 5984 S 6.4 1.8 135:52.63 python ./scripts/paster.py serve universe_wsgi.ini --server-name=handler2 --pid-file=handler2.pid --log-file=handler2.log --daemon
7175 svcgalax 20 0 2720m 287m 5976 S 5.6 1.8 117:59.40 python ./scripts/paster.py serve universe_wsgi.ini --server-name=handler1 --pid-file=handler1.pid --log-file=handler1.log --daemon
7166 svcgalax 20 0 3442m 2.7g 4884 S 4.6 17.5 74:31.66 python ./scripts/paster.py serve universe_wsgi.ini --server-name=web0 --pid-file=web0.pid --log-file=web0.log --daemon
7172 svcgalax 20 0 2720m 294m 5984 S 4.0 1.8 133:17.19 python ./scripts/paster.py serve universe_wsgi.ini --server-name=handler0 --pid-file=handler0.pid --log-file=handler0.log --daemon
1564 root 20 0 291m 13m 7552 S 0.3 0.1 1:49.65 /usr/sbin/httpd
7890 svcgalax 20 0 17216 1456 1036 S 0.3 0.0 2:15.73 top
10682 apache 20 0 297m 11m 3516 S 0.3 0.1 0:02.23 /usr/sbin/httpd
11224 apache 20 0 295m 11m 3236 S 0.3 0.1 0:00.29 /usr/sbin/httpd
11263 svcgalax 20 0 17248 1460 1036 R 0.3 0.0 0:00.06 top
1 root 20 0 21320 1040 784 S 0.0 0.0 0:00.95 /sbin/init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.01 [kthreadd]
3 root RT 0 0 0 0 S 0.0 0.0 0:06.35 [migration/0]

Regards,

Thon

Thon deBoer Ph.D., Bioinformatics Guru
California, USA |p: +1 (650) 799-6839 |m: thondeboer@me.com

From: galaxy-dev-bounces@lists.bx.psu.edu [mailto:galaxy-dev-bounces@lists.bx.psu.edu] On Behalf Of Thon Deboer
Sent: Wednesday, July 17, 2013 11:31 PM
To: galaxy-dev@lists.bx.psu.edu
Subject: [galaxy-dev] Jobs remain in queue until restart

Hi,

I have noticed that from time to time the job queue seems to be “stuck” and can only be unstuck by restarting Galaxy.

The jobs seem to be in the queued state, the Python job handler processes are hardly ticking over, and the cluster is empty.

When I restart, the startup procedure realizes that all jobs are in the “new” state and then assigns a job handler, after which the jobs start fine.

Any ideas?

Thon

P.S. I am using the June version of Galaxy and I DO set limits on my users in job_conf.xml, as shown below. (Maybe it is related? Before the queue went dormant, this user had started lots of jobs and may have hit the limit, but I assumed this limit was on the number of jobs running at one time, right?)

<?xml version="1.0"?>

<job_conf>

<!-- "workers" is the number of threads for the runner's work queue.

The default from <plugins> is used if not defined for a <plugin>.

-->

</plugins>

<!-- Additional job handlers - the id should match the name of a

[server:<id>] in universe_wsgi.ini.

-->

<!-- <handler id="handler10" tags="handlers"/>

-->

</handlers>

<!-- Destinations define details about remote resources and how jobs

should be executed on those remote resources.

-->

</destination>

</destination>

<param id="nativeSpecification">-V -q short.q -pe smp 1</param>

</destination>

</destination>

<!-- <destination id="real_user_cluster" runner="drmaa">

<param id="galaxy_external_runjob_script">scripts/drmaa_external_runner.py</param>

<param id="galaxy_external_killjob_script">scripts/drmaa_external_killer.py</param>

<param id="galaxy_external_chown_script">scripts/external_chown_script.py</param>

</destination> -->

<param id="type">python</param>

<param id="function">interactiveOrCluster</param>

</destination>

</destinations>

<tools>

<!-- Tools can be configured to use specific destinations or handlers,

identified by either the "id" or "tags" attribute. If assigned to

a tag, a handler or destination that matches that tag will be

chosen at random.

-->

</tools>

<!-- Certain limits can be defined.

-->

</limits>

</job_conf>