Jobs remain in queue until restart
Hi,

I have noticed that from time to time the job queue seems to be "stuck" and can only be unstuck by restarting Galaxy. The jobs sit in the queued state, the python job handler processes are hardly ticking over, and the cluster is empty. When I restart, the startup procedure notices that all jobs are in the "new" state and assigns each one a job handler, after which the jobs start fine. Any ideas?

Thon

P.S. I am using the June version of Galaxy and I DO set limits on my users in job_conf.xml, as below. (Maybe it is related? Before it went into dormant mode, this user had started lots of jobs and may have hit the limit, but I assumed this limit was the number of running jobs at one time, right?)

<?xml version="1.0"?>
<job_conf>
    <plugins workers="4">
        <!-- "workers" is the number of threads for the runner's work queue.
             The default from <plugins> is used if not defined for a <plugin>. -->
        <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner" workers="2"/>
        <plugin id="drmaa" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner" workers="8"/>
        <plugin id="cli" type="runner" load="galaxy.jobs.runners.cli:ShellJobRunner" workers="2"/>
    </plugins>
    <handlers default="handlers">
        <!-- Additional job handlers - the id should match the name of a [server:<id>] in universe_wsgi.ini. -->
        <handler id="handler0" tags="handlers"/>
        <handler id="handler1" tags="handlers"/>
        <handler id="handler2" tags="handlers"/>
        <handler id="handler3" tags="handlers"/>
        <!--
        <handler id="handler10" tags="handlers"/>
        <handler id="handler11" tags="handlers"/>
        <handler id="handler12" tags="handlers"/>
        <handler id="handler13" tags="handlers"/>
        -->
    </handlers>
    <destinations default="regularjobs">
        <!-- Destinations define details about remote resources and how jobs should be executed on those remote resources. -->
        <destination id="local" runner="local"/>
        <destination id="regularjobs" runner="drmaa" tags="cluster">
            <!-- These are the parameters for qsub, such as queue etc. -->
            <param id="nativeSpecification">-V -q long.q -pe smp 1</param>
        </destination>
        <destination id="longjobs" runner="drmaa" tags="cluster,long_jobs">
            <param id="nativeSpecification">-V -q long.q -pe smp 1</param>
        </destination>
        <destination id="shortjobs" runner="drmaa" tags="cluster,short_jobs">
            <param id="nativeSpecification">-V -q short.q -pe smp 1</param>
        </destination>
        <destination id="multicorejobs4" runner="drmaa" tags="cluster,multicore_jobs">
            <param id="nativeSpecification">-V -q long.q -pe smp 4</param>
        </destination>
        <!--
        <destination id="real_user_cluster" runner="drmaa">
            <param id="galaxy_external_runjob_script">scripts/drmaa_external_runner.py</param>
            <param id="galaxy_external_killjob_script">scripts/drmaa_external_killer.py</param>
            <param id="galaxy_external_chown_script">scripts/external_chown_script.py</param>
        </destination>
        -->
        <destination id="dynamic" runner="dynamic">
            <!-- A destination that represents a method in the dynamic runner. -->
            <param id="type">python</param>
            <param id="function">interactiveOrCluster</param>
        </destination>
    </destinations>
    <tools>
        <!-- Tools can be configured to use specific destinations or handlers, identified by either the "id" or "tags" attribute.
             If assigned to a tag, a handler or destination that matches that tag will be chosen at random. -->
        <tool id="bwa_wrapper" destination="multicorejobs4"/>
    </tools>
    <limits>
        <!-- Certain limits can be defined.
        <limit type="registered_user_concurrent_jobs">500</limit>
        <limit type="unregistered_user_concurrent_jobs">1</limit>
        <limit type="concurrent_jobs" id="local">1</limit>
        <limit type="concurrent_jobs" tag="cluster">200</limit>
        <limit type="concurrent_jobs" tag="long_jobs">200</limit>
        <limit type="concurrent_jobs" tag="short_jobs">200</limit>
        <limit type="concurrent_jobs" tag="multicore_jobs">100</limit>
        -->
    </limits>
</job_conf>
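For reference, enabling one of the commented limits just means moving that <limit> line out of the comment block; a minimal sketch with only the per-user cap active, reusing the illustrative value from the comment above:

<limits>
    <!-- cap the number of concurrent jobs per registered user; 500 is the example value from the sample config -->
    <limit type="registered_user_concurrent_jobs">500</limit>
</limits>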
Hi Thon Deboer,

I am new to Galaxy. I installed my Galaxy with Torque 2.5.0, and Galaxy uses the pbs module to interface with TORQUE. But I have some questions about the job_conf.xml:

1.) In your job_conf.xml you use regularjobs, longjobs, shortjobs, ... to run different jobs. How does Galaxy know which tool belongs to regularjobs or longjobs? And what is the meaning of "nativeSpecification"?
2.) Should we use the <tools> collection, with <tool id="bwa_wrapper" destination="multicorejobs4"/>, to specify bwa? Does it mean that bwa belongs to multicorejobs4 and runs on the cluster?
3.) Does every tool need us to specify which destination it belongs to?

I saw http://wiki.galaxyproject.org/Admin/Config/Jobs about this, but I am not sure about the above. Could you help me please?

shenwiyn

From: Thon Deboer
Date: 2013-07-18 14:31
To: galaxy-dev
Subject: [galaxy-dev] Jobs remain in queue until restart
On Jul 31, 2013, at 8:52 AM, shenwiyn <shenwiyn@gmail.com> wrote:
Hi Thon Deboer, I am new to Galaxy. I installed my Galaxy with Torque 2.5.0, and Galaxy uses the pbs module to interface with TORQUE. But I have some questions about the job_conf.xml: 1.) In your job_conf.xml you use regularjobs, longjobs, shortjobs, ... to run different jobs. How does Galaxy know which tool belongs to regularjobs or longjobs? And what is the meaning of "nativeSpecification"?
By specifying, as Thon did, the tool's id and its "destination" in the <tools> section; the destination carries the settings. nativeSpecification lets you set additional parameters that are passed along with the submission call; e.g. -pe smp 4 tells the grid engine to use the parallel environment smp with 4 cores.
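Concretely, the mapping Thon uses looks like the fragment below (lifted from the job_conf.xml quoted earlier; the <destination> element sits under <destinations> and the <tool> element under <tools>):

<destination id="multicorejobs4" runner="drmaa" tags="cluster,multicore_jobs">
    <!-- handed to the DRM as native qsub options: long.q queue, smp parallel environment, 4 slots -->
    <param id="nativeSpecification">-V -q long.q -pe smp 4</param>
</destination>

<tool id="bwa_wrapper" destination="multicorejobs4"/>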
2.) Should we use the <tools> collection, with <tool id="bwa_wrapper" destination="multicorejobs4"/>, to specify bwa? Does it mean that bwa belongs to multicorejobs4 and runs on the cluster?
Exactly.
3.) Does every tool need us to specify which destination it belongs to? I saw http://wiki.galaxyproject.org/Admin/Config/Jobs about this, but I am not sure about the above. Could you help me please?
Fortunately there is a default.
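That default is the default attribute on <destinations> in the configuration above; any tool without its own <tool> entry should fall through to it:

<destinations default="regularjobs">
    <!-- tools with no explicit <tool> mapping run here -->
    <destination id="regularjobs" runner="drmaa" tags="cluster">
        <param id="nativeSpecification">-V -q long.q -pe smp 1</param>
    </destination>
    <!-- other destinations omitted -->
</destinations>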
Hi Shenwiyn,

The definitions of regularjobs etc. are there to allow each job to be run under a different environment on the cluster. I am actually not using most of those definitions, except for the BWA tool, which I want to run using 4 slots on our cluster, so I use the destination "multicorejobs4".

nativeSpecification holds the options you would give if you were to use the qsub command directly to submit jobs to the cluster; -V -q short.q -pe smp 1 is what I normally use for a job that is fast, for instance.

You don't need to specify the destination for each tool, since the destinations section has a default, which is "regularjobs" in my case. Only if you want to do something other than submit a regular job (which takes only one slot) do you need to define something else, like I did for BWA.

Hope that helps.

Regards,
Thon

Thon deBoer Ph.D., Bioinformatics Guru
California, USA | p: +1 (650) 799-6839 | m: thondeboer@me.com

From: shenwiyn [mailto:shenwiyn@gmail.com]
Sent: Tuesday, July 30, 2013 11:52 PM
To: Thon Deboer
Cc: galaxy-dev@lists.bx.psu.edu
Subject: shall us specify which tool run in cluster?
I did some more investigation of this issue.

I notice that my 4-core, 8-slot VM has a load of 32 with only my 4 handler processes running (plus my web server), yet none of them gets more than 10% of a CPU. Something in my handlers appears to take an incredible amount of resources, even though top is not showing it (see below).

Does anyone have an idea how to figure out where the bottleneck is? Is there a way to turn on more detailed logging, perhaps, to see what each process is doing?

My IT guy suggested there may be some "context switching" going on due to the many threads that are running (I use a threadpool of 7 for each server), but I am not sure how to address that issue.

Anyone?

top - 10:00:53 up 37 days, 19:29, 8 users, load average: 32.10, 32.10, 32.09
Tasks: 181 total, 1 running, 180 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.8%us, 2.5%sy, 0.0%ni, 92.5%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 16334504k total, 16164084k used, 170420k free, 127720k buffers
Swap: 4194296k total, 15228k used, 4179068k free, 2460252k cached

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
 7190 svcgalax 20  0 2721m 284m 5976 S  9.9  1.8 142:53.84 python ./scripts/paster.py serve universe_wsgi.ini --server-name=handler3 --pid-file=handler3.pid --log-file=handler3.log --daemon
 7183 svcgalax 20  0 2720m 286m 5984 S  6.4  1.8 135:52.63 python ./scripts/paster.py serve universe_wsgi.ini --server-name=handler2 --pid-file=handler2.pid --log-file=handler2.log --daemon
 7175 svcgalax 20  0 2720m 287m 5976 S  5.6  1.8 117:59.40 python ./scripts/paster.py serve universe_wsgi.ini --server-name=handler1 --pid-file=handler1.pid --log-file=handler1.log --daemon
 7166 svcgalax 20  0 3442m 2.7g 4884 S  4.6 17.5  74:31.66 python ./scripts/paster.py serve universe_wsgi.ini --server-name=web0 --pid-file=web0.pid --log-file=web0.log --daemon
 7172 svcgalax 20  0 2720m 294m 5984 S  4.0  1.8 133:17.19 python ./scripts/paster.py serve universe_wsgi.ini --server-name=handler0 --pid-file=handler0.pid --log-file=handler0.log --daemon
 1564 root     20  0  291m  13m 7552 S  0.3  0.1   1:49.65 /usr/sbin/httpd
 7890 svcgalax 20  0 17216 1456 1036 S  0.3  0.0   2:15.73 top
10682 apache   20  0  297m  11m 3516 S  0.3  0.1   0:02.23 /usr/sbin/httpd
11224 apache   20  0  295m  11m 3236 S  0.3  0.1   0:00.29 /usr/sbin/httpd
11263 svcgalax 20  0 17248 1460 1036 R  0.3  0.0   0:00.06 top
    1 root     20  0 21320 1040  784 S  0.0  0.0   0:00.95 /sbin/init
    2 root     20  0     0    0    0 S  0.0  0.0   0:00.01 [kthreadd]
    3 root     RT  0     0    0    0 S  0.0  0.0   0:06.35 [migration/0]

Regards,
Thon

Thon deBoer Ph.D., Bioinformatics Guru
California, USA | p: +1 (650) 799-6839 | m: thondeboer@me.com

From: galaxy-dev-bounces@lists.bx.psu.edu [mailto:galaxy-dev-bounces@lists.bx.psu.edu] On Behalf Of Thon Deboer
Sent: Wednesday, July 17, 2013 11:31 PM
To: galaxy-dev@lists.bx.psu.edu
Subject: [galaxy-dev] Jobs remain in queue until restart
On Aug 2, 2013, at 1:06 PM, Thon de Boer wrote:
I did some more investigation of this issue.

I notice that my 4-core, 8-slot VM has a load of 32 with only my 4 handler processes running (plus my web server), yet none of them gets more than 10% of a CPU. Something in my handlers appears to take an incredible amount of resources, even though top is not showing it.

Does anyone have an idea how to figure out where the bottleneck is? Is there a way to turn on more detailed logging, perhaps, to see what each process is doing?

My IT guy suggested there may be some "context switching" going on due to the many threads that are running (I use a threadpool of 7 for each server), but I am not sure how to address that issue.
Hi Thon,

It looks like it's probably the memory use - if you restart the Galaxy processes, do you see any change?

--nate
I don't think it's a memory issue (but what made you say that?), since each process is hardly using any memory. VIRT in top shows 2.7GB per python process, but RES only ever gets to about 250MB and I have a 16GB machine (swap is only 4GB, but none of it is being used either), so I don't think memory is the issue.

I AM running this on a VM, but the physical machine is not doing much either.

I'll run a profiler on it to see what is causing the massive load.

Thon
participants (6)
- Anthonius deBoer
- Ido Tamir
- Nate Coraor
- shenwiyn
- Thon de Boer
- Thon Deboer