Assaf Gordon wrote:
Users are submitting jobs, and the jobs start almost immediately. However, the jobs appear to run for a very long time. On the report web page, a job looks like the attached image.
The status is "running", but the command line is empty, and no program was executed for this job (I checked with "ps ax -H" and looked for Python's child processes).
Some technical information: running on Fedora with Python 2.4.3 and PostgreSQL 8.0.
The server is busy but not overloaded (load average 14.4 on 16 cores).
Hi Gordon,

Two config options could be the cause here:

  local_job_queue_workers (default: 5)
  cluster_job_queue_workers (default: 3)

The local workers are the threads available for actually running jobs. So that job tracking can occur in the database, jobs are moved to the 'running' state before execution. At that point, if there are not enough threads available, they sit in the local job runner's queue until a thread becomes available.

The cluster workers option defines the number of threads available to run the prepare/finish methods of jobs submitted via pbs/sge (it does not control the total number of jobs that can run - the cluster scheduler does that). Jobs with long finish methods (e.g. setting metadata on large datasets) can consume all of these threads, preventing new jobs from reaching the pbs/sge queue until a thread is free.

Switching to 'running' before execution is probably wrong, since those jobs are not actually running. It probably makes sense to create a new job state for this limbo position.

--nate
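To illustrate the behaviour described above, here is a minimal sketch (not Galaxy's actual code; the class and function names are made up for illustration) of a dispatcher that flags jobs as 'running' when it queues them, while only a fixed pool of worker threads, analogous to local_job_queue_workers, ever executes them. A queued job can therefore report 'running' even though no process exists for it yet:

```python
import queue
import threading

# Minimal sketch, assuming a dispatcher/worker-pool design: jobs are
# flagged 'running' at enqueue time, but only `workers` threads
# (cf. local_job_queue_workers, default 5) actually execute them.

class Job:
    def __init__(self, name):
        self.name = name
        self.state = 'new'      # what the report page would show
        self.executed = False   # whether a thread actually ran it

def worker(q):
    while True:
        job = q.get()
        if job is None:         # sentinel: shut the thread down
            break
        job.executed = True     # only now does real work begin
        job.state = 'ok'
        q.task_done()

def dispatch(jobs, workers=2):
    q = queue.Queue()
    for job in jobs:
        job.state = 'running'   # set *before* any thread picks the job up
        q.put(job)
    # Snapshot before the pool starts: every job already says 'running',
    # yet none has an actual process behind it.
    snapshot = [(j.state, j.executed) for j in jobs]
    threads = [threading.Thread(target=worker, args=(q,)) for _ in range(workers)]
    for t in threads:
        t.start()
    q.join()                    # wait until every job has actually run
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()
    return snapshot

jobs = [Job('job%d' % i) for i in range(6)]
snapshot = dispatch(jobs)
```

In the snapshot, all six jobs read 'running' with executed=False, which matches the symptom: status "running", empty command line, no child process. A separate limbo state (as suggested above) would distinguish "queued for a thread" from "actually executing".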