Assaf Gordon wrote:
Users are submitting jobs, and the jobs start almost immediately.
However, the jobs look like they are running for a very long time.
Checking the report web page, a job looks like the attached image.
The status is "running", but the command line is empty, and no program
was executed for this job (I checked with "ps ax -H" and looked for the job's process).
Some technical information:
Running on Fedora with Python 2.4.3, PostgreSQL 8.0.
The server is busy but not overloaded (load average 14.4 on 16 cores).
Two config options could be the cause here:
local_job_queue_workers (default: 5)
cluster_job_queue_workers (default: 3)
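If these limits turn out to be the bottleneck, raising them might look like this (values are illustrative only, assuming the ini-style Galaxy config file of that era):

```ini
# Illustrative values, not a recommendation -- tune to your workload.
[app:main]
local_job_queue_workers = 10
cluster_job_queue_workers = 6
```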
The local workers are the threads available for actually running jobs.
To allow jobs to be tracked in the database, they are moved to the
'running' state before execution. At that point,
if there are not enough threads available, they may sit in the local job
runner's queue until a thread becomes available.
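A minimal sketch of this behavior (illustrative only, not Galaxy's actual runner code): jobs are flagged 'running' in the tracking table before a worker thread picks them up, so when there are more jobs than threads, some 'running' jobs are still just sitting in the queue.

```python
import queue
import threading
import time

NUM_WORKERS = 2              # stands in for local_job_queue_workers
job_state = {}               # job id -> state; stands in for the job table
work_queue = queue.Queue()

def submit(job_id):
    # Marked 'running' before any worker thread actually executes it,
    # which is exactly what makes idle jobs look "running".
    job_state[job_id] = 'running'
    work_queue.put(job_id)

def worker():
    while True:
        job_id = work_queue.get()
        if job_id is None:      # sentinel: shut the worker down
            break
        time.sleep(0.1)         # pretend to execute the job's command line
        job_state[job_id] = 'ok'

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for i in range(5):
    submit(i)
# All five jobs report 'running' here, though only NUM_WORKERS can be executing.
snapshot = sorted(job_state.values())

for _ in threads:
    work_queue.put(None)
for t in threads:
    t.join()
# Once the threads catch up, every job reaches 'ok'.
```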
The cluster workers option defines the number of threads available to
run the prepare/finish methods of jobs running via pbs/sge (and does not
control the total number of jobs that can run - the cluster scheduler
does that). Jobs with long finish methods (e.g. setting metadata on
large datasets) could consume all of these threads, preventing new jobs
from making it to the pbs/sge queue until threads are available.
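The starvation described above can be sketched like this (hypothetical names, not Galaxy code): prepare and finish share the same small pool, so a slow finish keeps a new job's prepare, and hence its submission to pbs/sge, waiting.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Stands in for cluster_job_queue_workers; one thread makes the effect obvious.
POOL = ThreadPoolExecutor(max_workers=1)
events = []

def finish(job_id):
    time.sleep(0.2)                       # e.g. setting metadata on a large dataset
    events.append(('finished', job_id))

def prepare(job_id):
    events.append(('submitted', job_id))  # where the qsub would happen

POOL.submit(finish, 1)    # a finishing job occupies the only thread...
POOL.submit(prepare, 2)   # ...so this job cannot reach the cluster queue yet
POOL.shutdown(wait=True)
# events is [('finished', 1), ('submitted', 2)]: the new job's submission
# had to wait for the old job's finish method.
```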
Switching jobs to 'running' before execution is probably wrong, since
they are not actually running yet. It probably makes sense to add a new
job state for this limbo period.
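A rough sketch of how that extra state might fit in (state names and helpers are hypothetical, not actual Galaxy code):

```python
class JobState:
    NEW = 'new'
    QUEUED = 'queued'     # proposed: dispatched, but waiting for a runner thread
    RUNNING = 'running'   # a thread is actually executing the job
    OK = 'ok'
    ERROR = 'error'

def dispatch(job):
    # Instead of jumping straight to RUNNING, record that the job is
    # waiting in the runner's queue.
    job['state'] = JobState.QUEUED

def execute(job):
    # Set RUNNING only when a worker thread actually picks the job up.
    job['state'] = JobState.RUNNING

job = {'state': JobState.NEW}
dispatch(job)
# job['state'] is 'queued' while it waits, not the misleading 'running'
execute(job)
# job['state'] is 'running' only once execution really starts
```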