Assaf Gordon wrote:
Users are submitting jobs, and the jobs start almost immediately. However, the jobs appear to run for a very long time. On the report web page, a job looks like the attached image.
The status is "running", but the command line is empty, and no program was executed for this job (I checked with "ps ax -H" and looked for Python's child processes).
Some technical information: running on Fedora with Python 2.4.3 and PostgreSQL 8.0.
The server is busy but not overloaded (load average 14.4 on 16 cores).
Hi Gordon,

Two config options could be the cause here:

  local_job_queue_workers (default: 5)
  cluster_job_queue_workers (default: 3)

The local workers are the threads available for actually running jobs. So that job tracking can occur in the database, jobs are moved to the 'running' state before execution. At that point, if there are not enough threads available, they sit in the local job runner's queue until a thread becomes available.

The cluster workers option defines the number of threads available to run the prepare/finish methods of jobs submitted via pbs/sge (it does not control the total number of jobs that can run - the cluster scheduler does that). Jobs with long finish methods (e.g. setting metadata on large datasets) can consume all of these threads, preventing new jobs from reaching the pbs/sge queue until a thread is free.

Switching to 'running' before execution is probably wrong, since those jobs are not actually running. It probably makes sense to create a new job state for this limbo position.

--nate
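To illustrate the behaviour described above, here is a minimal sketch (not Galaxy's actual code; the class and function names are made up for illustration) of a dispatcher that flags jobs as 'running' when it queues them, while only a fixed pool of worker threads, analogous to local_job_queue_workers, ever executes them. A queued job can therefore report 'running' even though no process exists for it yet:

```python
import queue
import threading

# Minimal sketch, assuming a dispatcher/worker-pool design: jobs are
# flagged 'running' at enqueue time, but only `workers` threads
# (cf. local_job_queue_workers, default 5) actually execute them.

class Job:
    def __init__(self, name):
        self.name = name
        self.state = 'new'      # what the report page would show
        self.executed = False   # whether a thread actually ran it

def worker(q):
    while True:
        job = q.get()
        if job is None:         # sentinel: shut the thread down
            break
        job.executed = True     # only now does real work begin
        job.state = 'ok'
        q.task_done()

def dispatch(jobs, workers=2):
    q = queue.Queue()
    for job in jobs:
        job.state = 'running'   # set *before* any thread picks the job up
        q.put(job)
    # Snapshot before the pool starts: every job already says 'running',
    # yet none has an actual process behind it.
    snapshot = [(j.state, j.executed) for j in jobs]
    threads = [threading.Thread(target=worker, args=(q,)) for _ in range(workers)]
    for t in threads:
        t.start()
    q.join()                    # wait until every job has actually run
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()
    return snapshot

jobs = [Job('job%d' % i) for i in range(6)]
snapshot = dispatch(jobs)
```

In the snapshot, all six jobs read 'running' with executed=False, which matches the symptom: status "running", empty command line, no child process. A separate limbo state (as suggested above) would distinguish "queued for a thread" from "actually executing".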