Assaf Gordon wrote:
My Galaxy uses the local scheduler with the Round Robin policy (and 5 local job queue workers).
The "Unfinished jobs" report page shows 5 running jobs (really running with command line) and several other "limbo-running" jobs (and tons of "new" jobs).
New jobs are dependent on other, running jobs and some backup can be expected (especially if waiting on early steps in a workflow).
The problem is that the Galaxy Python process has only 4 child processes (instead of the expected 5).
I double-checked by grepping for the command line that the "Unfinished jobs" page shows - it doesn't exist in the process list ($ ps ax -H).
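For anyone who wants to repeat this check across several jobs at once, here is a minimal Python sketch of the same idea: compare the command lines shown on the report page against the process list. The function names and the sample command lines are hypothetical and are not part of Galaxy; the job command lines have to be copied from the report page by hand, and the `ps ax -o pid=,args=` output format is assumed to be available on the host.

```python
import subprocess


def parse_ps(output):
    """Parse `ps ax -o pid=,args=` output into (pid, command line) pairs."""
    procs = []
    for line in output.splitlines():
        parts = line.strip().split(None, 1)
        # Skip blank or malformed lines that lack a PID and a command.
        if len(parts) == 2 and parts[0].isdigit():
            procs.append((int(parts[0]), parts[1]))
    return procs


def list_processes():
    """Return the current process list (assumes a ps that supports -o)."""
    result = subprocess.run(
        ["ps", "ax", "-o", "pid=,args="],
        capture_output=True, text=True, check=True,
    )
    return parse_ps(result.stdout)


def find_missing(job_commands, procs):
    """Return the job command lines that match no running process."""
    return [
        cmd for cmd in job_commands
        if not any(cmd in args for _, args in procs)
    ]


if __name__ == "__main__":
    # Hypothetical command lines, copied from the "Unfinished jobs" page.
    reported = ["bowtie -p 4 reads.fq", "sam_to_bam.py in.sam"]
    for cmd in find_missing(reported, list_processes()):
        print("reported as running but not in process list:", cmd)
```

Any command line it prints is a job that the report page lists as running but that has no matching process, which is the situation described above.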
So it appears Galaxy missed the termination of the job, and one queue worker will be lost forever. The only hint I have regarding this is that it was a long-running job, and the user canceled it before it completed (I actually can't tell whether it was really executing or just limbo-running).
It's possible that the job is still running its finish method (perhaps a new 'finishing' state is also in order). This can be a lengthy process for large datasets where setting metadata is complex.
Is there a way to release the queue worker (besides restarting Galaxy)?
Currently, no. --nate