Re: [galaxy-dev] Not-So-Running Jobs

24 Apr 2009


Assaf Gordon wrote:
...
My Galaxy uses the local scheduler with a Round Robin policy (and 5 local job 
queue workers).
The "Unfinished jobs" report page shows 5 running jobs (really running 
with command line) and several other "limbo-running" jobs (and tons of 
"new" jobs).
New jobs are dependent on other, running jobs and some backup can be 
expected (especially if waiting on early steps in a workflow).
...
The problem is that the galaxy python process has only 4 child-processes 
(instead of the expected 5).
I double-checked by grepping for the command line that the "unfinished 
jobs" page shows - it doesn't exist in the process list ($ ps ax -H).
So it appears galaxy missed the termination of the job, and one queue 
worker will be forever lost.
The only hint I have regarding this is that it was a long running job, 
and the user canceled it before it was completed (I actually can't tell 
if it was executed or just limbo-running).
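
The manual check described above (comparing the command lines on the "unfinished jobs" page against the children of the Galaxy process) could be scripted roughly as follows. This is only a sketch under assumptions: the function names are hypothetical, the input is assumed to be the text printed by `ps -eo pid,ppid,args`, and only direct children of the given parent PID are inspected, so a job started through an intermediate wrapper process would be reported as missing.

```python
def parse_ps(ps_output):
    """Parse the text of `ps -eo pid,ppid,args` into (pid, ppid, command) tuples."""
    rows = []
    # The first line is assumed to be the column header and is skipped.
    for line in ps_output.strip().splitlines()[1:]:
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # no command text on this line
        pid, ppid, command = parts
        rows.append((int(pid), int(ppid), command))
    return rows


def children_of(rows, parent_pid):
    """Return the rows whose parent PID is parent_pid (direct children only)."""
    return [row for row in rows if row[1] == parent_pid]


def missing_jobs(rows, parent_pid, reported_commands):
    """Return the reported command lines that match no child of parent_pid."""
    running = [command for _, _, command in children_of(rows, parent_pid)]
    return [
        reported
        for reported in reported_commands
        if not any(reported in command for command in running)
    ]
```

Given the `ps` output, the PID of the Galaxy process, and the command lines copied from the report page, `missing_jobs` lists the jobs that are reported as running but have no matching child process.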
It's possible that the job is still running its finish method (perhaps a 
new 'finishing' state is also in order).  This can be a lengthy process 
for large datasets where setting metadata is complex.
...
Is there a way to release the queue worker (besides restarting Galaxy)?
Currently, no.

--nate

Nate Coraor