Hello,

On a local Galaxy server, I've got into a strange situation: several jobs are marked as "new", but none are starting. I've stopped and re-started the server, and got the following message:

-----
galaxy.jobs.runners.local DEBUG 2009-01-26 19:29:00,829 5 workers ready
galaxy.jobs.schedulingpolicy.roundrobin INFO 2009-01-26 19:29:00,829 RoundRobin policy: initialized
galaxy.jobs INFO 2009-01-26 19:29:00,829 job scheduler policy is galaxy.jobs.schedulingpolicy.roundrobin:UserRoundRobin
galaxy.jobs INFO 2009-01-26 19:29:00,829 job manager started
galaxy.jobs DEBUG 2009-01-26 19:29:00,952 no runner: 7886 is still in new state, adding to the jobs queue
galaxy.jobs DEBUG 2009-01-26 19:29:00,952 no runner: 7893 is still in new state, adding to the jobs queue
galaxy.jobs DEBUG 2009-01-26 19:29:00,952 no runner: 7896 is still in new state, adding to the jobs queue
galaxy.jobs DEBUG 2009-01-26 19:29:00,952 no runner: 7902 is still in new state, adding to the jobs queue
galaxy.jobs DEBUG 2009-01-26 19:29:00,952 no runner: 7904 is still in new state, adding to the jobs queue
galaxy.jobs DEBUG 2009-01-26 19:29:00,952 no runner: 7905 is still in new state, adding to the jobs queue
galaxy.jobs DEBUG 2009-01-26 19:29:00,952 no runner: 7906 is still in new state, adding to the jobs queue
galaxy.jobs DEBUG 2009-01-26 19:29:00,952 no runner: 7907 is still in new state, adding to the jobs queue
galaxy.jobs DEBUG 2009-01-26 19:29:00,952 no runner: 7908 is still in new state, adding to the jobs queue
galaxy.jobs INFO 2009-01-26 19:29:00,971 job stopper started
-----

But even after a server restart, no jobs are starting (I've waited for about a minute after the restart).

Is there any configuration setting that can cause these jobs to start when I restart the server? (Or one that causes the 'stale' jobs to be deleted?)

Thanks,
Gordon.
Assaf Gordon wrote:
On a local Galaxy server, I've got into a strange situation: several jobs are marked as "new", but none are starting.
I've stopped and re-started the server, and got the following message:
But even after a server restart - no jobs are starting (I've waited for about a minute after restart).
Gordon,

You may want to check the jobs' ancestors; generally, they should only remain in the new state if jobs upon which they depend have not yet completed.

--nate
Nate,

Nate Coraor wrote, on 01/27/2009 09:18 AM:
Assaf Gordon wrote:
On a local Galaxy server, I've got into a strange situation: several jobs are marked as "new", but none are starting.
You may want to check the jobs' ancestors; generally, they should only remain in the new state if jobs upon which they depend have not yet completed.
The same situation happened again on my Galaxy server. How do I check the jobs' ancestors?

Other than the jobs marked 'new', there are no other jobs (running or waiting).

What I currently do is stop the server, manually reset the jobs with the following SQL command:

UPDATE job set state='error' where state='new';

And re-start Galaxy.

There are some side-effects to this operation: there are no running/waiting jobs, but the history-list pane still shows running/waiting jobs. But without this manual intervention, new jobs queued by the users are not started.

Thanks,
Gordon.
Assaf Gordon wrote:
The same situation happened again on my Galaxy server. How do I check the jobs' ancestors?
It's non-trivial: you have to query that job's input datasets and then query the state of the jobs that created those datasets.
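For reference, a rough sketch of such a query: the job_to_input_dataset and job_to_output_dataset association table names are assumptions taken from the Galaxy model (verify them against your own schema), and job id 7886 is simply one of the stuck jobs from the startup log above.

    -- Sketch only: list the "parent" jobs that produced the inputs of job 7886,
    -- along with their states.  Verify the table and column names against your
    -- database before relying on this.
    SELECT parent.id, parent.state, parent.tool_id
      FROM job_to_input_dataset AS jtid
      JOIN job_to_output_dataset AS jtod ON jtod.dataset_id = jtid.dataset_id
      JOIN job AS parent ON parent.id = jtod.job_id
     WHERE jtid.job_id = 7886;

Any parent job returned here that is not in the 'ok' state would explain why the dependent job stays 'new'.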
Other than the jobs marked 'new', there are no other jobs (running or waiting).
What I currently do is stop the server, manually reset the jobs with the following SQL command: UPDATE job set state='error' where state='new';
And re-start Galaxy.
Can you try enabling the FIFO queue in universe_wsgi.ini?
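For reference, a guess at what that change might look like in universe_wsgi.ini - the option name is not confirmed anywhere in this thread, so treat it as an assumption and check universe_wsgi.ini.sample for the exact spelling. The idea is to disable the round-robin policy visible in the startup log so the job queue falls back to a plain FIFO:

    # universe_wsgi.ini (option name assumed; see universe_wsgi.ini.sample)
    # Commenting out the round-robin scheduling policy should leave the
    # default FIFO job queue in place:
    #job_scheduler_policy = galaxy.jobs.schedulingpolicy.roundrobin:UserRoundRobin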
There are some side-effects to this operation, as there are no running/waiting jobs but the history-list pane still shows running/waiting jobs.
These should go away if you refresh the history pane? If not, there is something Very Wrong here.
But without this manual intervention, new jobs queued by the users are not started.
'new' state jobs, in general, should not block the queue. This almost seems to indicate that something is killing the job queue thread. Any tracebacks in the log file?

--nate
Hello,

Nate Coraor wrote, on 01/27/2009 01:22 PM:
Assaf Gordon wrote:
What I currently do is stop the server, manually reset the jobs with the following SQL command: UPDATE job set state='error' where state='new'; There are some side-effects to this operation: there are no running/waiting jobs, but the history-list pane still shows running/waiting jobs.
These should go away if you refresh the history pane? If not, there is something Very Wrong here.
Well, my Galaxy database is definitely a mess. It has endured many crashes and exceptions, some of them happening in the middle of long workflows, which left some jobs waiting on other, non-existing jobs and datasets... So I'm guessing that most of what I'm seeing wouldn't really happen in a stable Galaxy installation. However, there are two issues I've found which affect the perceived stability.

First, the 'state' column in the DATASET table is maintained independently of the 'state' column in the JOB table. Stopping and re-starting Galaxy while there are datasets marked as 'running' somehow affects the jobs (or maybe I just imagined it?). It also affects how datasets are reported in the history pane: if the DATASET table has datasets marked as 'new' / 'running', they will appear as new (grey) / running (yellow) even if there are no jobs running or waiting.

Second, the 'visible' and 'deleted' columns in the HISTORY_DATASET_ASSOCIATION table. I somehow got into a situation where a row in that table had: INFO field = "Unable to Finish Job", DELETED field = FALSE, VISIBLE field = FALSE. That dataset appeared as an error (red box) in the history list, but when the user switched to that history, he (obviously) didn't see any red boxes (I guess because of VISIBLE=FALSE). Very confusing indeed.

I've removed all the 'running' things (datasets / jobs) and re-started Galaxy - I hope things will calm down now.

Thanks for all your help.
Gordon.
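For anyone cleaning up a similar mess, a rough sketch of the dataset-side reset described above - the table and column names come from the tables mentioned in this thread, but the exact set of state values is an assumption, so back up the database and check what actually appears in the state column first:

    -- Sketch only: reset datasets left in an in-progress state so they stop
    -- showing up as grey/yellow boxes in the history pane.  Back up the
    -- database and confirm the state values before running this.
    UPDATE dataset
       SET state = 'error'
     WHERE state IN ('new', 'queued', 'running');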