Some workflows not scheduled until handler restart
Hi all,

On our installation (v15.07) we suddenly see that one of our two job handlers gets stuck with a high CPU load (the last log message is usually `cleaning up external metadata files`) and no new messages appear after that. In addition, when running workflows in batch (more than 6 at a time), only a few of them (~3) get their workflow steps/jobs scheduled (LSF-DRMAA). For the remaining ~3, their new histories are created but remain empty (according to the GUI). Only after restarting the two job handlers are the remaining workflow steps scheduled and shown in the history.

First question: how do we resolve this issue? Second: how does this actually work, i.e. how are the workflow steps stored in the database, and why are they not shown in the web interface until they are processed by a handler?

Possibly relevant config settings:

[server:handler0]
use_threadpool = true
threadpool_workers = 5

[server:handler1]
use_threadpool = true
threadpool_workers = 5

[app:main]
force_beta_workflow_scheduled_min_steps=1
force_beta_workflow_scheduled_for_collections=True
track_jobs_in_database = True
enable_job_recovery = True
retry_metadata_internally = False
cache_user_job_count = True
# only a limit set for the very few local tools like upload

Cheers,
Jelle
Hi all,

I work with Jelle and want to add to the issue. Help would be *greatly* appreciated as this is a *major* stopper on our production server right now.

In the database 'workflow_invocation' table, one can see a 'state' column with values like 'scheduled' or 'failed'. Before December 18, I only see the values 'scheduled' or 'failed'. After this date, a new state appeared: 'new', and it is always associated with handler 1 (we have 2 job handlers, i.e. '0' and '1'). As time goes on, we see a mix of 'new' and 'scheduled' states with more and more 'new', and from Jan 4 on it is only 'new' (and only for handler '1').

It sounds like all workflows assigned to handler1 never reach the 'scheduled' state, so their jobs are never created. I have 269 entries in the 'workflow_invocation' table with the 'new' state, and restarting the job handlers no longer has any effect (it still worked a few days ago).

How can I fix this?

Thanks for your help,
Charles
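For reference, the check described above can be reproduced with a query along these lines. This is a minimal sketch assuming a PostgreSQL backend; the `handler` and `create_time` column names are assumptions based on the description above and should be verified against the actual schema:

    -- Summarize workflow invocations per assigned handler and state
    -- ('handler' and 'create_time' column names are assumptions; check the schema first)
    SELECT handler, state, COUNT(*) AS n
    FROM workflow_invocation
    WHERE create_time > '2015-12-01'
    GROUP BY handler, state
    ORDER BY handler, state;

A growing 'new' count for one handler alongside a flat 'scheduled' count would match the behaviour described above.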
I've been swamped with release-related things, but my intention is to dig deeply into this. This is a very serious bug.

A workaround is to disable beta workflow scheduling for now: switch

force_beta_workflow_scheduled_min_steps=1
force_beta_workflow_scheduled_for_collections=True

to

force_beta_workflow_scheduled_min_steps=250
force_beta_workflow_scheduled_for_collections=False

Your old workflows wouldn't run, but new ones wouldn't have any problems. This would decrease the size of the workflow you could easily run, though.

The forthcoming 16.01 has a great number of enhancements to the beta workflow scheduling - improved state tracking, improved logging of problems, many optimizations to the scheduling process - any of these could help with the problem. usegalaxy.org will start running the release_16.01 branch (which already exists) on Monday - it might be worth upgrading to that shortly after.

My best guess about what is happening is that some workflow that got scheduled is causing an exception that makes workflow scheduling stop. I think logging of this might be absent prior to 16.01 due to a huge oversight on my part. Could you restart whatever handler is running workflows and send me the first 5 minutes' worth of Galaxy logs for that process? It might help me figure out what is happening specifically.
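To help narrow down which invocation might be triggering such an exception, queries along these lines can show when scheduling last succeeded and which invocations have been stuck the longest. Again, a sketch for a PostgreSQL backend: the `update_time`, `create_time`, `workflow_id`, and `handler` column names are assumptions to verify against the schema.

    -- When did each handler last move an invocation to 'scheduled'?
    SELECT handler, MAX(update_time) AS last_scheduled
    FROM workflow_invocation
    WHERE state = 'scheduled'
    GROUP BY handler;

    -- Oldest invocations still stuck in 'new' (likely the first ones affected)
    SELECT id, workflow_id, handler, create_time
    FROM workflow_invocation
    WHERE state = 'new'
    ORDER BY create_time
    LIMIT 10;

The workflow referenced by the oldest stuck invocation is a reasonable first suspect for the exception hypothesized above.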
participants (3)
- Charles Girardot
- Jelle Scholtalbers
- John Chilton