Hi all,

On our installation (v15.07) we have suddenly started seeing one of our two job handlers get stuck with a high CPU load. Its last log message is generally `cleaning up external metadata files`, and no new messages appear after that. In addition, when running workflows in batch (>6 at once), only a few of them (~3) get their workflow steps/jobs scheduled (LSF via DRMAA). For the remaining 3, new histories are created but stay empty (according to the GUI). Only after both job handlers are restarted do the remaining workflow steps get scheduled and shown in the history.

First question: how do we resolve this issue?
Second: how does this actually work? How are the workflow steps stored in the database, i.e. why are they not shown in the web interface until they have been processed by a handler?

Possibly relevant config settings:
[server:handler0]
use_threadpool = true
threadpool_workers = 5

[server:handler1]
use_threadpool = true
threadpool_workers = 5

[app:main]
force_beta_workflow_scheduled_min_steps = 1
force_beta_workflow_scheduled_for_collections = True
track_jobs_in_database = True
enable_job_recovery = True
retry_metadata_internally = False
cache_user_job_count = True # only a limit set for the very few local tools like upload

Cheers,

Jelle