Hi all,
On our installation (v15.07) we suddenly see that one of our two job
handlers gets stuck with a high CPU load and stops logging new messages
(the last message is generally `cleaning up external metadata files`).
In addition, when running workflows in batch (>6x), only a few of
them (~3) get their workflow steps/jobs scheduled (LSF-DRMAA). For the
remaining ones, new histories are created but stay empty (according
to the GUI). Only after we restart the two job handlers are the remaining
workflow steps scheduled and shown in the history.
First question: how do we resolve this issue?
Second question: how does this actually work? How are the workflow steps
stored in the database, i.e. why are they not shown in the web interface
until they are processed by a handler?
Possibly relevant config settings:
[server:handler0]
use_threadpool = true
threadpool_workers = 5
[server:handler1]
use_threadpool = true
threadpool_workers = 5
[app:main]
force_beta_workflow_scheduled_min_steps = 1
force_beta_workflow_scheduled_for_collections = True
track_jobs_in_database = True
enable_job_recovery = True
retry_metadata_internally = False
cache_user_job_count = True # only a limit set for the very few local tools like upload
Cheers,
Jelle