Dear list,
I was thinking about implementing the job resubmission feature for drmaa.
I hope that I can simplify the job configuration for our installation (and probably others as well) by escalating through different queues (or resource limits). That way I hope to reduce the number of special cases that I need to take care of.
I was wondering if there are others
- who are also interested in this feature and want to join. I will try to give this project a head start next week.
- who may have started working on this feature, or have just started thinking about it, and want to share code or experience.
Best, Matthias
We've done a lot of work in Galaxy dev on this problem over the last few years - I'm not sure how much concrete progress we have made.
Nate started it and I did some work at the end of last year. Just to summarize my most recent work on this - in https://github.com/galaxyproject/galaxy/pull/3291/commits/b78287f1508db2c06f... I added some test cases for the existing job runner resubmission stuff - it was mostly my way of making sense of what was there - hopefully the examples in the form of test cases help you as well. This includes a little test job_conf.xml file that describes how you can catch job walltime and memory limit hits registered by the job runner and send jobs to different destinations. This requires that the job runner knows how to record these problems - which the SLURM job runner does - other job runners like the generic drmaa runner may need to be subclassed to check for these things in general.
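To make the idea of sending jobs to different destinations concrete, here is a minimal Python sketch of the routing decision. It is not Galaxy's implementation: the destination ids, the failure labels, and the function name next_destination are all hypothetical and only illustrate escalating through queues.

```python
from typing import Optional

# ASSUMPTION: destination ids and failure labels are invented for illustration;
# a real setup would take them from job_conf.xml and from the job runner.
RESUBMIT_RULES = {
    ("short_queue", "walltime_reached"): "long_queue",
    ("short_queue", "memory_limit_reached"): "highmem_queue",
    ("long_queue", "memory_limit_reached"): "highmem_queue",
}


def next_destination(current: str, failure: str) -> Optional[str]:
    """Return the destination to resubmit to, or None if no rule applies."""
    return RESUBMIT_RULES.get((current, failure))
```

A job that hits the walltime limit on short_queue would be escalated to long_queue; when no rule matches, the caller would let the job fail as usual.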
In https://github.com/galaxyproject/galaxy/pull/3291/commits/7d52b28ab2ab0314cd... I created a little DSL for resubmissions to make what can be expressed in job_conf more powerful. Then I added variables to the expression language such as seconds_since_queued, seconds_running (https://github.com/galaxyproject/galaxy/pull/3291/commits/18eb1c8d0e4c3f7616...), and attempt number (https://github.com/galaxyproject/galaxy/pull/3291/commits/7e338d790964f594ae...). I also added the ability to resubmit on unknown job runner problems here (https://github.com/galaxyproject/galaxy/pull/3291/commits/0559cff6e94b250ddd...).
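For readers who have not looked at the commits, the following Python sketch shows roughly what evaluating such a condition could look like. It is an independent illustration, not the code from the pull request: the function evaluate_condition, the variable name attempt, and the supported syntax (comparisons joined by and/or) are assumptions.

```python
import ast
import operator

# Comparison operators this sketch allows; everything else is rejected.
_COMPARE = {
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
}


def _eval(node, variables):
    if isinstance(node, ast.BoolOp):
        values = [_eval(v, variables) for v in node.values]
        return all(values) if isinstance(node.op, ast.And) else any(values)
    if isinstance(node, ast.Compare) and len(node.ops) == 1:
        compare = _COMPARE.get(type(node.ops[0]))
        if compare is not None:
            left = _eval(node.left, variables)
            right = _eval(node.comparators[0], variables)
            return compare(left, right)
    if isinstance(node, ast.Name) and node.id in variables:
        return variables[node.id]
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    raise ValueError("unsupported expression: %s" % ast.dump(node))


def evaluate_condition(expression, variables):
    """Evaluate a restricted boolean expression against job variables."""
    tree = ast.parse(expression, mode="eval")
    return bool(_eval(tree.body, variables))
```

With variables such as {"attempt": 1, "seconds_running": 4000}, the condition "attempt < 3 and seconds_running > 3600" would hold. Function calls and attribute access are rejected, so a condition string cannot run arbitrary code.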
None of this is really documented outside the test cases - it is waiting for someone to come along and find it useful.
I think the next thing I'd like to see for job resubmission, besides documentation and more job runner support for common runners, is described in this issue (https://github.com/galaxyproject/galaxy/issues/3320) - all the existing resubmission logic is based on errors detected by job runners - if the underlying error shows up as a tool failure instead, we need a way to reason about that, and currently we cannot.
Hope this helps.
-John
On Thu, Jun 15, 2017 at 10:37 AM, Matthias Bernt m.bernt@ufz.de wrote:
[...]
___________________________________________________________
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: https://lists.galaxyproject.org/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/
Dear John,
thanks a lot for all the information. I guess I will need some time to dig into this.
For drmaa, the wait() function of the Python library seems to return quite a bit of useful information: hasExited, hasCoreDump, hasSignal, and terminatedSignal. I guess this would be of help.
The problem seems to be that when the external run script is used, jobs cannot be queried properly (see my other post). But I did not understand this completely.
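As a rough sketch of how these fields might be turned into a failure reason, consider the Python below. To keep it runnable without a cluster, JobInfo is a local stand-in that mirrors the fields of the tuple returned by drmaa-python's Session.wait(). The function classify, the reason labels, and above all the signal-to-reason mapping are assumptions: which signal a scheduler sends for a walltime or memory violation is site specific.

```python
from collections import namedtuple

# Stand-in with the same field names as drmaa-python's JobInfo.
JobInfo = namedtuple(
    "JobInfo",
    "jobId hasExited hasSignal terminatedSignal hasCoreDump "
    "wasAborted exitStatus resourceUsage",
)

# ASSUMPTION: this mapping is purely illustrative; check what your
# scheduler actually sends before relying on anything like it.
SIGNAL_REASONS = {
    "SIGXCPU": "walltime_reached",
    "SIGKILL": "memory_limit_reached",
}


def classify(info, signal_reasons=SIGNAL_REASONS):
    """Map a JobInfo-like tuple to a coarse, hypothetical failure label."""
    if info.wasAborted:
        return "aborted"
    if info.hasSignal:
        return signal_reasons.get(info.terminatedSignal, "unknown_signal")
    if info.hasExited:
        return "ok" if info.exitStatus == 0 else "tool_failure"
    return "unknown"
```

With a real session, the tuple would come from something like session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER); the exact format of terminatedSignal may differ between DRMAA implementations.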
Cheers, Matthias
On 15.06.2017 19:05, John Chilton wrote:
[...]
galaxy-dev@lists.galaxyproject.org