Re: [galaxy-dev] resubmission on out of memory

21 Sep 2017

Something like this is possible, with some caveats. Memory and walltime
errors can be detected - not by regex matching in tools, but by the job
runner. The SLURM runner implements detection of out-of-memory errors
and timeouts, I think - I don't think most of the other runners do.

When I started hacking on this feature, there was no documentation for
it and I wanted to understand how it worked and verify that it worked
so I wrote a test case. The problem is the test case tests a bunch of
different features all at once - so it will be a lot to walk through
and you will need to understand dynamic job destinations and such:

https://github.com/galaxyproject/galaxy/blob/dev/test/integration/resubmissi...
https://github.com/galaxyproject/galaxy/commit/0559cff6e94b250ddd98275b119ab...

That said, let me see if I can come up with a simple example:

<job_conf>
  <plugins>
    <!-- set up a SLURM runner, or update another runner to detect these
         conditions, and configure it here -->
  </plugins>

  <destinations default="small_fast_host">
    <destination id="small_fast_host" runner="slurm">
      <param name="native_specification">SHORT_WALLTIME_SMALL_MEMORY_OPTS_FOR_YOUR_CLUSTER</param>
      <resubmit condition="walltime_reached" destination="longer_walltime_dest" />
      <resubmit condition="memory_limit_reached" destination="bigger_memory_dest" />
      <resubmit condition="seconds_running &lt; 5 and attempt &lt; 3" delay="attempt * 1.5" destination="small_fast_host" />
    </destination>
    <destination id="longer_walltime_dest" runner="slurm">
      <param name="native_specification">LONGER_WALLTIME_FOR_YOUR_CLUSTERS</param>
    </destination>
    <destination id="bigger_memory_dest" runner="slurm">
      <param name="native_specification">BIGGER_MEMORY_FOR_YOUR_CLUSTERS</param>
    </destination>
  </destinations>

  <tools />
</job_conf>

Here you would fill in native_specifications for your various runners
to redirect jobs as needed. Everything goes through an initial
destination (though you could parameterize this and have any number of
initial destinations). That destination is going to resubmit under 3
different conditions. If a walltime error is detected by the job
runner, it will resubmit to a destination that you have to configure
with a longer walltime (with id="longer_walltime_dest") - perhaps this
is a different cluster with longer wait times and correspondingly longer
walltimes. Likewise, if a memory error is detected, it will resubmit
to "bigger_memory_dest" (perhaps a special part of your cluster with
larger memory servers or a large shared memory machine). Finally, to
show off some coolness I added - if the job fails right away (within
the first 5 seconds), it will delay the job a bit and then resubmit it
to the same destination, for as long as the attempt count in the
condition stays below 3. This may be good at working around random
cluster failures during submission if things get busy.
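
As a rough illustration of filling in those placeholders, a SLURM setup
might look something like the sketch below. The plugin line loads
Galaxy's SLURM runner; the walltime and memory numbers are made-up
values for illustration only, and the native specification options your
cluster actually accepts may differ.

```xml
<plugins>
  <plugin id="slurm" type="runner" load="galaxy.jobs.runners.slurm:SlurmJobRunner" />
</plugins>

<destinations default="small_fast_host">
  <destination id="small_fast_host" runner="slurm">
    <!-- assumed values: 10 minutes of walltime, 2048 MB of memory -->
    <param name="native_specification">--time=00:10:00 --mem=2048</param>
    <resubmit condition="walltime_reached" destination="longer_walltime_dest" />
    <resubmit condition="memory_limit_reached" destination="bigger_memory_dest" />
  </destination>
  <destination id="longer_walltime_dest" runner="slurm">
    <!-- assumed values: same memory, 24 hours of walltime -->
    <param name="native_specification">--time=24:00:00 --mem=2048</param>
  </destination>
  <destination id="bigger_memory_dest" runner="slurm">
    <!-- assumed values: same walltime, 32768 MB of memory -->
    <param name="native_specification">--time=00:10:00 --mem=32768</param>
  </destination>
</destinations>
```

Only the first destination carries resubmit rules here, so a job that
also fails on the fallback destination is not retried again.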

The test case covers allowing users to supply parameters to assist
with finding destinations and controlling resubmission, as well as
dynamic destinations and how they may interact with these concepts.

Like you mentioned - it would be wonderful if tools could look at
their output and determine if memory problems were encountered - I
guess this is tracked here
(https://github.com/galaxyproject/galaxy/issues/3107). It is a medium
priority for me - so I may get to it at some point. This sort of thing
is important when scaling up analyses.

-John

On Tue, Sep 19, 2017 at 4:26 PM, Matthias Bernt <m.bernt@ufz.de> wrote:
...
Dear list,
I recall that it's possible to configure a tool such that out of memory
conditions (and run time) can be recognized (by regexp matching on
stdout/stderr). Can this be used to trigger job resubmission on the
cluster?
Could someone please point me to some kind of documentation, if this is the
case?
Best,
Matthias
--
-------------------------------------------
Matthias Bernt
Bioinformatics Service
Molekulare Systembiologie (MOLSYB)
Helmholtz-Zentrum für Umweltforschung GmbH - UFZ/
Helmholtz Centre for Environmental Research GmbH - UFZ
Permoserstraße 15, 04318 Leipzig, Germany
Phone +49 341 235 482296,
m.bernt@ufz.de, www.ufz.de
Sitz der Gesellschaft/Registered Office: Leipzig
Registergericht/Registration Office: Amtsgericht Leipzig
Handelsregister Nr./Trade Register Nr.: B 4703
Vorsitzender des Aufsichtsrats/Chairman of the Supervisory Board: MinDirig
Wilfried Kraus
Wissenschaftlicher Geschäftsführer/Scientific Managing Director:
Prof. Dr. Dr. h.c. Georg Teutsch
Administrative Geschäftsführerin/ Administrative Managing Director:
Prof. Dr. Heike Graßmann
-------------------------------------------

John Chilton