Something like this is possible, with some caveats. Memory and walltime errors can be detected, but the detection is done by the job runner rather than by regex matching in tools. I believe the SLURM runner detects out-of-memory errors and timeouts; I don't think most of the other runners do.

When I started hacking on this feature there was no documentation for it, and I wanted to understand how it worked and verify that it worked, so I wrote a test case. The problem is that the test case exercises a bunch of different features all at once, so it will be a lot to walk through and you will need to understand dynamic job destinations and such:

https://github.com/galaxyproject/galaxy/blob/dev/test/integration/resubmissi...
https://github.com/galaxyproject/galaxy/commit/0559cff6e94b250ddd98275b119ab...

That said, let me see if I can come up with a simple example:

    <job_conf>
        <plugins>
            <!-- setup a slurm runner or update another runner to detect these conditions and set it up here -->
        </plugins>
        <destinations default="small_fast_host">
            <destination id="small_fast_host" runner="slurm">
                <param name="native_specification">SHORT_WALLTIME_SMALL_MEMORY_OPTS_FOR_YOUR_CLUSTER</param>
                <resubmit condition="walltime_reached" destination="longer_walltime_dest" />
                <resubmit condition="memory_limit_reached" destination="bigger_memory_dest" />
                <!-- "<" has to be escaped as &lt; inside an XML attribute value -->
                <resubmit condition="seconds_running &lt; 5 and attempts &lt; 3" delay="attempt * 1.5" destination="small_fast_host" />
            </destination>
            <destination id="longer_walltime_dest" runner="slurm">
                <param name="native_specification">LONGER_WALLTIME_FOR_YOUR_CLUSTERS</param>
            </destination>
            <destination id="bigger_memory_dest" runner="slurm">
                <param name="native_specification">BIGGER_MEMORY_FOR_YOUR_CLUSTERS</param>
            </destination>
        </destinations>
        <tools />
    </job_conf>

Here you would fill in the native_specification values for your cluster to redirect jobs as needed.
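To make the rule ordering concrete, here is a rough Python sketch of how the resubmit rules in a config like the one above could be read and matched against a failed job. This is only an illustration, not Galaxy's actual implementation: the function names load_resubmit_rules and pick_resubmit and the state dictionary are made up for this example, and the sketch assumes rules are tried in document order with the first match winning. Because the example config uses both the spellings "attempts" and "attempt", the sketch simply defines both names.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example config (native_specification params omitted).
# "<" inside an attribute value is written as "&lt;" so the XML is well-formed.
JOB_CONF = """
<job_conf>
    <destinations default="small_fast_host">
        <destination id="small_fast_host" runner="slurm">
            <resubmit condition="walltime_reached" destination="longer_walltime_dest" />
            <resubmit condition="memory_limit_reached" destination="bigger_memory_dest" />
            <resubmit condition="seconds_running &lt; 5 and attempts &lt; 3"
                      delay="attempt * 1.5" destination="small_fast_host" />
        </destination>
        <destination id="longer_walltime_dest" runner="slurm" />
        <destination id="bigger_memory_dest" runner="slurm" />
    </destinations>
</job_conf>
"""


def load_resubmit_rules(xml_text, destination_id):
    """Return the <resubmit> rules of one destination as a list of dicts."""
    root = ET.fromstring(xml_text)
    for dest in root.iter("destination"):
        if dest.get("id") == destination_id:
            return [dict(rule.attrib) for rule in dest.findall("resubmit")]
    raise KeyError(destination_id)


def pick_resubmit(rules, state):
    """Return (destination, delay_seconds) for the first matching rule, else None.

    `state` is a hypothetical dict holding the names the expressions use.
    eval() with empty builtins is tolerable here only because the
    expressions come from an admin-written config file.
    """
    for rule in rules:
        if eval(rule["condition"], {"__builtins__": {}}, dict(state)):
            delay = eval(rule.get("delay", "0"), {"__builtins__": {}}, dict(state))
            return rule["destination"], float(delay)
    return None


if __name__ == "__main__":
    rules = load_resubmit_rules(JOB_CONF, "small_fast_host")
    # Hypothetical job state: failed after 2 seconds on its second attempt.
    state = {
        "walltime_reached": False,
        "memory_limit_reached": False,
        "seconds_running": 2,
        "attempt": 2,
        "attempts": 2,
    }
    print(pick_resubmit(rules, state))
```

With this reading, a job flagged with memory_limit_reached goes to bigger_memory_dest without delay, a job that dies within 5 seconds is sent back to small_fast_host after attempt * 1.5 seconds, and a job matching no rule is not resubmitted at all.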
Everything goes through an initial destination (though you could parameterize this and have any number of initial destinations). That destination is going to resubmit under 3 different conditions. If a walltime error is detected by the job runner, it will resubmit to a destination that you have to configure with a longer walltime (with id="longer_walltime_dest") - perhaps this is a different cluster with longer wait times and correspondingly longer walltimes. Likewise, if a memory error is detected, it will resubmit to "bigger_memory_dest" (perhaps a special part of your cluster with larger memory servers or a large shared memory machine). Finally, to show off some coolness I added: if the job fails right away (within the first 5 seconds), it will delay the job a bit and then retry the submission, as long as the attempts < 3 part of the condition still holds. This may be good at working around random cluster failures during submission if things get busy.

The test case covers allowing users to supply parameters to assist with finding destinations and controlling resubmission, as well as dynamic destinations and how they may interact with these concepts.

Like you mentioned, it would be wonderful if tools could look at their output and determine whether memory problems were encountered. I guess this is tracked here (https://github.com/galaxyproject/galaxy/issues/3107). It is a medium priority for me, so I may get to it at some point. This sort of thing is important when scaling up analyses.

-John

On Tue, Sep 19, 2017 at 4:26 PM, Matthias Bernt <m.bernt@ufz.de> wrote:
Dear list,
I recall that it's possible to configure a tool such that out-of-memory conditions (and run time) can be recognized (by regexp matching on stdout/stderr). Can this be used to trigger job resubmission on the cluster?
Could someone please point me to some kind of documentation, if this is the case?
Best, Matthias
--
Matthias Bernt
Bioinformatics Service
Molekulare Systembiologie (MOLSYB)
Helmholtz-Zentrum für Umweltforschung GmbH - UFZ / Helmholtz Centre for Environmental Research GmbH - UFZ
Permoserstraße 15, 04318 Leipzig, Germany
Phone +49 341 235 482296, m.bernt@ufz.de, www.ufz.de