galaxy/cloudman failure handling
Dear list,

I would like to start using Amazon's spot pricing model for my Galaxy/Cloudman instances (e.g. an on-demand master node with spot instance worker nodes). However, this means that whenever the spot price rises above my price limit, Amazon will shut down my instances without notice.

I was therefore looking for fault tolerance mechanisms in the Galaxy project, which I seem to remember existed. Somehow I can't find anything about it right now though.

I've tested a little bit, and it seems that as soon as one reboots instances or manually kills a job or task, the whole job is deleted and set to error state. I am not that knowledgeable in cluster computing, so I don't really know what handles what here, but this would be an ideal starting point to learn something about SGE and queue handling. Is there any mechanism in place that deals with node failure, network problems, etc.? If not, would it be hard to implement?

cheers,
jorrit boekel
On Nov 12, 2012, at 10:23 AM, Jorrit Boekel <jorrit.boekel@scilifelab.se> wrote:
I was therefore looking for fault tolerance mechanisms in the galaxy project, which I seem to remember existed. Somehow I can't find anything about it right now though.
I've tested a little bit, and it seems that as soon as one reboots instances or manually kills a job or task, the whole job is deleted and set to error state. I am not that knowledgeable in cluster computing, so I don't really know what handles what here, but this would be an ideal starting point to learn something about SGE and queue handling. Is there any mechanism in place that deals with node failure, network problems, etc? If not, would it be hard to implement?
You're correct that currently jobs will be set to error and need to be manually rerun by the Galaxy user. There isn't anything in place for automatic retry after spot instance failure, but this is definitely something we plan to implement in the near term - a generalized retry and resume mechanism will be useful for both cloud and local instances. -Dannon
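For readers wondering what "automatic retry" would mean in practice, here is a minimal, generic sketch in Python. It is not Galaxy or CloudMan code, and it does not talk to SGE: `NodeLostError` and `run_with_retry` are hypothetical names invented for this illustration, and the job is assumed to be a plain callable that raises `NodeLostError` when its worker node disappears.

```python
import time


class NodeLostError(Exception):
    """Hypothetical error: the worker node running the job went away."""


def run_with_retry(job, max_attempts=3, delay_seconds=0.0):
    """Call `job` (a zero-argument callable) and resubmit it on node loss.

    Only NodeLostError triggers a retry; any other exception is treated as
    a genuine job failure and propagates immediately.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except NodeLostError as err:
            last_error = err
            # Wait before resubmitting, unless this was the final attempt.
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(
        "job failed after %d attempts" % max_attempts
    ) from last_error


# Usage: a simulated job whose node is lost twice before it completes.
attempts = {"count": 0}


def flaky_job():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise NodeLostError("spot instance terminated")
    return "done"


result = run_with_retry(flaky_job, max_attempts=5)
```

Retrying only on a dedicated error type keeps real tool failures from being rerun in a loop. A resume mechanism (continuing from partial output rather than starting over) would need more than this sketch covers.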
participants (2): Dannon Baker, Jorrit Boekel