Dear list,
I would like to start using Amazon's spot pricing model for my
Galaxy/Cloudman instances (e.g. an on demand master node with spot instance
worker nodes). However, this means that whenever the spot price rises above
my set price limit, Amazon will shut down my instances without notice.
I was therefore looking for fault tolerance mechanisms in the Galaxy
project, which I seem to remember existed. Somehow I can't find anything
about it right now though.
I've tested a little, and it seems that as soon as one reboots an
instance or manually kills a job or task, the whole job is deleted and set
to the error state. I am not that knowledgeable in cluster computing, so I
don't really know what handles what here, but this would be an ideal
starting point to learn something about SGE and queue handling. Is there
any mechanism in place that deals with node failure, network problems, etc.?
If not, would it be hard to implement?
cheers,
jorrit boekel