Dear list,

I would like to start using Amazon's spot pricing model for my Galaxy/CloudMan instances (e.g. an on-demand master node with spot-instance worker nodes). However, this means that whenever the spot price rises above my price limit, Amazon will shut down my instances without notice.

I was therefore looking for fault tolerance mechanisms in the Galaxy project, which I seem to remember existed. Somehow I can't find anything about them right now, though.

I've tested a little, and it seems that as soon as one reboots an instance or manually kills a job or task, the whole job is deleted and set to an error state. I am not that knowledgeable in cluster computing, so I don't really know what handles what here, but this would be an ideal starting point to learn something about SGE and queue handling. Is there any mechanism in place that deals with node failure, network problems, etc.? If not, would it be hard to implement?

cheers,
jorrit boekel