On Nov 12, 2012, at 10:23 AM, Jorrit Boekel <jorrit.boekel@scilifelab.se> wrote:
I was therefore looking for fault tolerance mechanisms in the Galaxy project, which I seem to remember existed. Somehow I can't find anything about them right now, though.
I've tested a little bit, and it seems that as soon as one reboots instances or manually kills a job or task, the whole job is deleted and set to error state. I am not that knowledgeable in cluster computing, so I don't really know what handles what here, but this would be an ideal starting point to learn something about SGE and queue handling. Is there any mechanism in place that deals with node failure, network problems, etc.? If not, would it be hard to implement?
You're correct that jobs are currently set to error and need to be manually rerun by the Galaxy user. There isn't anything in place for automatic retry after spot instance failure, but this is definitely something we plan to implement in the near term: a generalized retry and resume mechanism will be useful for both cloud and local instances.
-Dannon