Hi,
Today I ran into a cluster error on our local instance using latest galaxy-dist and torque/pbs with the python-pbs binding.
Under heavy load of the galaxy process, it appears that the handler processes failed to contact the pbs-server, although the pbs_server was still up and running. after that, a lot of the following statements kept appearing in the handler.log file:
galaxy.jobs.runners.pbs DEBUG 2012-07-11 17:39:06,649 (11647/12788.pbs_master_address) Skipping state check because PBS server connection failed
After restarting the galaxy process (run.sh), everything worked again, with no changes to the pbs_server.
Would it be possible to setup some checks for this failure? Like: - contact system admin - restart galaxy - auto retry job submission after a while as to not crash workflows.
best regards,
Geert Vandeweyer
galaxy-dev@lists.galaxyproject.org