Hi,

Today I ran into a cluster error on our local instance, which runs the latest galaxy-dist with Torque/PBS through the python-pbs binding. While the Galaxy process was under heavy load, the handler processes apparently failed to contact the pbs_server, although the pbs_server itself was still up and running. After that, the following statement kept appearing in the handler.log file:

galaxy.jobs.runners.pbs DEBUG 2012-07-11 17:39:06,649 (11647/12788.pbs_master_address) Skipping state check because PBS server connection failed

After restarting the Galaxy process (run.sh), everything worked again, with no changes to the pbs_server.

Would it be possible to set up some checks for this failure? For example:
- contact the system admin
- restart Galaxy
- automatically retry job submission after a while, so that workflows do not crash

Best regards,

Geert Vandeweyer

-- 
Geert Vandeweyer, Ph.D.
Department of Medical Genetics
University of Antwerp
Prins Boudewijnlaan 43
2650 Edegem
Belgium
Tel: +32 (0)3 275 97 56
E-mail: geert.vandeweyer@ua.ac.be
http://ua.ac.be/cognitivegenetics
http://www.linkedin.com/pub/geert-vandeweyer/26/457/726