Unable to queue job, resubmitting might help
Hi,

Today I ran into a cluster error on our local instance, which runs the latest galaxy-dist with Torque/PBS through the python-pbs binding. While the Galaxy process was under heavy load, the handler processes appear to have failed to contact the pbs_server, although the pbs_server was still up and running. After that, many statements like the following kept appearing in the handler.log file:

galaxy.jobs.runners.pbs DEBUG 2012-07-11 17:39:06,649 (11647/12788.pbs_master_address) Skipping state check because PBS server connection failed

After restarting the Galaxy process (run.sh), everything worked again, with no changes to the pbs_server.

Would it be possible to set up some checks for this failure? For example:
- contact the system admin
- restart Galaxy
- automatically retry job submission after a while, so as not to crash workflows

Best regards,
Geert Vandeweyer
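
As a rough illustration of the auto-retry idea above, here is a minimal Python sketch of retrying a submission with exponential backoff. It is not Galaxy's PBS runner code: `submit_with_retry`, `SubmitError`, and the `submit` callable are hypothetical names that stand in for whatever actually contacts the pbs_server.

```python
import time


class SubmitError(Exception):
    """Hypothetical error: the scheduler could not be reached."""


def submit_with_retry(submit, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call submit() until it succeeds or max_attempts is exhausted.

    `submit` is an assumed zero-argument callable that returns a job id
    or raises SubmitError. The wait doubles after each failed attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except SubmitError:
            if attempt == max_attempts:
                # Out of attempts: re-raise so the caller can alert an admin.
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

With the defaults, a submission that keeps failing is retried after 1, 2, 4, and 8 seconds before the error is passed on, which would give a briefly unreachable server time to recover without failing the whole workflow.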