Unable to queue job, resubmitting might help
Hi,

Today I ran into a cluster error on our local instance, which runs the latest galaxy-dist with torque/pbs and the python-pbs binding.

Under heavy load of the Galaxy process, the handler processes failed to contact the pbs_server, although the pbs_server itself was still up and running. After that, statements like the following kept appearing in handler.log:

galaxy.jobs.runners.pbs DEBUG 2012-07-11 17:39:06,649 (11647/12788.pbs_master_address) Skipping state check because PBS server connection failed

After restarting the Galaxy process (run.sh), everything worked again, with no changes to the pbs_server.

Would it be possible to set up some checks for this failure? For example:
- contact the system admin
- restart Galaxy
- automatically retry job submission after a while, so that workflows do not fail

best regards,

Geert Vandeweyer

--
Geert Vandeweyer, Ph.D.
Department of Medical Genetics
University of Antwerp
Prins Boudewijnlaan 43
2650 Edegem
Belgium
Tel: +32 (0)3 275 97 56
E-mail: geert.vandeweyer@ua.ac.be
http://ua.ac.be/cognitivegenetics
http://www.linkedin.com/pub/geert-vandeweyer/26/457/726
We have been experiencing this issue lately too, since downgrading the Galaxy web processes from a machine with 128 GB of RAM to a VM with 4 GB. There is a memory leak somewhere in the PBS code; it has been mentioned on the list before, but no one seems to have found the cause or how to prevent it. The thing to do seems to be to monitor the memory usage of that process and restart it when it reaches a certain threshold. Someone posted that they are doing this with monit (mmonit.com/monit/); locally we are trying to set up a cron job with a simple script that does the same thing.

-John
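A minimal sketch of the kind of cron-driven watchdog described above, assuming the monitored process writes a pidfile; the pidfile path, memory threshold, and restart command are site-specific placeholders, not anything Galaxy provides:

```python
#!/usr/bin/env python
# Rough watchdog sketch: if the monitored Galaxy process grows past a
# resident-memory threshold, kill it and start Galaxy again.  The paths,
# threshold, and restart command below are assumptions for illustration.
import subprocess

PIDFILE = "/home/galaxy/galaxy-dist/paster.pid"   # assumed pidfile location
THRESHOLD_KB = 3 * 1024 * 1024                    # restart above ~3 GB RSS
RESTART_CMD = ["sh", "/home/galaxy/galaxy-dist/run.sh", "--daemon"]

def rss_kb(pid):
    """Resident set size of a process in kB, as reported by ps."""
    ps = subprocess.Popen(["ps", "-o", "rss=", "-p", str(pid)],
                          stdout=subprocess.PIPE)
    out, _ = ps.communicate()
    return int(out.strip())

def main():
    pid = int(open(PIDFILE).read().strip())
    if rss_kb(pid) > THRESHOLD_KB:
        # Stop the bloated process, then bring Galaxy back up.
        subprocess.call(["kill", str(pid)])
        subprocess.call(RESTART_CMD)

if __name__ == "__main__":
    main()
```

Run from cron every few minutes; monit can express the same check declaratively if you prefer not to maintain a script.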
Hi Geert,

It would be useful to retry submission rather than fail. I doubt we'll get to it soon, but we would welcome any contributions that did this. Is restarting Galaxy absolutely necessary, or does job submission begin to succeed again once the load goes down?

--nate
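Not a patch, but a rough sketch of what that retry could look like, wrapped around the python-pbs calls the runner already uses; the attempt count, the delay, and the assumption that pbs_connect returns a value <= 0 on failure are illustrative choices, not taken from the Galaxy code:

```python
# Sketch of retry-instead-of-fail around the python-pbs submit calls.
# job_attrs, job_file and queue stand in for whatever the caller builds.
import time
import pbs  # python-pbs binding

def submit_with_retry(server, job_attrs, job_file, queue,
                      attempts=5, delay=60):
    """Try to submit a PBS job, waiting and retrying when the server
    cannot be contacted instead of failing the job outright."""
    for _ in range(attempts):
        conn = pbs.pbs_connect(server)
        if conn <= 0:
            # Assumed: pbs_connect returns <= 0 when the connection fails,
            # e.g. while the server is too busy to answer.
            time.sleep(delay)
            continue
        try:
            job_id = pbs.pbs_submit(conn, job_attrs, job_file, queue, None)
        finally:
            pbs.pbs_disconnect(conn)
        if job_id:
            return job_id
        time.sleep(delay)
    raise Exception("PBS submission failed after %d attempts" % attempts)
```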
Hi Nate,

I kept getting the errors until I restarted Galaxy completely, and I could still submit jobs to the HPC queue from other programs during that time. I'm not very familiar with Python, but if you have pointers on where to start, I might be able to contribute. If it were possible to restart, for example, just the handlers, that might be enough.

best regards,

Geert
participants (3)
- Geert Vandeweyer
- John Chilton
- Nate Coraor