Unable to queue job, resubmitting might help

11 Jul 2012

Hi,

Today I ran into a cluster error on our local instance, which runs the 
latest galaxy-dist with Torque/PBS through the python-pbs binding.

Under heavy load on the Galaxy process, the handler processes appear to 
have failed to contact the pbs_server, although the pbs_server was still 
up and running. After that, the following statement kept appearing in 
the handler.log file:

galaxy.jobs.runners.pbs DEBUG 2012-07-11 17:39:06,649 
(11647/12788.pbs_master_address) Skipping state check because PBS server 
connection failed

After restarting the galaxy process (run.sh), everything worked again, 
with no changes to the pbs_server.

Would it be possible to set up some checks for this failure? For example:
  - contact the system admin
  - restart Galaxy
  - automatically retry job submission after a while, so that workflows 
    do not crash.
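
To make the retry suggestion concrete, here is a rough sketch of what I have in mind. It is not taken from the Galaxy code base: `retry_submission`, `submit`, `notify_admin` and `SubmissionFailed` are hypothetical names used only for illustration, and `submit` stands in for whatever call actually hands the job to the pbs_server.

```python
import time


class SubmissionFailed(Exception):
    """Raised when every submission attempt has failed."""


def retry_submission(submit, max_attempts=5, initial_delay=1.0, backoff=2.0,
                     notify_admin=None, sleep=time.sleep):
    """Call submit() until it succeeds or max_attempts is reached.

    submit       -- hypothetical callable that queues the job and returns its id
    notify_admin -- optional hypothetical callable that receives an error message
    sleep        -- injectable so the waiting can be replaced in tests
    """
    delay = initial_delay
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except Exception as exc:
            last_error = exc
            if attempt == max_attempts:
                break
            # Wait before the next attempt, increasing the delay each time.
            sleep(delay)
            delay *= backoff
    # All attempts failed: tell the admin (if configured) and give up.
    if notify_admin is not None:
        notify_admin("Job submission failed after %d attempts: %s"
                     % (max_attempts, last_error))
    raise SubmissionFailed(str(last_error))
```

With something like this, a short pbs_server connection hiccup would delay the job instead of failing it, and the admin would only be contacted once the retries are exhausted.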

Best regards,

Geert Vandeweyer
