Unable to queue job, resubmitting might help
Hi,

Today I ran into a cluster error on our local instance, which runs the latest galaxy-dist with torque/pbs and the python-pbs binding.

Under heavy load of the Galaxy process, the handler processes failed to contact the pbs_server, although the pbs_server itself was still up and running. After that, statements like the following kept appearing in handler.log:

galaxy.jobs.runners.pbs DEBUG 2012-07-11 17:39:06,649 (11647/12788.pbs_master_address) Skipping state check because PBS server connection failed

After restarting the Galaxy process (run.sh), everything worked again, with no changes to the pbs_server.

Would it be possible to set up some checks for this failure? For example:
- contact the system admin
- restart Galaxy
- automatically retry job submission after a while, so that workflows do not fail

best regards,

Geert Vandeweyer

--
Geert Vandeweyer, Ph.D.
Department of Medical Genetics
University of Antwerp
Prins Boudewijnlaan 43
2650 Edegem
Belgium
Tel: +32 (0)3 275 97 56
E-mail: geert.vandeweyer@ua.ac.be
http://ua.ac.be/cognitivegenetics
http://www.linkedin.com/pub/geert-vandeweyer/26/457/726
We have been experiencing this issue lately too, since downgrading the Galaxy web processes from a machine with 128 GB of RAM to a VM with 4 GB. There is a memory leak somewhere in the PBS code; it has been mentioned on the list before, but no one seems to have found the cause or how to prevent it. The thing to do seems to be to monitor the memory usage of that process and restart it when it reaches a certain threshold. Someone posted that they are doing this with monit (mmonit.com/monit/); locally we are trying to set up a cron job with a simple script that does the same thing.

-John
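A minimal sketch of the kind of cron-driven watchdog described above, assuming the monitored process writes a pidfile; the pidfile path, memory threshold, and restart command are site-specific placeholders, not anything Galaxy provides:

```python
#!/usr/bin/env python
# Rough watchdog sketch: if the monitored Galaxy process grows past a
# resident-memory threshold, kill it and start Galaxy again.  The paths,
# threshold, and restart command below are assumptions for illustration.
import subprocess

PIDFILE = "/home/galaxy/galaxy-dist/paster.pid"   # assumed pidfile location
THRESHOLD_KB = 3 * 1024 * 1024                    # restart above ~3 GB RSS
RESTART_CMD = ["sh", "/home/galaxy/galaxy-dist/run.sh", "--daemon"]

def rss_kb(pid):
    """Resident set size of a process in kB, as reported by ps."""
    ps = subprocess.Popen(["ps", "-o", "rss=", "-p", str(pid)],
                          stdout=subprocess.PIPE)
    out, _ = ps.communicate()
    return int(out.strip())

def main():
    pid = int(open(PIDFILE).read().strip())
    if rss_kb(pid) > THRESHOLD_KB:
        # Stop the bloated process, then bring Galaxy back up.
        subprocess.call(["kill", str(pid)])
        subprocess.call(RESTART_CMD)

if __name__ == "__main__":
    main()
```

Run from cron every few minutes; monit can express the same check declaratively if you prefer not to maintain a script.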
Hi Geert,

It would be useful to retry submission rather than fail. I doubt we'll get to it soon, but we would welcome any contributions that did this. Is restarting Galaxy absolutely necessary, or does job submission begin to succeed again once the load goes down?

--nate
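Not a patch, but a rough sketch of what that retry could look like, wrapped around the python-pbs calls the runner already uses; the attempt count, the delay, and the assumption that pbs_connect returns a value <= 0 on failure are illustrative choices, not taken from the Galaxy code:

```python
# Sketch of retry-instead-of-fail around the python-pbs submit calls.
# job_attrs, job_file and queue stand in for whatever the caller builds.
import time
import pbs  # python-pbs binding

def submit_with_retry(server, job_attrs, job_file, queue,
                      attempts=5, delay=60):
    """Try to submit a PBS job, waiting and retrying when the server
    cannot be contacted instead of failing the job outright."""
    for _ in range(attempts):
        conn = pbs.pbs_connect(server)
        if conn <= 0:
            # Assumed: pbs_connect returns <= 0 when the connection fails,
            # e.g. while the server is too busy to answer.
            time.sleep(delay)
            continue
        try:
            job_id = pbs.pbs_submit(conn, job_attrs, job_file, queue, None)
        finally:
            pbs.pbs_disconnect(conn)
        if job_id:
            return job_id
        time.sleep(delay)
    raise Exception("PBS submission failed after %d attempts" % attempts)
```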
Hi Nate,

I kept getting the errors until I restarted Galaxy completely, and I could still submit jobs to the HPC queue from other programs during that time. I'm not very familiar with Python, but if you have pointers on where to start, I might be able to contribute. If it were possible to restart, for example, just the handlers, that might be enough.

best regards,

Geert
participants (3)
- Geert Vandeweyer
- John Chilton
- Nate Coraor