We have been experiencing this issue lately too, since downgrading the Galaxy web processes from a machine with 128 GB of RAM to a VM with 4 GB. There is a memory leak somewhere in the pbs code; it has been mentioned on the list before, but no one seems to have found the cause or a way to prevent it. The practical workaround seems to be to monitor the memory usage of that process and restart it once it crosses a certain threshold. Someone has posted that they are doing this with monit (mmonit.com/monit/); locally we are trying to set up a cron job with a simple script that does the same thing.
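Roughly what we have in mind for that script (untested sketch; the pid file path, the 512 MB threshold, and the restart commands are placeholders for whatever your local setup uses):

import subprocess

PID_FILE = '/home/galaxy/galaxy-dist/main.pid'   # hypothetical location
THRESHOLD_KB = 512 * 1024                        # restart above ~512 MB resident

def rss_kb(pid):
    # On Linux, /proc/<pid>/status reports resident memory as "VmRSS: <n> kB".
    for line in open('/proc/%d/status' % pid):
        if line.startswith('VmRSS:'):
            return int(line.split()[1])
    return 0

pid = int(open(PID_FILE).read().strip())
if rss_kb(pid) > THRESHOLD_KB:
    # run.sh supports --stop-daemon/--daemon when Galaxy runs as a daemon.
    subprocess.call(['sh', 'run.sh', '--stop-daemon'], cwd='/home/galaxy/galaxy-dist')
    subprocess.call(['sh', 'run.sh', '--daemon'], cwd='/home/galaxy/galaxy-dist')

monit can express the same thing declaratively (a "check process" rule with a memory limit and a restart action), which is probably the more robust option if you already run it.

-John

On Thu, Jul 12, 2012 at 2:03 AM, Geert Vandeweyer <geert.vandeweyer2@ua.ac.be> wrote: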
Hi,
Today I ran into a cluster error on our local instance, which runs the latest galaxy-dist against Torque/PBS through the python-pbs bindings.
Under heavy load on the Galaxy process, the handler processes apparently failed to contact the PBS server, even though pbs_server itself was still up and running. After that, the following statement kept appearing repeatedly in handler.log:
galaxy.jobs.runners.pbs DEBUG 2012-07-11 17:39:06,649 (11647/12788.pbs_master_address) Skipping state check because PBS server connection failed
After restarting the Galaxy process (run.sh), everything worked again, without any changes on the pbs_server side.
Would it be possible to set up some checks for this failure? For example:
- contact the system admin
- restart Galaxy
- automatically retry job submission after a while, so that workflows do not crash
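For the auto-retry part, something along these lines might work in the runner (untested sketch; it assumes only the python-pbs pbs_connect call, which returns a negative value when the connection fails, and the attempt count and delay are arbitrary):

import time
import pbs

def connect_with_retry(server, attempts=5, delay=60):
    # pbs_connect returns a connection handle >= 0 on success and a
    # negative value when the pbs_server cannot be reached.
    for attempt in range(attempts):
        conn = pbs.pbs_connect(server)
        if conn >= 0:
            return conn
        time.sleep(delay)
    raise RuntimeError('pbs_server %s unreachable after %d attempts'
                       % (server, attempts))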
best regards,
Geert Vandeweyer
--
Geert Vandeweyer, Ph.D.
Department of Medical Genetics
University of Antwerp
Prins Boudewijnlaan 43
2650 Edegem
Belgium
Tel: +32 (0)3 275 97 56
E-mail: geert.vandeweyer@ua.ac.be
http://ua.ac.be/cognitivegenetics
http://www.linkedin.com/pub/geert-vandeweyer/26/457/726