
On May 20, 2013, at 9:45 AM, Nate Coraor <nate@bx.psu.edu> wrote:
On May 19, 2013, at 11:41 PM, Fields, Christopher J wrote:
I've been seeing the following error quite often recently (we're using Torque 3.0.5). It shows up in Galaxy as a generic 'cluster failure' error on jobs:
galaxy.jobs.runners.pbs WARNING 2013-05-19 18:56:44,073 (10588) pbs_submit failed (try 1/5), PBS error 15033: No free connections
This only started recently, after we had been using the cluster for over a year, and I'm not sure why it would start acting up now. It appears to be related to a bug/defect in pbs_python that doesn't seem to have been fixed yet (I posted a query asking whether it has been addressed):
https://oss.trac.surfsara.nl/pbs_python/ticket/29
Restarting helps, but are there any other recommended workarounds? Is the only solution recompiling Torque with a higher NCONNECTS?
Hi Chris,
I've always increased NCONNECTS to avoid this. You may also want to decrease the number of workers for the runners as this should decrease the number of connections that Galaxy makes.
You could also try the DRMAA runner.
--nate
Hi Nate,
I reduced the number of handlers and that seems to have taken care of it for now. One of the authors of pbs_python replied in the ticket I commented on (https://oss.trac.surfsara.nl/pbs_python/ticket/29#comment:5):
"Which framework do you use PBSQuery? Then there will be open/close with every query. If you use pbs_python you have to close the connection and open it again. The number of connections is handle by the pbs_server and also the time that a connection can be open. This has nothing to do with pbs python code, maybe i can improve the error codes."
I'm not sure how to answer that one, as I don't know exactly what Galaxy is doing internally.
BTW, we're still using a pre-April Galaxy release, so we may test this again once we have updated to the latest (probably in the next month or so). We may switch over to DRMAA if you find that more stable; we would just need to schedule downtime to recompile with DRMAA support and a higher NCONNECTS.
chris
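For reference, the open/close pattern described in the pbs_python author's reply could look roughly like the Python sketch below. It is not Galaxy's actual runner code. The helper names (pbs_connection, with_retries, NoFreeConnections) are made up for illustration, and the connect/disconnect callables are passed in so the example runs without a cluster. With pbs_python they would presumably be pbs.pbs_connect and pbs.pbs_disconnect, and the sketch assumes a failed connect returns a non-positive handle.

```python
import time
from contextlib import contextmanager


class NoFreeConnections(Exception):
    """Hypothetical error: the server did not hand out a connection."""


@contextmanager
def pbs_connection(server, connect, disconnect):
    """Open one connection, yield its handle, and always close it again.

    `connect` and `disconnect` are injected; with pbs_python they would
    presumably be pbs.pbs_connect and pbs.pbs_disconnect (assumption).
    """
    conn = connect(server)
    if conn <= 0:  # assumption: a non-positive handle means the connect failed
        raise NoFreeConnections(server)
    try:
        yield conn
    finally:
        # Release the server-side slot even if the operation raised.
        disconnect(conn)


def with_retries(operation, server, connect, disconnect,
                 tries=5, delay=2.0, sleep=time.sleep):
    """Run operation(conn) on a short-lived connection, retrying with a
    linearly growing pause when no connection could be opened."""
    for attempt in range(1, tries + 1):
        try:
            with pbs_connection(server, connect, disconnect) as conn:
                return operation(conn)
        except NoFreeConnections:
            if attempt == tries:
                raise
            sleep(delay * attempt)
```

The point of the sketch is that each operation holds a connection only for as long as it needs it, so the number of simultaneously open connections stays close to the number of concurrent workers rather than growing over time. Whether this matches what Galaxy does internally is exactly the open question above.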