Re: [galaxy-dev] Irregular failures on cluster (pbs_python?)

21 May 2013


On May 21, 2013, at 2:13 PM, Christopher Fields <cjfields@illinois.edu> wrote:
...
On May 20, 2013, at 3:08 PM, Nate Coraor <nate@bx.psu.edu> wrote:
...
...
Hi Chris,
We're disconnecting under all normal conditions and most error conditions - it looks like only a few conditions would not properly disconnect:
- If pbs_submit() fails 5 times in a row
- If an exception is raised anywhere in the queue_job() method after pbs_connect() is called
- If the call to pbs_statjob() in check_all_jobs() or check_single_job() raises an exception (if that's even possible)
For the exceptions, you would see that an exception was caught in the log file, so you should be able to determine if this is happening.
For the pbs_submit() case, you'd see the message "All attempts to submit job failed".
You may want to move the call to pbs_connect() in queue_job() so that it occurs immediately prior to the call to pbs_submit() and see if that makes a difference.  The reason we connect so early on is to avoid writing out the job's files if the PBS server doesn't exist anyway.
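
A minimal sketch of the suggestion above, not Galaxy's actual runner code: connect immediately before submitting, and disconnect in a finally block so the connection is released whether pbs_submit() fails on every attempt or an exception is raised. The helper names pbs_connection() and submit_with_retries() and the FakePBS stand-in are hypothetical. The pbs module is passed in as a parameter so the sketch runs without a cluster, and the calls assume the usual pbs_python behaviour (pbs_connect() returns a positive connection id on success, pbs_submit() returns a job id or None).

```python
from contextlib import contextmanager


@contextmanager
def pbs_connection(pbs, server):
    """Open a PBS connection and always close it, even if the body raises."""
    conn = pbs.pbs_connect(server)
    if conn <= 0:
        raise RuntimeError("Connection to PBS server %r failed" % server)
    try:
        yield conn
    finally:
        # Runs on normal exit, on early return, and on exceptions.
        pbs.pbs_disconnect(conn)


def submit_with_retries(pbs, server, attropl, script_file, queue, attempts=5):
    """Connect right before submitting; return the job id, or None on failure."""
    with pbs_connection(pbs, server) as conn:
        for _ in range(attempts):
            job_id = pbs.pbs_submit(conn, attropl, script_file, queue, None)
            if job_id:
                return job_id
    return None  # caller would log "All attempts to submit job failed"


class FakePBS:
    """Hypothetical stand-in for the pbs module, for illustration only."""

    def __init__(self, fail_times=0):
        self.fail_times = fail_times
        self.open_connections = 0

    def pbs_connect(self, server):
        self.open_connections += 1
        return 1

    def pbs_disconnect(self, conn):
        self.open_connections -= 1

    def pbs_submit(self, conn, attropl, script, queue, extend):
        if self.fail_times > 0:
            self.fail_times -= 1
            return None
        return "1.server"
```

With the stand-in, FakePBS(fail_times=5) makes all five attempts fail, yet open_connections is back at zero afterwards, which is the case that currently leaks a connection.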
Yes, seeing the pbs_submit() case.  Lowering the number of handlers does seem to help, but we still seem to run into this after
Sorry, saw a unicorn.  Meant, 'we still seem to run into this after a period of time on the cluster'.  Coffee…  :P

chris
