Hi, Galaxy Developers,

I have what I hope is a fairly basic question about Galaxy's interaction with a PBS job cluster and the information reported via the web UI.  In certain situations, a job exceeds its walltime.  This is of course to be expected and is perfectly understandable.

My problem is that this information is not being relayed back to the end user via the Galaxy web UI, which causes confusion in our Galaxy user community.  The TORQUE scheduler generates the following message when a walltime is exceeded:

11/04/2013 08:39:45;000d;PBS_Server.30621;Job;163.sctest.cri.uchicago.edu;preparing to send 'a' mail for job 163.sctest.cri.uchicago.edu to s.cri.galaxy@crigalaxy-test.uchicago.edu (Job exceeded its walltime limit. Job was aborted
11/04/2013 08:39:45;0009;PBS_Server.30621;Job;163.sctest.cri.uchicago.edu;job exit status -11 handled

The exit status of -11 is not being handled correctly by Galaxy.  Instead, Galaxy throws an exception, specifically:

10.135.217.178 - - [04/Nov/2013:08:39:42 -0500] "GET /api/histories/90240358ebde1489 HTTP/1.1" 200 - "https://crigalaxy-test.uchicago.edu/history" "Mozilla/5.0 (X11; Linux x86_64; rv:23.0) Gecko/20100101 Firefox/23.0"
galaxy.jobs.runners.pbs DEBUG 2013-11-04 08:39:46,137 (2150/163.sctest.cri.uchicago.edu) PBS job state changed from R to C
galaxy.jobs.runners.pbs ERROR 2013-11-04 08:39:46,139 (2150/163.sctest.cri.uchicago.edu) PBS job failed: Unknown error: -11
galaxy.jobs.runners ERROR 2013-11-04 08:39:46,139 (unknown) Unhandled exception calling fail_job
Traceback (most recent call last):
  File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/__init__.py", line 60, in run_next
    method(arg)
  File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line 561, in fail_job
    if pbs_job_state.stop_job:
AttributeError: 'AsynchronousJobState' object has no attribute 'stop_job'

After this exception occurs, the job status in the Galaxy web UI is still reported as "Job is currently running".  It appears that the job will remain in this state (from the end user's perspective) indefinitely.  Has anybody seen this issue before?

I noticed that return code -11 does not exist in the JOB_EXIT_STATUS dictionary in /group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/pbs.py.  I tried adding an entry for it.  With that change the logged error message becomes "job walltime exceeded", but the same AttributeError is still raised:

galaxy.jobs.runners.pbs ERROR 2013-11-04 10:02:17,274 (2151/164.sctest.cri.uchicago.edu) PBS job failed: job walltime exceeded
galaxy.jobs.runners ERROR 2013-11-04 10:02:17,275 (unknown) Unhandled exception calling fail_job
Traceback (most recent call last):
  File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/__init__.py", line 60, in run_next
    method(arg)
  File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line 562, in fail_job
    if pbs_job_state.stop_job:
AttributeError: 'AsynchronousJobState' object has no attribute 'stop_job'
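
To make the two pieces involved concrete, here is a minimal Python sketch. It is not Galaxy's actual code: the class below is only a stand-in that mimics an object without a stop_job attribute, the dictionary holds just the one entry discussed above, and describe_exit_status and should_stop_job are hypothetical helper names. It only illustrates that mapping -11 to a message and guarding the attribute lookup are separate concerns, which would explain why adding the dictionary entry alone does not remove the exception.

```python
# Illustrative stand-ins only; none of this is copied from Galaxy.

# Assumed mapping of a PBS exit status to a readable message. The text for
# -11 matches the message that appears in the log after adding the entry.
JOB_EXIT_STATUS = {
    -11: "job walltime exceeded",
}


class AsynchronousJobState(object):
    """Stand-in for a job state object that has no stop_job attribute."""

    def __init__(self, job_id):
        self.job_id = job_id


def describe_exit_status(status):
    """Return a readable message, falling back to 'Unknown error: N'."""
    return JOB_EXIT_STATUS.get(status, "Unknown error: %s" % status)


def should_stop_job(job_state):
    """Read stop_job defensively so a missing attribute does not raise."""
    return getattr(job_state, "stop_job", False)


if __name__ == "__main__":
    state = AsynchronousJobState("163.sctest.cri.uchicago.edu")
    print(describe_exit_status(-11))
    print(should_stop_job(state))
```

Whether defaulting stop_job to False is the right behaviour inside fail_job is an open question; the sketch only shows the shape of a guard, not a verified fix.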

I am wondering whether this is a bug in Galaxy or a consequence of my using a newer version of TORQUE (I am using TORQUE 4.2.2).

In terms of Galaxy, I am using:

[s.cri.galaxy@crigalaxy-test galaxy-dist]$ hg parents
changeset:   10408:6822f41bc9bb
branch:      stable
parent:      10393:d05bf67aefa6
user:        Dave Bouvier <dave@bx.psu.edu>
date:        Mon Aug 19 13:06:17 2013 -0400
summary:     Fix for case where running functional tests might overwrite certain files in database/files.

[s.cri.galaxy@crigalaxy-test galaxy-dist]$

Does anybody know how I could fix this so that walltime-exceeded messages are correctly reported via the Galaxy web UI with TORQUE 4.2.2?  Thank you so much for your input and guidance, and for the ongoing development of Galaxy.

Dan Sullivan