Hi, John,

Thank you for taking the time to help me look into this issue. I have applied the patch you provided and confirmed that it appears to remediate the problem: when a walltime is exceeded, feedback is now provided via the Galaxy web UI, and jobs no longer appear to run indefinitely. One thing I would like to note is that the error presented to the user is generic; the web UI reports "An error occurred with this dataset: Job cannot be completed due to a cluster error, please retry it later". So the fact that a walltime-exceeded error actually occurred is not presented to the user (I am not sure whether this is intentional). Again, I appreciate you taking the time to verify and patch this issue. I have attached a screenshot of the output for your review.

I will probably be testing Galaxy with Torque 4.2.5 in the coming weeks, and I will let you know if I identify any additional problems. Thank you so much, and have a wonderful day.

Dan Sullivan


On Tue, Nov 5, 2013 at 8:48 AM, John Chilton <chilton@msi.umn.edu> wrote:
Hey Daniel,

Thanks so much for the detailed problem report, it was very helpful.
Reviewing the code there appears to be a bug in the PBS job runner -
in some cases pbs_job_state.stop_job is never set but is attempted to
be read. I don't have torque so I don't have a great test setup for
this problem, any chance you can make the following changes for me and
let me know if they work?

Between the following two lines:

                    log.error( '(%s/%s) PBS job failed: %s' % ( galaxy_job_id, job_id, JOB_EXIT_STATUS.get( int( status.exit_status ), 'Unknown error: %s' % status.exit_status ) ) )
                    self.work_queue.put( ( self.fail_job, pbs_job_state ) )

add the line pbs_job_state.stop_job = False, so that the code reads:

                    log.error( '(%s/%s) PBS job failed: %s' % ( galaxy_job_id, job_id, JOB_EXIT_STATUS.get( int( status.exit_status ), 'Unknown error: %s' % status.exit_status ) ) )
                    pbs_job_state.stop_job = False
                    self.work_queue.put( ( self.fail_job, pbs_job_state ) )
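
For what it's worth, here is a minimal sketch of the failure mode and of why setting the attribute first should avoid it. It does not use Galaxy's real classes: FakeJobState and this fail_job are hypothetical stand-ins that only reproduce the attribute read shown in the traceback.

```python
class FakeJobState(object):
    """Hypothetical stand-in for the job state object in the traceback.

    Like that object, it does not define stop_job when it is created.
    """

    def __init__(self, job_id):
        self.job_id = job_id


def fail_job(job_state):
    """Simplified stand-in for the runner's fail_job.

    Only the attribute read matters here; the real method does more.
    """
    if job_state.stop_job:
        return "stop then fail"
    return "fail only"


state = FakeJobState("163.sctest.cri.uchicago.edu")

# Without the extra line, reading stop_job raises AttributeError.
try:
    fail_job(state)
    raised = False
except AttributeError:
    raised = True

# With the attribute set before the call, the read succeeds.
state.stop_job = False
result = fail_job(state)
```

Setting the flag to False simply tells this sketch's fail_job that nothing needs to be stopped, since the job has already finished on the cluster.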

And at the top of the file, can you add a -11 entry to the
JOB_EXIT_STATUS dictionary to indicate a job timeout?
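
As a rough sketch of what that entry buys us: the lookup below mirrors the JOB_EXIT_STATUS.get( ... ) call in the log.error line above. The dictionary here is a cut-down, hypothetical excerpt with only the -11 entry (the message text is the one from Dan's second log), and describe_exit_status is a helper name made up for illustration.

```python
# Hypothetical excerpt: the real dictionary has other entries, which are
# omitted here. Only the -11 entry is discussed in this thread.
JOB_EXIT_STATUS = {
    -11: "job walltime exceeded",
}


def describe_exit_status(exit_status):
    """Return a readable message for a PBS exit status.

    Falls back to 'Unknown error: <status>' when the code is not in the
    dictionary, as the log.error call does.
    """
    return JOB_EXIT_STATUS.get(int(exit_status), 'Unknown error: %s' % exit_status)


known = describe_exit_status("-11")
unknown = describe_exit_status("-3")
```

Note that this only changes what is written to the log; by itself it does not prevent the AttributeError in fail_job.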

I have attached a patch that would apply against the latest stable -
it will probably work against your branch as well.

If you would rather not act as my QC layer, I can try to come up with
a way to do some testing on my end :).

Thanks again,
-John


On Mon, Nov 4, 2013 at 10:10 AM, Daniel Patrick Sullivan
<dansullivan@gmail.com> wrote:
> Hi, Galaxy Developers,
>
> I have what I hope is somewhat of a basic question regarding Galaxy's
> interaction with a pbs job cluster and information reported via the webUI.
> Basically, in certain situations, the walltime of a specific job is
> exceeded.  This is of course to be expected and all fine and
> understandable.
>
> My problem is that the information is not being relayed back to the end user
> via the Galaxy web UI, which causes confusion in our Galaxy user community.
> Basically the Torque scheduler generates the following message when a
> walltime is exceeded:
>
> 11/04/2013
> 08:39:45;000d;PBS_Server.30621;Job;163.sctest.cri.uchicago.edu;preparing to
> send 'a' mail for job 163.sctest.cri.uchicago.edu to
> s.cri.galaxy@crigalaxy-test.uchicago.edu (Job exceeded its walltime limit.
> Job was aborted
> 11/04/2013
> 08:39:45;0009;PBS_Server.30621;Job;163.sctest.cri.uchicago.edu;job exit
> status -11 handled
>
> Now, my problem is that this status -11 return code is not being correctly
> handled by Galaxy.  What happens is that Galaxy throws an exception,
> specifically:
>
> 10.135.217.178 - - [04/Nov/2013:08:39:42 -0500] "GET
> /api/histories/90240358ebde1489 HTTP/1.1" 200 -
> "https://crigalaxy-test.uchicago.edu/history" "Mozilla/5.0 (X11; Linux
> x86_64; rv:23.0) Gecko/20100101 Firefox/23.0"
> galaxy.jobs.runners.pbs DEBUG 2013-11-04 08:39:46,137
> (2150/163.sctest.cri.uchicago.edu) PBS job state changed from R to C
> galaxy.jobs.runners.pbs ERROR 2013-11-04 08:39:46,139
> (2150/163.sctest.cri.uchicago.edu) PBS job failed: Unknown error: -11
> galaxy.jobs.runners ERROR 2013-11-04 08:39:46,139 (unknown) Unhandled
> exception calling fail_job
> Traceback (most recent call last):
>   File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/__init__.py",
> line 60, in run_next
>     method(arg)
>   File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line
> 561, in fail_job
>     if pbs_job_state.stop_job:
> AttributeError: 'AsynchronousJobState' object has no attribute 'stop_job'
>
> After this exception occurs, the Galaxy job status via the Web UI is still
> reported as "Job is currently running".  It appears that the job will remain
> in this state (from the end user's perspective) indefinitely.  Has anybody
> seen this issue before?
>
> I noticed that return code -11 does not exist in
> /group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/pbs.py under the
> JOB_EXIT_STATUS  dictionary.   I tried adding an entry for this, however
> when I do the exception changes to:
>
> galaxy.jobs.runners.pbs ERROR 2013-11-04 10:02:17,274
> (2151/164.sctest.cri.uchicago.edu) PBS job failed: job walltime exceeded
> galaxy.jobs.runners ERROR 2013-11-04 10:02:17,275 (unknown) Unhandled
> exception calling fail_job
> Traceback (most recent call last):
>   File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/__init__.py",
> line 60, in run_next
>     method(arg)
>   File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line
> 562, in fail_job
>     if pbs_job_state.stop_job:
> AttributeError: 'AsynchronousJobState' object has no attribute 'stop_job'
>
> I am wondering if this is a bug or if it is just because I am using a newer
> version of TORQUE (I am using TORQUE 4.2.2).
>
> In terms of Galaxy, I am using:
>
> [s.cri.galaxy@crigalaxy-test galaxy-dist]$ hg parents
> changeset:   10408:6822f41bc9bb
> branch:      stable
> parent:      10393:d05bf67aefa6
> user:        Dave Bouvier <dave@bx.psu.edu>
> date:        Mon Aug 19 13:06:17 2013 -0400
> summary:     Fix for case where running functional tests might overwrite
> certain files in database/files.
>
> [s.cri.galaxy@crigalaxy-test galaxy-dist]$
>
> Does anybody know how I could fix this such that walltime exceeded messages
> are correctly reported via the Galaxy web UI for TORQUE 4.2.2?  Thank you
> so much for your input and guidance, and for the ongoing development of
> Galaxy.
>
> Dan Sullivan