Hi, John,

Thank you for taking the time to help me look into this issue. I have applied the patch you provided and confirmed that it appears to remediate the problem: when a walltime is exceeded, feedback is now provided via the Galaxy web UI, and jobs no longer appear to run indefinitely.

One thing I would like to note is that the error presented to the user is generic; the web UI reports "An error occurred with this dataset: Job cannot be completed due to a cluster error, please retry it later". So the fact that a walltime exceeded error actually occurred is not presented to the user (I am not sure if this is intentional or not).

Again, I appreciate you taking the time to verify and patch this issue. I have attached a screenshot of the output for your review. I am probably going to be testing Galaxy with Torque 4.2.5 in the coming weeks; I will let you know if I identify any additional problems.

Thank you so much, and have a wonderful day.

Dan Sullivan

On Tue, Nov 5, 2013 at 8:48 AM, John Chilton <chilton@msi.umn.edu> wrote:
Hey Daniel,
Thanks so much for the detailed problem report, it was very helpful. Reviewing the code, there appears to be a bug in the PBS job runner - in some cases pbs_job_state.stop_job is never set, yet fail_job still tries to read it. I don't have Torque, so I don't have a great test setup for this problem. Any chance you can make the following changes for me and let me know if they work?
Between the following two lines:

    log.error( '(%s/%s) PBS job failed: %s' % ( galaxy_job_id, job_id, JOB_EXIT_STATUS.get( int( status.exit_status ), 'Unknown error: %s' % status.exit_status ) ) )
    self.work_queue.put( ( self.fail_job, pbs_job_state ) )

add a line that sets pbs_job_state.stop_job to False, so that the code reads:

    log.error( '(%s/%s) PBS job failed: %s' % ( galaxy_job_id, job_id, JOB_EXIT_STATUS.get( int( status.exit_status ), 'Unknown error: %s' % status.exit_status ) ) )
    pbs_job_state.stop_job = False
    self.work_queue.put( ( self.fail_job, pbs_job_state ) )
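To illustrate why setting the flag before queuing fail_job matters, here is a minimal standalone sketch. It is not Galaxy code: the AsynchronousJobState class and fail_job function below are simplified, hypothetical stand-ins that only reproduce the attribute read seen in the traceback.

```python
class AsynchronousJobState(object):
    """Hypothetical stand-in for the job state object; it never defines stop_job."""

    def __init__(self, job_id):
        self.job_id = job_id


def fail_job(job_state):
    """Simplified stand-in that reads stop_job the way the traceback shows."""
    if job_state.stop_job:
        return "stop requested, then fail"
    return "fail only"


def try_fail(job_state):
    """Call fail_job and report an AttributeError instead of propagating it."""
    try:
        return fail_job(job_state)
    except AttributeError as exc:
        return "unhandled: %s" % exc


# Without the patch: the attribute was never set, so the read fails.
unpatched = AsynchronousJobState("163")
unpatched_result = try_fail(unpatched)

# With the patch: the flag is set explicitly before fail_job runs.
patched = AsynchronousJobState("163")
patched.stop_job = False
patched_result = try_fail(patched)

print(unpatched_result)
print(patched_result)
```

In the unpatched case the read raises AttributeError, which matches the "Unhandled exception calling fail_job" message in the log; in the patched case fail_job can proceed.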
And at the top of the file, can you add a -11 entry to the JOB_EXIT_STATUS dictionary to indicate a job timeout?
I have attached a patch that would apply against the latest stable - it will probably work against your branch as well.
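As a rough sketch of what the -11 entry does, the snippet below mirrors the lookup used in the log.error call above. The dictionary here is deliberately reduced to the single new entry (the real JOB_EXIT_STATUS in pbs.py has other entries that are not reproduced), the message text is the one that appears in Dan's later log output, and describe_exit_status is a hypothetical helper name used only for illustration.

```python
# Reduced illustration: only the new entry is shown; the real dictionary
# in pbs.py contains additional exit codes that are omitted here.
JOB_EXIT_STATUS = {
    -11: "job walltime exceeded",
}


def describe_exit_status(exit_status):
    """Hypothetical helper: same lookup-with-fallback pattern as the log.error call."""
    return JOB_EXIT_STATUS.get(int(exit_status), 'Unknown error: %s' % exit_status)


print(describe_exit_status("-11"))
print(describe_exit_status("-99"))
```

With the entry present, the log message names the timeout instead of falling back to "Unknown error: -11". Note that this only changes the log text; it does not by itself fix the stop_job exception.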
If you would rather not act as my QC layer, I can try to come up with a way to do some testing on my end :).
Thanks again, -John
On Mon, Nov 4, 2013 at 10:10 AM, Daniel Patrick Sullivan <dansullivan@gmail.com> wrote:

Hi, Galaxy Developers,

I have what I hope is a fairly basic question regarding Galaxy's interaction with a PBS job cluster and the information reported via the web UI. In certain situations, the walltime of a specific job is exceeded. This is of course to be expected and entirely understandable.

My problem is that this information is not relayed back to the end user via the Galaxy web UI, which causes confusion in our Galaxy user community. The Torque scheduler generates the following messages when a walltime is exceeded:

    11/04/2013 08:39:45;000d;PBS_Server.30621;Job;163.sctest.cri.uchicago.edu;preparing to send 'a' mail for job 163.sctest.cri.uchicago.edu to s.cri.galaxy@crigalaxy-test.uchicago.edu (Job exceeded its walltime limit. Job was aborted
    11/04/2013 08:39:45;0009;PBS_Server.30621;Job;163.sctest.cri.uchicago.edu;job exit status -11 handled

The problem is that this -11 exit status is not handled correctly by Galaxy. Instead, Galaxy throws an exception, specifically:

    10.135.217.178 - - [04/Nov/2013:08:39:42 -0500] "GET /api/histories/90240358ebde1489 HTTP/1.1" 200 - "https://crigalaxy-test.uchicago.edu/history" "Mozilla/5.0 (X11; Linux x86_64; rv:23.0) Gecko/20100101 Firefox/23.0"
    galaxy.jobs.runners.pbs DEBUG 2013-11-04 08:39:46,137 (2150/163.sctest.cri.uchicago.edu) PBS job state changed from R to C
    galaxy.jobs.runners.pbs ERROR 2013-11-04 08:39:46,139 (2150/163.sctest.cri.uchicago.edu) PBS job failed: Unknown error: -11
    galaxy.jobs.runners ERROR 2013-11-04 08:39:46,139 (unknown) Unhandled exception calling fail_job
    Traceback (most recent call last):
      File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/__init__.py", line 60, in run_next
        method(arg)
      File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line 561, in fail_job
        if pbs_job_state.stop_job:
    AttributeError: 'AsynchronousJobState' object has no attribute 'stop_job'

After this exception occurs, the Galaxy job status in the web UI is still reported as "Job is currently running". It appears that the job will remain in this state (from the end user's perspective) indefinitely. Has anybody seen this issue before?

I noticed that return code -11 does not exist in the JOB_EXIT_STATUS dictionary in /group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/pbs.py. I tried adding an entry for it; however, when I do, the exception changes to:

    galaxy.jobs.runners.pbs ERROR 2013-11-04 10:02:17,274 (2151/164.sctest.cri.uchicago.edu) PBS job failed: job walltime exceeded
    galaxy.jobs.runners ERROR 2013-11-04 10:02:17,275 (unknown) Unhandled exception calling fail_job
    Traceback (most recent call last):
      File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/__init__.py", line 60, in run_next
        method(arg)
      File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line 562, in fail_job
        if pbs_job_state.stop_job:
    AttributeError: 'AsynchronousJobState' object has no attribute 'stop_job'
I am wondering if this is a bug or if it is just because I am using a newer version of TORQUE (I am using TORQUE 4.2.2).
In terms of Galaxy, I am using:
    [s.cri.galaxy@crigalaxy-test galaxy-dist]$ hg parents
    changeset:   10408:6822f41bc9bb
    branch:      stable
    parent:      10393:d05bf67aefa6
    user:        Dave Bouvier <dave@bx.psu.edu>
    date:        Mon Aug 19 13:06:17 2013 -0400
    summary:     Fix for case where running functional tests might overwrite certain files in database/files.
[s.cri.galaxy@crigalaxy-test galaxy-dist]$
Does anybody know how I could fix this so that walltime exceeded messages are correctly reported via the Galaxy web UI for TORQUE 4.2.2? Thank you so much for your input and guidance, and for the ongoing development of Galaxy.
Dan Sullivan
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/