Hi Nate,

Many thanks for these ideas - our HPC guys are going to try a few things. Hopefully we'll nail the problem and be able to report back in case someone else has the same issues.


Best Wishes,
David.

__________________________________
Dr David A. Matthews

Senior Lecturer in Virology
Room E49
Department of Cellular and Molecular Medicine,
School of Medical Sciences
University Walk,
University of Bristol
Bristol.
BS8 1TD
U.K.

Tel. +44 117 3312058
Fax. +44 117 3312091







On 19 Dec 2011, at 15:56, Nate Coraor wrote:

On Dec 14, 2011, at 6:13 PM, David Matthews wrote:

Hi Guys,

Sorry to be a pain but this seems to be getting worse for us. Here are the latest tracebacks - any suggestions would be gratefully received!!

Hi David,

As the MemoryError indicates, the Galaxy process is running out of memory. Setting debug = False is actually preferable. I asked because having debug = True could easily result in the behavior you're seeing.

The pbs code definitely has a memory leak, which I believe is within libtorque or pbs_python. Because of this, I restart my job runner process when it reaches a certain amount of memory usage. However, this may not be the cause of your errors. To figure it out, we'll need to know exactly which thread is consuming the memory. You may want to enable the heartbeat log and look there to see which threads are active.
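
As a rough illustration of the "restart when memory usage gets too high" approach mentioned above, here is a minimal sketch of the check itself. It is not part of Galaxy. It assumes a Linux host where /proc/<pid>/status reports a VmRSS line, and the function names and the 2 GB limit are hypothetical, chosen only for this example. How the job runner is actually restarted is left to whatever supervisor or init script you already use.

```python
# Sketch only: decide whether a process has grown past a memory limit.
# Assumes Linux, where /proc/<pid>/status contains a "VmRSS: <n> kB" line.
# All names and the limit below are illustrative, not Galaxy settings.

LIMIT_KB = 2 * 1024 * 1024  # hypothetical limit: 2 GB resident memory


def parse_rss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     123456 kB".
            return int(line.split()[1])
    return None


def read_rss_kb(pid):
    """Read the resident set size (kB) of a running process by PID."""
    with open("/proc/%d/status" % pid) as handle:
        return parse_rss_kb(handle.read())


def needs_restart(rss_kb, limit_kb=LIMIT_KB):
    """True if a memory reading is available and exceeds the limit."""
    return rss_kb is not None and rss_kb > limit_kb
```

A cron job or wrapper script could call read_rss_kb() with the job runner's PID and trigger a restart when needs_restart() returns True.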

The question about the path was in reference to whether these errors occur immediately upon running a tophat job, without any interaction, or if they occur when you try to click to view the job's output, or on some other part of the Galaxy interface.

Thanks,
--nate


Cheers
David



galaxy.jobs.runners.pbs ERROR 2011-12-13 19:57:57,689 Uncaught exception checking jobs
Traceback (most recent call last):
File "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line 338, in monitor
 self.check_watched_items()
File "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line 351, in check_watched_items
 ( failures, statuses ) = self.check_all_jobs()
File "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line 462, in check_all_jobs
 statuses.update( self.convert_statjob_to_bunches( jobs ) )
File "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line 476, in convert_statjob_to_bunches
 statuses[ job.name ] = Bunch( **status )
MemoryError
Unhandled exception in thread started by
Traceback (most recent call last):
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 504, in __bootstrap
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 580, in __bootstrap_inner
MemoryError
Unhandled exception in thread started by <bound method Thread.__bootstrap of <Thread(Thread-11, stopped 1111390528)>>
Traceback (most recent call last):
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 504, in __bootstrap
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 545, in __bootstrap_inner
MemoryError
Unexpected exception in worker <function <lambda> at 0x883acf8>
Traceback (most recent call last):
File "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py", line 863, in worker_thread_callback
File "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py", line 1037, in <lambda>
File "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py", line 1056, in process_request_in_thread
File "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py", line 1044, in handle_error
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/SocketServer.py", line 334, in handle_error
MemoryError
Unhandled exception in thread started by <bound method Thread.__bootstrap of <Thread(Thread-10, stopped 1109289280)>>
Traceback (most recent call last):
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 504, in __bootstrap
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 545, in __bootstrap_inner
MemoryError
----------------------------------------
Exception happened during processing of request from ('xxx.xxx.xxx.xxx', 44389)
Traceback (most recent call last):
File "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py", line 1053, in process_request_in_thread
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/SocketServer.py", line 322, in finish_request
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/SocketServer.py", line 616, in __init__
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/SocketServer.py", line 657, in setup
MemoryError
----------------------------------------
----------------------------------------
Exception happened during processing of request from ('xxx.xxx.xx.xx', 60069)
Unexpected exception in worker <function <lambda> at 0x883a2a8>Traceback (most recent call last):

File "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py", line 1053, in process_request_in_thread
Unhandled exception in thread started by <bound method Thread.__bootstrap of <Thread(worker 9, stopped 1130301760)>>
Traceback (most recent call last):
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 504, in __bootstrap
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 545, in __bootstrap_inner
MemoryError  File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/SocketServer.py", line 322, in finish_request

Unexpected exception in worker <function <lambda> at 0x8721410>
Traceback (most recent call last):
File "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py", line 863, in worker_thread_callback
Unhandled exception in thread started by <bound method Thread.__bootstrap of <Thread(worker 0, stopped 1086265664)>>
Traceback (most recent call last):
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 504, in __bootstrap
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 545, in __bootstrap_inner
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 242, in format_exc
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 142, in format_exception
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 76, in format_tb
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 101, in extract_tb
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/linecache.py", line 14, in getline
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/linecache.py", line 40, in getlines
MemoryError
----------------------------------------
Exception happened during processing of request from ('xxx.xxx.xx.xx', 60071)
Traceback (most recent call last):
File "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py", line 1053, in process_request_in_thread
Unexpected exception in worker <function <lambda> at 0x8721410>
Traceback (most recent call last):
File "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py", line 863, in worker_thread_callback
Unhandled exception in thread started by <bound method Thread.__bootstrap of <Thread(worker 6, stopped 1123998016)>>
Traceback (most recent call last):
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 504, in __bootstrap
 self.__bootstrap_inner()
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 545, in __bootstrap_inner
 (self.name, _format_exc()))
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 242, in format_exc
 return ''.join(format_exception(etype, value, tb, limit))
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 142, in format_exception
 list = list + format_tb(tb, limit)
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 76, in format_tb
 return format_list(extract_tb(tb, limit))
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 101, in extract_tb
 line = linecache.getline(filename, lineno, f.f_globals)
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/linecache.py", line 14, in getline
 lines = getlines(filename, module_globals)
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/linecache.py", line 40, in getlines
 return updatecache(filename, module_globals)
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/linecache.py", line 131, in updatecache
 lines = fp.readlines()
MemoryError
----------------------------------------
Exception happened during processing of request from ('xxx.xxx.xxx.xxx', 44416)
Traceback (most recent call last):
File "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py", line 1053, in process_request_in_thread
Unexpected exception in worker <function <lambda> at 0x8721410>
Traceback (most recent call last):
File "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py", line 863, in worker_thread_callback
Unhandled exception in thread started by <bound method Thread.__bootstrap of <Thread(worker 7, stopped 1126099264)>>
Traceback (most recent call last):
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 504, in __bootstrap
 self.__bootstrap_inner()
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 545, in __bootstrap_inner
 (self.name, _format_exc()))
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 242, in format_exc
 return ''.join(format_exception(etype, value, tb, limit))
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 142, in format_exception
 list = list + format_tb(tb, limit)
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 76, in format_tb
 return format_list(extract_tb(tb, limit))
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 101, in extract_tb
 line = linecache.getline(filename, lineno, f.f_globals)
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/linecache.py", line 14, in getline
 lines = getlines(filename, module_globals)
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/linecache.py", line 40, in getlines
 return updatecache(filename, module_globals)
File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/linecache.py", line 131, in updatecache
 lines = fp.readlines()
MemoryError
----------------------------------------

--
-----------------------------------------------------------
Callum Wright
HPC Systems Administrator
High Performance Computing
University of Bristol

Phone:   0117 331 4429
email:   c.wright@bristol.ac.uk
web:            www.acrc.bristol.ac.uk
-----------------------------------------------------------