Memory leaks with pbs job runner
Hi,

We emailed previously about possible memory leaks in our installation of Galaxy here on the HPC at Bristol. We can run Galaxy just fine on our login node, but when we integrate it with the cluster using the PBS job runner the whole thing falls over, almost certainly due to a memory leak. In essence, every attempt to submit a TopHat job (with 2x5GB paired-end reads against the full human genome) brings the whole thing down, but not when Galaxy is restricted to the login node. We saw that Nate responded to Todd Oakley about a week ago saying that there is a memory leak in libtorque or pbs_python when using the PBS job runner. Have there been any developments on this?

Best Wishes,
David.

__________________________________
Dr David A. Matthews
Senior Lecturer in Virology
Room E49
Department of Cellular and Molecular Medicine, School of Medical Sciences
University Walk, University of Bristol
Bristol. BS8 1TD
U.K.
Tel. +44 117 3312058
Fax. +44 117 3312091
D.A.Matthews@bristol.ac.uk
Hi David,

I am almost certain that the problem you have with TopHat is not due to the same leak, since that one is a slow leak, not an immediate spike. Before we go any further, in reading back over our past conversation about this problem, I noticed that I never asked whether you've set `set_metadata_externally = True` in your Galaxy config. If not, this is almost certainly the cause of the problem.

If you're already setting metadata externally, answers to a few of the questions I asked last time (or perhaps any findings of your HPC guys) and a few new things to try would be helpful in figuring out why your TopHat jobs still crash:

1. Create a separate job runner and web frontend so we can be sure that the job-running portion is the memory culprit:

   http://wiki.g2.bx.psu.edu/Admin/Config/Performance/Web%20Application%20Scali...

   You would not need any of the load balancing config, just start a single web process and a single runner process. From reading your prior email I believe you have a proxy server, so as long as you start the web process on the same port as your previous Galaxy server, no change would be needed to your proxy server.

2. Set `use_heartbeat = True` in the config file of whichever process is consuming all of the memory.

3. Does the MemoryError appear in the log after Galaxy has noticed that the job has finished on the cluster (`(<id>/<pbs id>) PBS job has left queue`), but before the job post-processing is finished (`job <id> ended`)?

4. Does the MemoryError appear regardless of whether anyone accesses the web interface?

There is another memory consumption problem we'll look at soon, which occurs when the job runner reads the metadata files written by the external set_metadata tool. If the output dataset(s) have an extremely large number of columns, this can cause a very large, nearly immediate memory spike when job post-processing begins, even if the output file itself is relatively small.

--nate
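For reference, a rough sketch of what the split in items 1 and 2 above could look like with the Paste-based universe_wsgi.ini layout of that era. This is illustrative only: `set_metadata_externally` and `use_heartbeat` are the options named in this thread, but the file names and the `enable_job_running` / `track_jobs_in_database` pair used to separate the two processes are assumptions here, so check everything against the wiki page Nate links before using any of it.

    # Sketch only: one config file per process, loosely following the Web
    # Application Scaling wiki page linked above. Options not quoted in the
    # thread (enable_job_running, track_jobs_in_database) are assumptions --
    # verify them against that page and your universe_wsgi.ini.sample.

    # universe_wsgi.runner.ini -- the process that talks to PBS and post-processes jobs
    [server:main]
    use = egg:Paste#http
    host = 127.0.0.1
    port = 8079                     # internal only; the proxy never talks to this process

    [app:main]
    # ... keep your existing [app:main] settings ...
    set_metadata_externally = True  # run the set_metadata tool on the cluster, not in this process
    use_heartbeat = True            # item 2: periodically dump thread state to a heartbeat log
    enable_job_running = True       # assumed option name: this process runs/monitors cluster jobs

    # universe_wsgi.webapp.ini -- the process behind your existing proxy
    [server:main]
    use = egg:Paste#http
    host = 127.0.0.1
    port = 8080                     # same port your proxy already points at

    [app:main]
    # ... keep your existing [app:main] settings ...
    set_metadata_externally = True
    enable_job_running = False      # assumed option name: the web process only enqueues jobs
    track_jobs_in_database = True   # assumed option name: lets the runner pick new jobs up from the DB

With a layout like this, `use_heartbeat` only needs to be enabled in the file belonging to the process that is ballooning, and the periodic thread dumps it writes should show what that process was doing when the memory spiked.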
Hi Nate,

Thanks for the feedback. We found that `set_metadata_externally` was commented out, so we have uncommented it and set it to True. We are waiting to see what happens. If that fails we'll start to run through the other options - many thanks for your patience on this matter!

Best Wishes,
David.

__________________________________
Dr David A. Matthews
Senior Lecturer in Virology
Room E49
Department of Cellular and Molecular Medicine, School of Medical Sciences
University Walk, University of Bristol
Bristol. BS8 1TD
U.K.
Tel. +44 117 3312058
Fax. +44 117 3312091
D.A.Matthews@bristol.ac.uk
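If the metadata change does not fix it, Nate's questions 3 and 4 come down to where the MemoryError lands relative to the two log messages he quotes. Below is a small sketch for pulling those lines out of the server log; the default `paster.log` path and the `job \d+ ended` pattern are assumptions, so adjust them for your setup.

    # Sketch: list, in file order, the log lines relevant to Nate's questions 3 and 4.
    # The three patterns come from the messages quoted in his reply; the default
    # log path "paster.log" is an assumption -- pass your real log file instead.
    import re
    import sys

    log_path = sys.argv[1] if len(sys.argv) > 1 else "paster.log"

    patterns = {
        "left_queue":  re.compile(r"PBS job has left queue"),
        "memoryerror": re.compile(r"MemoryError"),
        "job_ended":   re.compile(r"job \d+ ended"),
    }

    with open(log_path) as log:
        for lineno, line in enumerate(log, 1):
            for name, pattern in patterns.items():
                if pattern.search(line):
                    print("%8d  %-11s  %s" % (lineno, name, line.rstrip()))

If the MemoryError consistently falls between "PBS job has left queue" and "job <id> ended" for the same job, that points at job post-processing rather than web access (question 4), and it would then also be worth checking whether the TopHat outputs have an unusually large number of columns, since that matches the second problem Nate describes.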
participants (2)
- David Matthews
- Nate Coraor