Defunct munge processes using Torque PBS
I am fairly new to PBS management, so I can't rule out some misconfiguration, but I have a strange issue when running Galaxy with the PBS job runner. It seems that munge spawns a bunch of defunct processes after running Galaxy on my cluster.

`ps axjf`:

    1 25992 25991 25991 ?  -1 Sl 77777 8:48 python ./scripts/paster.py serve universe_wsgi.ini --daemon
25992 26032 25991 25991 ?  -1 Z  77777 0:00  \_ [munge] <defunct>
25992 26034 25991 25991 ?  -1 Z  77777 0:00  \_ [munge] <defunct>
25992 26036 25991 25991 ?  -1 Z  77777 0:00  \_ [munge] <defunct>

Now, these processes are being spawned by Galaxy, and I can't figure out why. Can anyone provide some insight or clues about where to start debugging this?

Thanks,
Matt
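[Editor's note: a `<defunct>` entry is a "zombie": the child has already exited, but its parent (here the Galaxy `paster.py` process) has not yet called wait() to collect its exit status, so the kernel keeps the process-table entry around. A minimal sketch of the mechanism, where the forked child is just a stand-in for munge, not the real binary:]

```python
import os
import subprocess
import time

# Fork a short-lived child (a stand-in for munge) and deliberately
# do NOT wait() on it, the way a buggy submission path might.
pid = os.fork()
if pid == 0:
    os._exit(0)  # child exits immediately

time.sleep(0.5)  # give the child time to exit

# The child is now a zombie: ps reports its state as 'Z' (<defunct>).
out = subprocess.run(["ps", "-o", "stat=", "-p", str(pid)],
                     capture_output=True, text=True).stdout.strip()
print(out)

# Reaping the child with waitpid() clears the <defunct> entry.
os.waitpid(pid, 0)
```

Because the zombies in the `ps` output above are children of `paster.py` (PPID 25992), the unreaped fork must be happening inside the Galaxy process or a library it loads.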
On Dec 6, 2012, at 2:34 PM, Matthew Shirley wrote:
I am fairly new to PBS management, so I can't rule out some misconfiguration, but I have a strange issue when running Galaxy with the PBS job runner. It seems that munge spawns a bunch of defunct processes after running Galaxy on my cluster:
`ps axjf`:
    1 25992 25991 25991 ?  -1 Sl 77777 8:48 python ./scripts/paster.py serve universe_wsgi.ini --daemon
25992 26032 25991 25991 ?  -1 Z  77777 0:00  \_ [munge] <defunct>
25992 26034 25991 25991 ?  -1 Z  77777 0:00  \_ [munge] <defunct>
25992 26036 25991 25991 ?  -1 Z  77777 0:00  \_ [munge] <defunct>
Now, these processes are being spawned by Galaxy, and I can't figure out why. Can anyone provide some insight or clues about where to start debugging this? Thanks,
Hi Matt,

I'm not sure what munge is, it's not something provided with Galaxy. Googling suggests it might be an authentication tool used in some HPC environments. Without having any familiarity with it, I can't say what process in Galaxy would be interacting with it, especially since that interaction must occur implicitly somewhere down the chain of normal Galaxy operations.

--nate
Matt
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
Hi Nate,

I do understand that this is not a bug stemming directly from the Galaxy code base. Munge is really just a tool for passing user credentials between systems during job submission to the PBS server. Galaxy is spooling jobs through the PBS job runner, which presumably calls munge indirectly during job submission to the PBS server. I'm just not sure why the munge processes are sometimes left defunct. This is an issue because I rapidly reach the maximum number of processes for the Galaxy user on my head node. At this point I guess I'll try downloading the latest stable version of Torque and building RPMs; I have been using what is in EPEL for RHEL6. Thanks for the reply, and any other thoughts are still appreciated!

On Dec 7, 2012, at 10:51 AM, Nate Coraor <nate@bx.psu.edu> wrote:
On Dec 6, 2012, at 2:34 PM, Matthew Shirley wrote:
I am fairly new to PBS management, so I can't rule out some misconfiguration, but I have a strange issue when running Galaxy with the PBS job runner. It seems that munge spawns a bunch of defunct processes after running Galaxy on my cluster:
`ps axjf`:
    1 25992 25991 25991 ?  -1 Sl 77777 8:48 python ./scripts/paster.py serve universe_wsgi.ini --daemon
25992 26032 25991 25991 ?  -1 Z  77777 0:00  \_ [munge] <defunct>
25992 26034 25991 25991 ?  -1 Z  77777 0:00  \_ [munge] <defunct>
25992 26036 25991 25991 ?  -1 Z  77777 0:00  \_ [munge] <defunct>
Now, these processes are being spawned by Galaxy, and I can't figure out why. Can anyone provide some insight or clues about where to start debugging this? Thanks,
Hi Matt,
I'm not sure what munge is, it's not something provided with Galaxy. Googling suggests it might be an authentication tool used in some HPC environments. Without having any familiarity with it, I can't say what process in Galaxy would be interacting with it, especially since that interaction must occur implicitly somewhere down the chain of normal Galaxy operations.
--nate
Matt
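[Editor's note: the zombie accumulation Matt describes has a standard parent-side mitigation regardless of which component forks munge (presumably the Torque client library underneath the PBS bindings, though that is an assumption): the parent periodically reaps any exited children with a non-blocking waitpid() loop. A hedged sketch of that mechanism, not code from Galaxy or Torque:]

```python
import errno
import os

def reap_children():
    """Reap any exited child processes without blocking.

    Returns a list of (pid, exit_code) tuples for the children
    collected; exit_code is None if a child did not exit normally.
    """
    reaped = []
    while True:
        try:
            # WNOHANG: return immediately instead of blocking
            # when no child has exited yet.
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError as e:
            if e.errno == errno.ECHILD:  # no children left at all
                break
            raise
        if pid == 0:  # children exist, but none have exited yet
            break
        code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
        reaped.append((pid, code))
    return reaped
```

Calling something like this from the submitting process's main loop (or installing a SIGCHLD handler that does the same) keeps `<defunct>` entries from counting against the per-user process limit.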
participants (2)
-
Matthew Shirley
-
Nate Coraor