PBS Server Throwing Errors
 
Hi, Galaxy Community,

Greetings from The University of Chicago; I hope all who attended the Galaxy conference enjoyed it as much as I did. I have searched the mailing list archives as well as Google to resolve a problem I am seeing; however, I am somewhat at a loss as to the next step toward closing this issue. I am hoping that one of the bright minds on this mailing list can help me find the solution to my problem, or at least help me identify a root cause.

I have configured Galaxy to integrate with a TORQUE (version 4.0.2) server, and successfully built the PBS python egg as specified in the Galaxy documentation. I am using Python 2.6 and the latest build of Galaxy. Whenever I launch a job from the Galaxy UI, I get the following error messages on the PBS server:

07/30/2012 15:42:48;0080;PBS_Server;Req;dis_request_read;conflicting version numbers, 1 detected, 2 expected
07/30/2012 15:42:48;0080;PBS_Server;Req;req_reject;Reject reply code=15058(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=AlternateUserAuthentication, from galaxy@
07/30/2012 15:42:48;0080;PBS_Server;Req;dis_request_read;conflicting version numbers, 1 detected, 2 expected
07/30/2012 15:42:48;0080;PBS_Server;Req;req_reject;Reject reply code=15058(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=QueueJob, from galaxy@
07/30/2012 15:42:48;0080;PBS_Server;Req;dis_request_read;conflicting version numbers, 1 detected, 2 expected
07/30/2012 15:42:48;0080;PBS_Server;Req;req_reject;Reject reply code=15058(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Disconnect, from galaxy@
07/30/2012 15:43:01;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 1

One thing I noticed that suggests there might be a problem is that there is no hostname after "galaxy@"; most of the other messages in this log file have a hostname appended to the log entry, e.g.:

07/30/2012 15:43:38;0100;PBS_Server;Req;;Type StatusJob request received from root@sc01, sock=10

I have run tcpdump on the scheduler node, and I can definitely see bidirectional traffic between the Galaxy server and the scheduler node on TCP port 15001. In addition, I have installed the TORQUE client tools on the Galaxy server, and from the Galaxy server I can spawn an interactive job with qsub -I and check the status of queued jobs with qstat. This suggests to me that there is a problem with the PBS egg, although I am not certain.

Has anybody seen something like this before, or could somebody point me in the right direction? We do have a support contract with Adaptive Computing, and I am opening a ticket with them as well; however, I wanted to reach out to the Galaxy community to cover all of my bases.

Thank you so much for taking the time to read my email.

Dan Sullivan
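The server log lines quoted in Dan's message are semicolon-delimited, and the two details that matter here (the protocol versions in "conflicting version numbers, 1 detected, 2 expected" and the empty host in "from galaxy@") can be pulled out mechanically. The following is a minimal Python sketch that runs on Python 2.6 as well as Python 3. The field layout (timestamp, event code, daemon, class, function, message) is inferred from the quoted lines rather than taken from TORQUE documentation, and summarize_log and SAMPLE_LOG are hypothetical names used only for illustration.

```python
import re

# Message patterns taken from the PBS_Server log lines quoted above.
VERSION_RE = re.compile(r"conflicting version numbers, (\d+) detected, (\d+) expected")
REJECT_RE = re.compile(r"type=(\w+), from ([^@\s]*)@(\S*)")

# Hypothetical sample: the lines from Dan's message, one record per line.
SAMPLE_LOG = """\
07/30/2012 15:42:48;0080;PBS_Server;Req;dis_request_read;conflicting version numbers, 1 detected, 2 expected
07/30/2012 15:42:48;0080;PBS_Server;Req;req_reject;Reject reply code=15058(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=AlternateUserAuthentication, from galaxy@
07/30/2012 15:42:48;0080;PBS_Server;Req;dis_request_read;conflicting version numbers, 1 detected, 2 expected
07/30/2012 15:42:48;0080;PBS_Server;Req;req_reject;Reject reply code=15058(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=QueueJob, from galaxy@
07/30/2012 15:42:48;0080;PBS_Server;Req;dis_request_read;conflicting version numbers, 1 detected, 2 expected
07/30/2012 15:42:48;0080;PBS_Server;Req;req_reject;Reject reply code=15058(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Disconnect, from galaxy@
07/30/2012 15:43:01;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 1
07/30/2012 15:43:38;0100;PBS_Server;Req;;Type StatusJob request received from root@sc01, sock=10
"""


def summarize_log(lines):
    """Collect protocol-version mismatches and rejected requests.

    Returns (mismatches, rejects): mismatches is a list of
    (detected, expected) version pairs; rejects is a list of
    (request_type, user, host) tuples. An empty host string corresponds
    to entries such as 'from galaxy@' with nothing after the '@'.
    """
    mismatches = []
    rejects = []
    for line in lines:
        # Assumed layout: timestamp;event code;daemon;class;function;message
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6:
            continue  # skip lines that do not look like log records
        function, message = fields[4], fields[5]
        version = VERSION_RE.search(message)
        if version:
            mismatches.append((int(version.group(1)), int(version.group(2))))
        if function == "req_reject":
            reject = REJECT_RE.search(message)
            if reject:
                rejects.append(reject.groups())
    return mismatches, rejects


if __name__ == "__main__":
    mismatches, rejects = summarize_log(SAMPLE_LOG.splitlines())
    for detected, expected in mismatches:
        print("client sent protocol version %d, server expected %d" % (detected, expected))
    for request_type, user, host in rejects:
        note = "" if host else "  <- no host after '@'"
        print("rejected %s from %s@%s%s" % (request_type, user, host, note))
```

Pointed at the full server log instead of the sample, a mismatch entry for every Galaxy submission, with none for the qsub and qstat requests made from the same machine, would be consistent with the egg rather than the network being the problem.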
 
Hi Dan,

Thanks again for your great hospitality at UCI!

It looks like you've done most of the sane things. A quick check suggests that the PBS egg is speaking protocol version 1 while the server expects protocol version 2 ("1 detected, 2 expected" in your log). This mismatch could be the result of the egg using an old library. Let's first check whether the egg is at fault: compile the pbs_python egg yourself by following the instructions at the link below, then run the example script examples/pbsnodes-a.py from pbs_python to see whether the connection and a simple stats-gathering operation work:

https://subtrac.sara.nl/oss/pbs_python/wiki/TorqueInstallation

-Scott
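The check Scott suggests can be sketched in a few lines of Python, modeled on what examples/pbsnodes-a.py does. This is a minimal sketch, not the pbs_python example itself: it assumes the module built from the egg is importable as pbs, and it uses pbs_default, pbs_connect, pbs_statnode, pbs_disconnect and error as they appear in the pbs_python examples to the best of my knowledge, so verify them against the version you built. The function name list_nodes is hypothetical, and the module is passed in as a parameter so the logic can be exercised without a live server.

```python
def list_nodes(pbs, server=None):
    """Connect to a TORQUE server via pbs_python and return node names.

    `pbs` is the imported pbs_python module. Raises RuntimeError if the
    connection cannot be opened.
    """
    if server is None:
        server = pbs.pbs_default()  # default server from the client config
    conn = pbs.pbs_connect(server)
    if conn < 0:
        # pbs.error() is assumed to return an (error code, message) pair
        err_code, err_text = pbs.error()
        raise RuntimeError("pbs_connect(%r) failed: %s (code %s)"
                           % (server, err_text, err_code))
    try:
        # Same arguments as pbsnodes-a.py: all nodes, all attributes
        nodes = pbs.pbs_statnode(conn, "", "NULL", "NULL")
        return [node.name for node in (nodes or [])]
    finally:
        pbs.pbs_disconnect(conn)  # always release the server connection


if __name__ == "__main__":
    try:
        import pbs  # pbs_python, built against the local libtorque
    except ImportError:
        pbs = None
        print("pbs_python is not importable from this interpreter")
    if pbs is not None:
        for name in list_nodes(pbs):
            print(name)
```

If this raises, or returns no nodes while pbsnodes -a on the same machine lists them, checking the server log for the same "conflicting version numbers" rejection would help confirm whether the egg is the culprit.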
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
 
I've been told by Galaxy folks to keep this on galaxy-dev. My apologies for any confusion.

-Scott
participants (2)
- Daniel Patrick Sullivan
- Scott McManus