Hi Steve,

Apologies, I didn't check the Galaxy list before sending you an email. I came to mostly the same conclusion. I installed the Torque 4.x client on the submit node, and I can submit jobs that way through the command line without issue. I can't get pbs_submit to work from pbs_python, however. It seems the problem is either in how SWIG translates the C code, or in the Python library failing to talk to trqauthd over localhost:15005, or some other mysterious error.

DRMAA was my first choice, but our server was configured without --enable-drmaa, so I haven't been able to submit to it that way either. We've used PBS before, so I thought it was a pretty safe backup plan!

Luckily, I don't have to do staging; we mounted our shared filesystem onto the VM running Galaxy - you might look into Lustre if you have any ability to control that. I highly recommend bribing your sysadmins with beer.

I do hope there will be continued work done to address this issue - not because I have anything against DRMAA, but because I suspect the error stems from false assumptions somewhere in the code that would do well to be fixed.

Thanks for your help!

Carrie Ganote

________________________________
From: Steve.Mcmahon@csiro.au [Steve.Mcmahon@csiro.au]
Sent: Friday, April 05, 2013 2:24 AM
To: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] PBS_Python Unable to submit jobs

Hi Carrie,

I've had the same problem. I wanted to get Galaxy to submit to a cluster running Torque 4.x. Torque clients need to be 4.x to work with that version of the server. I spent a bit of time looking into this and determined that the pbs_python used by Galaxy is not compatible with Torque 4.x. A new version would need to be built.

At that stage I investigated using the DRMAA runner to talk to the Torque 4.x server. That did work if I built the Torque clients with the server name hard-coded via --with-default-server. What the DRMAA runner didn't do was data staging, as the PBS runner does, so I started working on some code for that. I'm now looking at giving up on data staging by moving the Galaxy instance to the cluster. Sorry I couldn't help.

I would be interested in comments from Galaxy developers about whether the PBS runner will be supported in the future and, hence, whether Torque 4.x will be supported. I'm also interested in whether the DRMAA runner will support data staging, or whether Galaxy instances really need to share file systems with a cluster.

Regards,

Steve McMahon
Solutions architect & senior systems administrator
ASC Cluster Services
Information Management & Technology (IM&T)
CSIRO
Phone: +61-2-62142968 | Mobile: +61-4-00779318
steve.mcmahon@csiro.au | www.csiro.au
PO Box 225, DICKSON ACT 2602
1 Wilf Crane Crescent, Yarralumla ACT 2600

From: galaxy-dev-bounces@lists.bx.psu.edu [mailto:galaxy-dev-bounces@lists.bx.psu.edu] On Behalf Of Ganote, Carrie L
Sent: Friday, 5 April 2013 4:52 AM
To: galaxy-dev@bx.psu.edu
Subject: [galaxy-dev] PBS_Python Unable to submit jobs

Hi Galaxy dev,

My setup is a bit non-standard, but I'm getting the following error:

    galaxy.jobs.runners.pbs WARNING 2013-04-04 13:24:00,590 (75) pbs_submit failed (try 1/5), PBS error 15044: Resources temporarily unavailable

Here is my setup: Torque 3 is installed in /usr/local/bin, and I can use it to connect to the (default) server1. Torque 4 is installed in /N/soft, and I can use it to connect to server2. I'm running trqauthd, so Torque 4 should work.
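(A minimal connectivity check with the same pbs_python module, outside Galaxy, could look like the sketch below. This is not code from the thread: the server name server2 and the trqauthd detail come from the setup above, and it assumes the scrambled pbs_python build is importable as pbs.)

    # Hypothetical sanity check, not part of Galaxy: verify that this
    # pbs_python build can reach the Torque 4.x server through trqauthd.
    import pbs

    server = "server2"                 # Torque 4.x server from the setup above
    conn = pbs.pbs_connect(server)     # Torque 4.x clients authenticate via trqauthd
    if conn < 0:
        err, err_txt = pbs.error()     # pbs.error() returns (errno, message)
        print("pbs_connect failed: %s (%s)" % (err, err_txt))
    else:
        print("connected to %s, handle %d" % (server, conn))
        pbs.pbs_disconnect(conn)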
I can submit jobs to both servers from the command line. For server2, I specify the path to qsub and the server name (-q batch@server2).

In Galaxy, I used torquelib_dir=/N/soft to scramble pbs_python. My PATH points at /N/soft first, so 'which qsub' returns the Torque 4 qsub.

If I just use pbs:///, it submits a job to server1 (which shouldn't work, because /N/soft/qsub doesn't work against server1 from the command line, since the default server1 is running Torque 3).

If I use pbs://-l vmem=100mb,walltime=00:30:00/, it doesn't work (the server string in pbs.py becomes "-l vmem=100mb,walltime=00:30:00" instead of "server1").

If I use pbs://server2/, I get the "Resources temporarily unavailable" error above. The server string is server2, and I put the following in pbs.py:

    whichq = os.popen("which qsub").read()
    stats = os.popen("qstat @server2").read()

These return the correct values for server2 using the correct Torque version 4. I'm stumped as to why this is not making the connection. It's probably something about the Python implementation I'm overlooking.

Thanks for any advice,

Carrie Ganote
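For reference, a standalone pbs_submit test roughly along the lines of what Galaxy's PBS runner attempts might look like the sketch below. The script path /tmp/test_job.sh and the job name are placeholders, the server and queue names come from the messages above, and the whole thing assumes the same scrambled pbs_python module is importable as pbs.

    # Hypothetical standalone submission test; not Galaxy code. It assumes the
    # scrambled pbs_python (built with torquelib_dir=/N/soft) is importable
    # and that /tmp/test_job.sh is an existing, executable job script.
    import pbs

    server = "server2"
    conn = pbs.pbs_connect(server)
    if conn < 0:
        err, err_txt = pbs.error()
        raise SystemExit("pbs_connect failed: %s (%s)" % (err, err_txt))

    # A single job attribute (the job name); Galaxy's runner builds a longer list.
    attrs = pbs.new_attropl(1)
    attrs[0].name = pbs.ATTR_N
    attrs[0].value = "pbs_python_test"

    job_id = pbs.pbs_submit(conn, attrs, "/tmp/test_job.sh", "batch", "NULL")
    if not job_id:
        err, err_txt = pbs.error()     # error 15044 here would mirror the Galaxy log
        print("pbs_submit failed: %s (%s)" % (err, err_txt))
    else:
        print("submitted %s" % job_id)
    pbs.pbs_disconnect(conn)

If a script like this reproduces PBS error 15044 outside Galaxy, the problem would likely sit in the pbs_python/trqauthd layer rather than in the runner itself; if it submits cleanly, the runner's connection handling or attribute list would be the next place to look.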