Sonali Amonkar wrote:
On further digging, we found that the script is failing in the following part of $GALAXY_HOME/lib/galaxy/jobs/runners/pbs.py:
# submit
galaxy_job_id = job_wrapper.job_id
log.debug("(%s) submitting file %s" % ( galaxy_job_id, job_file ) )
log.debug("(%s) command is: %s" % ( galaxy_job_id, command_line ) )
job_id = pbs.pbs_submit(c, job_attrs, job_file, pbs_queue_name, None)
This is the line that is failing: pbs_submit() does not submit the job.
pbs.pbs_disconnect(c)
# check to see if it submitted
if not job_id:
    errno, text = pbs.error()
    log.debug( "(%s) pbs_submit failed, PBS error %d: %s" % (galaxy_job_id, errno, text) )
    job_wrapper.fail( "Unable to run this job due to a cluster error" )
    return
Could this be a problem related to the pbs_python egg (v. pbs_python-4.1.0) being used by Galaxy, or is it a Torque-specific issue? Just to reiterate, we are on a development snapshot of Torque, which is hard to replace because many other people are using it.
It's possible that pbs_python is generating code that is incompatible, but since it's linked against your version of TORQUE, this should not be the case. It's hard to say exactly what's causing this, since the failure is outside of Galaxy. I'm not sure whether TORQUE has any client-side debugging that would help with this issue, but that's where I'd start.
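One way to narrow this down is to make the same pbs_submit call outside of Galaxy and look at the PBS error directly. The following is only a rough sketch, not Galaxy code: the function name try_submit and the job name "submit_test" are made up for illustration, and it assumes that pbs_default, pbs_connect, new_attropl and ATTR_N behave in your pbs_python build the way they do in the 4.x releases. The pbs module is passed in as an argument so the function can also be exercised with a stand-in object.

```python
def try_submit(pbs, job_file, queue=None, server=None):
    """Submit job_file through pbs_python without involving Galaxy.

    `pbs` is the imported pbs_python module (or a stand-in for testing).
    Returns a tuple (job_id, errno, text); job_id is None on failure.
    """
    # Fall back to the default server configured for the TORQUE client.
    server = server or pbs.pbs_default()

    c = pbs.pbs_connect(server)
    if c <= 0:
        # Could not even connect, so report the PBS error and stop.
        errno, text = pbs.error()
        return None, errno, text

    # One job attribute only: the job name (hypothetical value).
    attrs = pbs.new_attropl(1)
    attrs[0].name = pbs.ATTR_N
    attrs[0].value = "submit_test"

    # Same call shape as in the pbs.py runner quoted above.
    job_id = pbs.pbs_submit(c, attrs, job_file, queue, None)

    # Read the error before disconnecting, in case disconnecting resets it.
    if job_id:
        errno, text = 0, ""
    else:
        errno, text = pbs.error()

    pbs.pbs_disconnect(c)
    return job_id, errno, text
```

On a submission host this would be called as try_submit(pbs, job_file) after "import pbs", with job_file pointing at any small test script. If it fails there with the same PBS error, Galaxy is out of the picture and the problem lies between pbs_python and the TORQUE snapshot.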
Also, could you please advise which Torque and pbs_python version combinations you have successfully tested against?
We're using an older version (2.1.11) on our submission hosts since we saw performance problems when using pbs_python with the newer 2.4.x versions. The TORQUE server and execution hosts run 2.4.9.
Regards, Sonali
PS: pbs_python has a new version 4.3 out (https://subtrac.sara.nl/oss/pbs_python/wiki/TorqueInstallation), why is this not in the PSU egg repository yet? Would that make a difference?
I'm not sure if it would make a difference. I upgrade the pbs_python egg as necessary or when it's particularly far out of date.
--nate