Sonali Amonkar wrote:
We are still awaiting replies about the error from the Torque community. As for debugging, we did try tracejob; however, since the job was never actually submitted, Torque had no logging for it (it wasn't even a job yet). Meanwhile, we are retrying the Galaxy deployment on a different version of Torque (2.3.6) with pbs_python (2.6), but now face a new error:
galaxy.jobs.runners.pbs DEBUG 2011-02-25 04:59:18,345 (34/2519.server) Removed from PBS queue before job completion
This would indicate the job is being stopped either by a user or by the job walltime or job output size limit configured in universe_wsgi.ini.
galaxy.jobs.runners.pbs DEBUG 2011-02-25 04:59:18,344 (34/2519.server) PBS job has left queue
galaxy.jobs.runners.pbs DEBUG 2011-02-25 04:59:18,351 Job output not returned by PBS: the output datasets were deleted while the job was running, the job was manually dequeued or there was a cluster error.
One particular job gets removed, which fails the entire workflow. Please let me know if you have any information or have come across this error before.
Many thanks for your time.
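For anyone following up with tracejob on messages like the ones above, the Galaxy job ID and the PBS job ID can be pulled out of the runner's log lines. This is only a minimal sketch: it assumes the log format is exactly as quoted in this message, and the names `LOG_LINE` and `parse_pbs_log_line` are hypothetical helpers, not part of Galaxy.

```python
import re

# Assumed layout, taken from the log lines quoted in this thread:
#   <logger> <LEVEL> <YYYY-MM-DD HH:MM:SS,mmm> [(<galaxy id>/<pbs id>)] <message>
# The "(galaxy id/pbs id)" prefix is optional, since some lines omit it.
LOG_LINE = re.compile(
    r"^(?P<logger>\S+)\s+(?P<level>[A-Z]+)\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?:\((?P<galaxy_id>\d+)/(?P<pbs_id>[^)]+)\)\s+)?"
    r"(?P<message>.*)$"
)


def parse_pbs_log_line(line):
    """Return a dict of the fields in one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    sample = (
        "galaxy.jobs.runners.pbs DEBUG 2011-02-25 04:59:18,345 "
        "(34/2519.server) Removed from PBS queue before job completion"
    )
    fields = parse_pbs_log_line(sample)
    print(fields["galaxy_id"], fields["pbs_id"], fields["message"])
```

The extracted PBS job ID (here `2519.server`) is the value to look up with tracejob on the Torque server once the job has actually reached the queue.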
-----Original Message-----
From: Nate Coraor [mailto:email@example.com]
Sent: Tuesday, February 15, 2011 10:30 PM
To: Sonali Amonkar
Cc: Galaxy Dev
Subject: Re: [galaxy-dev] Error with setuptools version in Galaxy installation on Cluster
Sonali Amonkar wrote:
On further digging, we found that the script is failing in the following part of $GALAXY_HOME/lib/galaxy/jobs/runners/pbs.py:
    # submit
    galaxy_job_id = job_wrapper.job_id
    log.debug("(%s) submitting file %s" % ( galaxy_job_id, job_file ) )
    log.debug("(%s) command is: %s" % ( galaxy_job_id, command_line ) )
    job_id = pbs.pbs_submit(c, job_attrs, job_file,
This is the line here; it's failing to submit the job.
    pbs.pbs_disconnect(c)
    # check to see if it submitted
    if not job_id:
        errno, text = pbs.error()
        log.debug( "(%s) pbs_submit failed, PBS error %d: %s" % (galaxy_job_id, errno, text) )
        job_wrapper.fail( "Unable to run this job due to a cluster error" )
        return
Could this be a problem related to the pbs_python egg (v. pbs_python-4.1.0) being used by Galaxy, or a Torque-specific issue? Just to reiterate, we are on a development snapshot of Torque, which is hard to replace as many other people are using it.
It's possible that pbs_python is generating code which is incompatible, but since it's linked against your version of TORQUE this should not be the case.
It's hard to say exactly what's causing this since it's outside of Galaxy. I'm not sure if TORQUE has any client-side debugging that would help with this issue but that's where I'd start.
Also, could you please advise which Torque and pbs_python version combinations you have successfully tested against?
We're using an older version (2.1.11) on our submission hosts since we saw performance problems when using pbs_python with the newer 2.4.x versions.
The TORQUE server and execution hosts run 2.4.9.
PS: pbs_python has a new version 4.3 out (https://subtrac.sara.nl/oss/pbs_python/wiki/TorqueInstallation), why is this not in the PSU egg repository yet? Would that make a difference?
I'm not sure if it would make a difference. I upgrade the pbs_python egg as necessary or when it's particularly far out of date.