Hi,

Thanks a lot, it actually helped. It is not exactly as straightforward in drmaa.py, but somehow I could manage. However, that was not the problem: for some reason the user needs to write files from the node to job_working_directory/00X/XXXX/, and the latter is not world-writable. I had to chmod everything to 777 to make it work. Did I miss something?

Best,
L-A
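A possibly less drastic alternative than 777, assuming the galaxy account and the submitting users share a Unix group: make the job_working_directory tree group-writable with the setgid bit, so new entries inherit the group. A minimal sketch (the group name and path are placeholders, not Galaxy settings):

import grp
import os
import stat

JOB_DIR = "database/job_working_directory"  # placeholder path
SHARED_GROUP = "galaxyusers"                # placeholder group

gid = grp.getgrnam( SHARED_GROUP ).gr_gid
for root, dirs, files in os.walk( JOB_DIR ):
    # rwx for owner and group, setgid so new entries inherit the group
    os.chown( root, -1, gid )
    os.chmod( root, stat.S_IRWXU | stat.S_IRWXG | stat.S_ISGID )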
On 24/04/2012 15:17, Alban Lermine wrote:

Hi L-A,
I run Galaxy as the real user on our cluster with PBS (free version).
We first configured LDAP authentication so that each email account maps to a Unix account (just cut the @curie.fr). Then I modified pbs.py (in <GALAXY_DIR>/galaxy-dist/lib/galaxy/jobs/runners).
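The cut is just the local part of the address; for example:

user_mail = "jdoe@curie.fr".split( "@" )  # purely illustrative address
username = user_mail[0]                   # -> "jdoe"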
I simply disconnected the PBS submission through the Python library and replaced it with a system call (just like sending jobs to the cluster from the command line); here is the code used:
galaxy_job_id = job_wrapper.job_id
log.debug("(%s) submitting file %s" % ( galaxy_job_id, job_file ) )
log.debug("(%s) command is: %s" % ( galaxy_job_id, command_line ) )
# Submit job with a system call instead of the Python PBS library -
# permits running jobs as .. with a sudo -u cmd prefix
galaxy_job_idSTR = str(job_wrapper.job_id)
galaxy_tool_idSTR = str(job_wrapper.tool.id)
galaxy_job_name = galaxy_job_idSTR+"_"+galaxy_tool_idSTR+"_"+job_wrapper.user
torque_options = runner_url.split("/")
queue = torque_options[3]
ressources = torque_options[4]
user_mail = job_wrapper.user.split("@")
username = user_mail[0]
torque_cmd = "sudo -u username echo "+"\""+command_line+"\" | qsub -o "+ofile+" -e "+efile+" -M "+job_wrapper.user+" -N "+galaxy_job_name+" -q "+queue+" "+ressources
submit_pbs_job = os.popen(torque_cmd)
job_id = submit_pbs_job.read().rstrip("\n")
# Original job launcher
#job_id = pbs.pbs_submit(c, job_attrs, job_file, pbs_queue_name, None)
pbs.pbs_disconnect(c)
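Note that command_line is spliced straight into a shell string above. A variant using subprocess and pipes.quote would be safer against quoting surprises; this is only a sketch under the same assumptions (in particular that runner_url looks like pbs:///queue_name/-l nodes=1, which is what the split above expects), not the code in production:

import pipes
import subprocess

# Same submission, but every interpolated value is shell-quoted
torque_cmd = "echo %s | sudo -u %s qsub -o %s -e %s -M %s -N %s -q %s %s" % (
    pipes.quote( command_line ),
    pipes.quote( username ),
    pipes.quote( ofile ),
    pipes.quote( efile ),
    pipes.quote( job_wrapper.user ),
    pipes.quote( galaxy_job_name ),
    pipes.quote( queue ),
    ressources,  # already shell-formatted qsub options, passed through as-is
)
proc = subprocess.Popen( torque_cmd, shell=True, stdout=subprocess.PIPE )
job_id = proc.communicate()[0].rstrip( "\n" )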
The second thing I did was to wait for the error and output files from Torque in the finish_job function (otherwise I never receive the output, which seems to be your problem..); here is the code used:
def finish_job( self, pbs_job_state ):
    """
    Get the output/error for a finished job, pass to `job_wrapper.finish`
    and cleanup all the PBS temporary files.
    """
    ofile = pbs_job_state.ofile
    efile = pbs_job_state.efile
    job_file = pbs_job_state.job_file
    # collect the output
    try:
        # With the qsub system call, we need to wait for efile and ofile to
        # be created at the end of the job execution before reading them
        while not os.path.isfile( efile ):
            time.sleep( 1 )
        while not os.path.isfile( ofile ):
            time.sleep( 1 )
        # Back to original code
        ofh = file(ofile, "r")
        efh = file(efile, "r")
        stdout = ofh.read( 32768 )
        stderr = efh.read( 32768 )
    except:
        stdout = ''
        stderr = 'Job output not returned by PBS: the output datasets were deleted while the job was running, the job was manually dequeued or there was a cluster error.'
        log.debug(stderr)
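One caveat with the two wait loops: they spin forever if the job dies without ever writing its files. A bounded wait along these lines would fall through to the except branch instead (a sketch; the 10 minute cap is an arbitrary choice, and it would replace both loops inside the try block):

deadline = time.time() + 600  # arbitrary 10 minute cap
while not ( os.path.isfile( efile ) and os.path.isfile( ofile ) ):
    if time.time() > deadline:
        raise Exception( "PBS output files never appeared" )
    time.sleep( 1 )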
* The last step is to allow the galaxy user to run sudo
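For reference, that can be a sudoers entry added with visudo, along these lines (the qsub path and the passwordless setting are assumptions to adapt to your site):

# Hypothetical /etc/sudoers entry: let the galaxy account run qsub as any user
galaxy ALL=(ALL) NOPASSWD: /usr/bin/qsub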
Hope it helps you find your problem..
See you,
Alban