On 24/04/2012 14:53, Louise-Amélie Schmitt wrote:
At first we thought it could be an ssh issue but submitting jobs and getting the output back isn't a problem when I do it from my personal user manually, so it's really related to Galaxy. We're using PBS Pro btw.
And I'm still at a loss... :(
L-A
On 23/04/2012 15:42, zhengqiu cai wrote:
I am having the same problem when I use Condor as the scheduler instead of SGE.
Cai
--- On Mon, 23 April 2012, Louise-Amélie Schmitt <louise-amelie.schmitt@embl.de> wrote:
From: Louise-Amélie Schmitt <louise-amelie.schmitt@embl.de>
Subject: [galaxy-dev] Error: Job output not returned from cluster
To: galaxy-dev@lists.bx.psu.edu
Date: Monday, 23 April 2012, 5:09 PM

Hello everyone,
I'm still trying to set up job submission as the real user, and I get a mysterious error. The job obviously runs somewhere, and when it ends it is in an error state and displays the following message: "Job output not returned from cluster"
In the Galaxy log I have the following lines when the job finishes running:
galaxy.jobs.runners.drmaa DEBUG 2012-04-23 10:36:41,509 (1455/9161620.pbs-master2.embl.de) state change: job finished, but failed
galaxy.jobs.runners.drmaa DEBUG 2012-04-23 10:36:41,511 Job output not returned from cluster
galaxy.jobs DEBUG 2012-04-23 10:36:41,547 finish(): Moved /g/funcgen/galaxy-dev/database/job_working_directory/001/1455/galaxy_dataset_2441.dat to /g/funcgen/galaxy-dev/database/files/002/dataset_2441.dat
galaxy.jobs DEBUG 2012-04-23 10:36:41,755 job 1455 ended
galaxy.datatypes.metadata DEBUG 2012-04-23 10:36:41,755 Cleaning up external metadata files
galaxy.datatypes.metadata DEBUG 2012-04-23 10:36:41,768 Failed to cleanup MetadataTempFile temp files from /g/funcgen/galaxy-dev/database/job_working_directory/001/1455/metadata_out_HistoryDatasetAssociation_1606_npFIJM: No JSON object could be decoded: line 1 column 0 (char 0)
The /g/funcgen/galaxy-dev/database/job_working_directory/001/1455/ directory is empty and /g/funcgen/galaxy-dev/database/files/002/dataset_2441.dat exists but is empty.
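(For what it's worth, that "No JSON object could be decoded" failure is exactly what Python's json module raises on an empty file, which fits the empty metadata_out file above; a minimal demonstration, noting that the message wording shown in the log is Python 2's, while Python 3 says "Expecting value":)

```python
import json

# Decoding an empty string fails at line 1, the same failure mode the
# Galaxy log reports for the empty metadata_out_HistoryDatasetAssociation file.
try:
    json.loads("")
except ValueError as e:  # json.JSONDecodeError is a ValueError subclass
    print("decode failed:", e)
```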
Any ideas about what can go wrong there? Any lead would be immensely appreciated!
Thanks, L-A
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
Hi L-A,
I run Galaxy as the real user on our cluster with PBS (the free version).
We first configured LDAP authentication so that each email account maps to a Unix account (just cut the @curie.fr). Then I modified pbs.py (in <GALAXY_DIR>/galaxy-dist/lib/galaxy/jobs/runners).
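(The email-to-username mapping described here, "just cut the @curie.fr", is the same split the submission code performs on job_wrapper.user; as a standalone sketch, with the helper name being my own and assuming LDAP keeps emails and Unix accounts in sync:)

```python
def unix_username(galaxy_email):
    """Derive the Unix account name from the Galaxy login email,
    e.g. "jdoe@curie.fr" -> "jdoe"."""
    return galaxy_email.split("@")[0]

print(unix_username("jdoe@curie.fr"))  # -> jdoe
```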
I simply bypassed the PBS submission through the Python library and replaced it with a system call (just like I send jobs to the cluster from the command line); here is the code used:
    galaxy_job_id = job_wrapper.job_id
    log.debug("(%s) submitting file %s" % ( galaxy_job_id, job_file ) )
    log.debug("(%s) command is: %s" % ( galaxy_job_id, command_line ) )

    # Submit job with a system call instead of the Python PBS library -
    # permits running jobs as the real user with a "sudo -u" command prefix
    galaxy_job_idSTR = str(job_wrapper.job_id)
    galaxy_tool_idSTR = str(job_wrapper.tool.id)
    galaxy_job_name = galaxy_job_idSTR + "_" + galaxy_tool_idSTR + "_" + job_wrapper.user
    torque_options = runner_url.split("/")
    queue = torque_options[3]
    ressources = torque_options[4]
    user_mail = job_wrapper.user.split("@")
    username = user_mail[0]

    torque_cmd = "sudo -u " + username + " echo \"" + command_line + "\" | qsub -o " + ofile + " -e " + efile + " -M " + job_wrapper.user + " -N " + galaxy_job_name + " -q " + queue + " " + ressources
    submit_pbs_job = os.popen(torque_cmd)
    job_id = submit_pbs_job.read().rstrip("\n")

    # Original job launcher
    #job_id = pbs.pbs_submit(c, job_attrs, job_file, pbs_queue_name, None)
    pbs.pbs_disconnect(c)
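(A side note on the submission above: building one big shell string for os.popen is fragile if command_line contains quotes. A subprocess-based variant that pipes the script into qsub's stdin avoids that; this is my own hedged sketch, not Alban's code, and it assumes running qsub itself under sudo is acceptable - it has not been tested against a live PBS install:)

```python
import subprocess

def build_qsub_argv(username, ofile, efile, mail, job_name, queue, resources):
    """Assemble the argv for the sudo/qsub submission; keeping it a list
    (rather than one concatenated shell string) sidesteps quoting problems."""
    return (["sudo", "-u", username, "qsub",
             "-o", ofile, "-e", efile,
             "-M", mail, "-N", job_name,
             "-q", queue] + resources.split())

def submit_via_qsub(command_line, argv):
    """Pipe the job script into qsub's stdin and return the PBS job id."""
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = proc.communicate(command_line.encode())
    return out.decode().rstrip("\n")
```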
The second thing I did was to wait for the error and output files from Torque in the finish_job function (otherwise I never receive the output, which seems to be your problem); here is the code used:
    def finish_job( self, pbs_job_state ):
        """
        Get the output/error for a finished job, pass to `job_wrapper.finish`
        and cleanup all the PBS temporary files.
        """
        ofile = pbs_job_state.ofile
        efile = pbs_job_state.efile
        job_file = pbs_job_state.job_file

        # collect the output
        try:
            # With the qsub system call, we need to wait for efile and ofile
            # to be created at the end of the job execution before reading them
            while not os.path.isfile(efile):
                time.sleep( 1 )
            while not os.path.isfile(ofile):
                time.sleep( 1 )
            # Back to original code
            ofh = file(ofile, "r")
            efh = file(efile, "r")
            stdout = ofh.read( 32768 )
            stderr = efh.read( 32768 )
        except:
            stdout = ''
            stderr = 'Job output not returned by PBS: the output datasets were deleted while the job was running, the job was manually dequeued or there was a cluster error.'
            log.debug(stderr)
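(One caveat with the wait loops above: if PBS never writes the files at all - node crash, failed submission - finish_job will spin forever. A bounded variant could look like this; the timeout value is my own choice, not part of Alban's setup:)

```python
import os
import time

def wait_for_file(path, timeout=300, poll=1):
    """Poll for `path` for up to `timeout` seconds; return True if it
    appeared. On NFS the o/e files may land a little after the job ends,
    so a grace period is needed, but an unbounded wait can hang the runner."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.isfile(path):
            return True
        time.sleep(poll)
    return os.path.isfile(path)
```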
* The last step is to allow the galaxy user to run sudo
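(For reference, "allow the galaxy user to run sudo" typically amounts to a sudoers fragment along these lines, edited with visudo; the path, command, and NOPASSWD choice here are illustrative assumptions, not Alban's actual config - the exact command list depends on how the submission pipeline invokes sudo:)

```
# /etc/sudoers.d/galaxy (illustrative): let the galaxy service account
# run qsub as any cluster user without a password prompt
galaxy ALL=(ALL) NOPASSWD: /usr/bin/qsub
```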
Hope it helps you track down your problem.
See you,
Alban