Error: Job output not returned from cluster
Hello everyone,

I'm still trying to set up job submission as the real user, and I get a mysterious error. The job obviously runs somewhere, and when it ends it is in an error state and displays the following message: "Job output not returned from cluster"

In the Galaxy log I have the following lines when the job finishes running:

    galaxy.jobs.runners.drmaa DEBUG 2012-04-23 10:36:41,509 (1455/9161620.pbs-master2.embl.de) state change: job finished, but failed
    galaxy.jobs.runners.drmaa DEBUG 2012-04-23 10:36:41,511 Job output not returned from cluster
    galaxy.jobs DEBUG 2012-04-23 10:36:41,547 finish(): Moved /g/funcgen/galaxy-dev/database/job_working_directory/001/1455/galaxy_dataset_2441.dat to /g/funcgen/galaxy-dev/database/files/002/dataset_2441.dat
    galaxy.jobs DEBUG 2012-04-23 10:36:41,755 job 1455 ended
    galaxy.datatypes.metadata DEBUG 2012-04-23 10:36:41,755 Cleaning up external metadata files
    galaxy.datatypes.metadata DEBUG 2012-04-23 10:36:41,768 Failed to cleanup MetadataTempFile temp files from /g/funcgen/galaxy-dev/database/job_working_directory/001/1455/metadata_out_HistoryDatasetAssociation_1606_npFIJM: No JSON object could be decoded: line 1 column 0 (char 0)

The /g/funcgen/galaxy-dev/database/job_working_directory/001/1455/ directory is empty, and /g/funcgen/galaxy-dev/database/files/002/dataset_2441.dat exists but is empty.

Any ideas about what can go wrong there? Any lead would be immensely appreciated!

Thanks,
L-A
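For context: this message is what a cluster runner reports when, after the scheduler says the job finished, it cannot read the job's stdout/stderr files back. Below is a minimal sketch of that pattern, with hypothetical names; the actual code in lib/galaxy/jobs/runners/drmaa.py differs in detail (the pbs.py excerpt later in this thread shows the same structure):

    # Sketch of the output-collection step a runner performs once the
    # scheduler reports the job done (function and variable names are
    # hypothetical, not Galaxy's actual identifiers).
    def collect_output(ofile, efile):
        try:
            stdout = open(ofile, "r").read(32768)
            stderr = open(efile, "r").read(32768)
        except IOError:
            # stdout/stderr never made it back from the compute node,
            # e.g. a permissions or staging failure.
            stdout = ''
            stderr = 'Job output not returned from cluster'
        return stdout, stderr

An empty job working directory like the one above means the job never wrote its files there, so the read fails and the dataset ends up empty and errored.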
I am having the same problem when I use condor as the scheduler instead of sge.

Cai
At first we thought it could be an ssh issue, but submitting jobs and getting the output back isn't a problem when I do it manually as my personal user, so it's really related to Galaxy. We're using PBS Pro, btw.

And I'm still at a loss... :(

L-A
Hi L-A,
I run Galaxy as the real user on our cluster with PBS (free version).

We first configured LDAP authentication so that each email account maps to a unix account (just cut the @curie.fr). Then I modified pbs.py (in <GALAXY_DIR>/galaxy-dist/lib/galaxy/jobs/runners): I disconnected the PBS submission through the python library and replaced it with a system call, just like sending jobs to the cluster from the command line. Here is the code used:

    galaxy_job_id = job_wrapper.job_id
    log.debug("(%s) submitting file %s" % ( galaxy_job_id, job_file ) )
    log.debug("(%s) command is: %s" % ( galaxy_job_id, command_line ) )

    # Submit the job with a system call instead of the python PBS library -
    # permits running jobs as the real user via a "sudo -u" prefix.
    galaxy_job_idSTR = str(job_wrapper.job_id)
    galaxy_tool_idSTR = str(job_wrapper.tool.id)
    galaxy_job_name = galaxy_job_idSTR + "_" + galaxy_tool_idSTR + "_" + job_wrapper.user

    # Queue and resources come from the runner URL,
    # e.g. pbs://<server>/<queue>/<resources>
    torque_options = runner_url.split("/")
    queue = torque_options[3]
    resources = torque_options[4]

    # Unix account name = email address minus the domain part
    user_mail = job_wrapper.user.split("@")
    username = user_mail[0]

    # Pipe the command line to qsub, running qsub itself as the real user
    torque_cmd = ( "echo \"" + command_line + "\" | sudo -u " + username
                   + " qsub -o " + ofile + " -e " + efile
                   + " -M " + job_wrapper.user + " -N " + galaxy_job_name
                   + " -q " + queue + " " + resources )
    submit_pbs_job = os.popen(torque_cmd)
    job_id = submit_pbs_job.read().rstrip("\n")

    # Original job launcher:
    # job_id = pbs.pbs_submit(c, job_attrs, job_file, pbs_queue_name, None)
    pbs.pbs_disconnect(c)

The second thing I did was to wait for the error and output files from torque in the finish_job function (if not, I never receive the output, which seems to be your problem..). Here is the code used:

    def finish_job( self, pbs_job_state ):
        """
        Get the output/error for a finished job, pass to `job_wrapper.finish`
        and cleanup all the PBS temporary files.
        """
        ofile = pbs_job_state.ofile
        efile = pbs_job_state.efile
        job_file = pbs_job_state.job_file
        # collect the output
        try:
            # With the qsub system call, we have to wait for torque to copy
            # efile and ofile back at the end of the job before reading them
            while not os.path.isfile(efile):
                time.sleep( 1 )
            while not os.path.isfile(ofile):
                time.sleep( 1 )
            # Back to original code
            ofh = file(ofile, "r")
            efh = file(efile, "r")
            stdout = ofh.read( 32768 )
            stderr = efh.read( 32768 )
        except:
            stdout = ''
            stderr = 'Job output not returned by PBS: the output datasets were deleted while the job was running, the job was manually dequeued or there was a cluster error.'
            log.debug(stderr)

The last step is to allow the galaxy user to run sudo.

Hope this helps you find your problem..

See you,

Alban

--
Alban Lermine
Unité 900 : Inserm - Mines ParisTech - Institut Curie
« Bioinformatics and Computational Systems Biology of Cancer »
11-13 rue Pierre et Marie Curie (1er étage) - 75005 Paris - France
Tel : +33 (0) 1 56 24 69 84
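For reference, that sudo step corresponds to a sudoers entry along these lines (a sketch only; the qsub path and the unrestricted target-user list are assumptions to adapt to local policy):

    # /etc/sudoers (edit with visudo): let the galaxy account run qsub as
    # any user without a password; path and scope are site-specific guesses.
    galaxy  ALL=(ALL)  NOPASSWD: /usr/bin/qsub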
Hi,

Thanks a lot, it actually helped. It is not exactly as straightforward in drmaa.py, but somehow I could manage.

However, it was not the problem. For some reason, the user needs to write files from the node into job_working_directory/00X/XXXX/, and the latter is not world-writable. I had to chmod everything to 777 to make it work. Did I miss something?

Best,
L-A
On Apr 25, 2012, at 8:25 AM, Louise-Amélie Schmitt wrote:

> I had to chmod everything to 777 to make it work. Did I miss something?
That doesn't seem right. The subdirectory for your job underneath job_working_directory/00X/XXX/ should be chowned to the real user before the job is submitted, and then chowned back to the galaxy user once the job is complete.

--nate
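A rough sketch of the flow Nate describes (names are hypothetical; in practice the chown itself needs elevated rights, which a real setup would obtain through a sudo-able helper script rather than the galaxy account calling os.chown directly):

    import os
    import pwd
    import subprocess

    def submit_as_real_user(working_dir, real_user, qsub_cmd):
        real_uid = pwd.getpwnam(real_user).pw_uid
        galaxy_uid = os.getuid()
        # Hand the job working directory to the real user before submission
        # so the job can write its outputs there.
        os.chown(working_dir, real_uid, -1)
        try:
            # e.g. qsub_cmd = ["sudo", "-u", real_user, "qsub", job_file]
            subprocess.check_call(qsub_cmd)
            # ... wait for the scheduler to report the job finished ...
        finally:
            # Take the directory back so Galaxy can collect the outputs,
            # with no need for world-writable (777) permissions.
            os.chown(working_dir, galaxy_uid, -1)

If the first chown never happens, the real user cannot write into the Galaxy-owned working directory, which is exactly the symptom the 777 workaround papers over.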
participants (4)
- Alban Lermine
- Louise-Amélie Schmitt
- Nate Coraor
- zhengqiu cai