On 24/04/2012 14:53, Louise-Amélie Schmitt wrote:
At first we thought it could be an ssh issue but submitting jobs and getting the output back isn't a problem when I do it from my personal user manually, so it's really related to Galaxy. We're using PBS Pro btw.
And I'm still at a loss... :(
L-A
On 23/04/2012 15:42, zhengqiu cai wrote:
I am having the same problem when I use Condor as the scheduler instead of SGE.
Cai
--- On Mon, 23 April 2012, Louise-Amélie Schmitt <louise-amelie.schmitt@embl.de> wrote:
From: Louise-Amélie Schmitt <louise-amelie.schmitt@embl.de>
Subject: [galaxy-dev] Error: Job output not returned from cluster
To: galaxy-dev@lists.bx.psu.edu
Date: Monday, 23 April 2012, 5:09 PM

Hello everyone,
I'm still trying to set up job submission as the real user, and I get a mysterious error. The job obviously runs somewhere, and when it ends it is in an error state and displays the following message: "Job output not returned from cluster"
In the Galaxy log I have the following lines when the job finishes running:
galaxy.jobs.runners.drmaa DEBUG 2012-04-23 10:36:41,509 (1455/9161620.pbs-master2.embl.de) state change: job finished, but failed
galaxy.jobs.runners.drmaa DEBUG 2012-04-23 10:36:41,511 Job output not returned from cluster
galaxy.jobs DEBUG 2012-04-23 10:36:41,547 finish(): Moved /g/funcgen/galaxy-dev/database/job_working_directory/001/1455/galaxy_dataset_2441.dat to /g/funcgen/galaxy-dev/database/files/002/dataset_2441.dat
galaxy.jobs DEBUG 2012-04-23 10:36:41,755 job 1455 ended
galaxy.datatypes.metadata DEBUG 2012-04-23 10:36:41,755 Cleaning up external metadata files
galaxy.datatypes.metadata DEBUG 2012-04-23 10:36:41,768 Failed to cleanup MetadataTempFile temp files from /g/funcgen/galaxy-dev/database/job_working_directory/001/1455/metadata_out_HistoryDatasetAssociation_1606_npFIJM: No JSON object could be decoded: line 1 column 0 (char 0)
The /g/funcgen/galaxy-dev/database/job_working_directory/001/1455/ directory is empty and /g/funcgen/galaxy-dev/database/files/002/dataset_2441.dat exists but is empty.
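(For what it's worth, that "No JSON object could be decoded" failure is exactly what Python's json module raises on an empty file, which fits the empty metadata_out file above; a minimal demonstration, noting that the message wording shown in the log is Python 2's, while Python 3 says "Expecting value":)

```python
import json

# Decoding an empty string fails at line 1, the same failure mode the
# Galaxy log reports for the empty metadata_out_HistoryDatasetAssociation file.
try:
    json.loads("")
except ValueError as e:  # json.JSONDecodeError is a ValueError subclass
    print("decode failed:", e)
```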
Any ideas about what can go wrong there? Any lead would be immensely appreciated!
Thanks, L-A
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
Hi L-A,
I run Galaxy as the real user on our cluster with PBS (the free version).
We first configured LDAP authentication so that each email account maps to a Unix account (just cut the @curie.fr). Then I modified pbs.py (in <GALAXY_DIR>/galaxy-dist/lib/galaxy/jobs/runners).
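(The email-to-username mapping described here, "just cut the @curie.fr", is the same split the submission code performs on job_wrapper.user; as a standalone sketch, with the helper name being my own and assuming LDAP keeps emails and Unix accounts in sync:)

```python
def unix_username(galaxy_email):
    """Derive the Unix account name from the Galaxy login email,
    e.g. "jdoe@curie.fr" -> "jdoe"."""
    return galaxy_email.split("@")[0]

print(unix_username("jdoe@curie.fr"))  # -> jdoe
```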
I simply bypassed the PBS submission through the Python library and replaced it with a system call (just like I send jobs to the cluster from the command line); here is the code used:
    galaxy_job_id = job_wrapper.job_id
    log.debug("(%s) submitting file %s" % ( galaxy_job_id, job_file ) )
    log.debug("(%s) command is: %s" % ( galaxy_job_id, command_line ) )

    # Submit job with a system call instead of the Python PBS library -
    # permits running jobs as the real user with a "sudo -u" command prefix
    galaxy_job_idSTR = str(job_wrapper.job_id)
    galaxy_tool_idSTR = str(job_wrapper.tool.id)
    galaxy_job_name = galaxy_job_idSTR + "_" + galaxy_tool_idSTR + "_" + job_wrapper.user
    torque_options = runner_url.split("/")
    queue = torque_options[3]
    ressources = torque_options[4]
    user_mail = job_wrapper.user.split("@")
    username = user_mail[0]

    torque_cmd = "sudo -u " + username + " echo \"" + command_line + "\" | qsub -o " + ofile + " -e " + efile + " -M " + job_wrapper.user + " -N " + galaxy_job_name + " -q " + queue + " " + ressources
    submit_pbs_job = os.popen(torque_cmd)
    job_id = submit_pbs_job.read().rstrip("\n")

    # Original job launcher
    #job_id = pbs.pbs_submit(c, job_attrs, job_file, pbs_queue_name, None)
    pbs.pbs_disconnect(c)
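(A side note on the submission above: building one big shell string for os.popen is fragile if command_line contains quotes. A subprocess-based variant that pipes the script into qsub's stdin avoids that; this is my own hedged sketch, not Alban's code, and it assumes running qsub itself under sudo is acceptable - it has not been tested against a live PBS install:)

```python
import subprocess

def build_qsub_argv(username, ofile, efile, mail, job_name, queue, resources):
    """Assemble the argv for the sudo/qsub submission; keeping it a list
    (rather than one concatenated shell string) sidesteps quoting problems."""
    return (["sudo", "-u", username, "qsub",
             "-o", ofile, "-e", efile,
             "-M", mail, "-N", job_name,
             "-q", queue] + resources.split())

def submit_via_qsub(command_line, argv):
    """Pipe the job script into qsub's stdin and return the PBS job id."""
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = proc.communicate(command_line.encode())
    return out.decode().rstrip("\n")
```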
The second thing I did was to wait for the error and output files from Torque in the finish_job function (otherwise I never receive the output, which seems to be your problem); here is the code used:
    def finish_job( self, pbs_job_state ):
        """
        Get the output/error for a finished job, pass to `job_wrapper.finish`
        and cleanup all the PBS temporary files.
        """
        ofile = pbs_job_state.ofile
        efile = pbs_job_state.efile
        job_file = pbs_job_state.job_file

        # collect the output
        try:
            # With the qsub system call, we need to wait for efile and ofile
            # to be created at the end of the job execution before reading them
            while not os.path.isfile(efile):
                time.sleep( 1 )
            while not os.path.isfile(ofile):
                time.sleep( 1 )
            # Back to original code
            ofh = file(ofile, "r")
            efh = file(efile, "r")
            stdout = ofh.read( 32768 )
            stderr = efh.read( 32768 )
        except:
            stdout = ''
            stderr = 'Job output not returned by PBS: the output datasets were deleted while the job was running, the job was manually dequeued or there was a cluster error.'
            log.debug(stderr)
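(One caveat with the wait loops above: if PBS never writes the files at all - node crash, failed submission - finish_job will spin forever. A bounded variant could look like this; the timeout value is my own choice, not part of Alban's setup:)

```python
import os
import time

def wait_for_file(path, timeout=300, poll=1):
    """Poll for `path` for up to `timeout` seconds; return True if it
    appeared. On NFS the o/e files may land a little after the job ends,
    so a grace period is needed, but an unbounded wait can hang the runner."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.isfile(path):
            return True
        time.sleep(poll)
    return os.path.isfile(path)
```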
* The last step is to allow the galaxy user to run sudo
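(For reference, "allow the galaxy user to run sudo" typically amounts to a sudoers fragment along these lines, edited with visudo; the path, command, and NOPASSWD choice here are illustrative assumptions, not Alban's actual config - the exact command list depends on how the submission pipeline invokes sudo:)

```
# /etc/sudoers.d/galaxy (illustrative): let the galaxy service account
# run qsub as any cluster user without a password prompt
galaxy ALL=(ALL) NOPASSWD: /usr/bin/qsub
```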
Hope it helps you track down your problem.
See you,
Alban