"Job output not returned from cluster"
I've been getting these errors intermittently of late, particularly when the cluster is heavily loaded. The jobs have completed successfully (I can see the output if I click the pen icon), but Galaxy leaves the job in a failed state. Have any other sites been experiencing this problem? Or can the Galaxy developers help shed some light on the issue? FYI, I use the outputs_to_working_directory option in universe_wsgi.ini so that I can use a faster/more reliable filesystem to collect output from the cluster. I'm not using the recently discussed patch to run jobs as the unix user. I'll continue to experiment with different filesystems and software settings.
My jobs have this problem when the command for the tool is wrapped by the stderr wrapper script.
Ka Ming
________________________________________
From: galaxy-dev-bounces@lists.bx.psu.edu [galaxy-dev-bounces@lists.bx.psu.edu] On Behalf Of Edward Kirton [eskirton@lbl.gov]
Sent: July 28, 2011 3:41 PM
To: galaxy-dev@lists.bx.psu.edu
Subject: [galaxy-dev] "Job output not returned from cluster"
On Fri, Jul 29, 2011 at 1:01 AM, Ka Ming Nip <kmnip@bcgsc.ca> wrote:
My jobs have this problem when the command for the tool is wrapped by the stderr wrapper script.
Ka Ming
Which stderr wrapper script? I think there is more than one... I've also had this error message (I'm currently working out how to connect our Galaxy to our cluster), and in at least one case it was caused by a file permission problem - the tool appeared to run but could not write the output files. If Galaxy could give more diagnostics rather than just "Job output not returned from cluster" it would help. For instance, as we use SGE, perhaps the captured stdout/stderr files might be available. Peter
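The kind of extra diagnostics asked for above could look roughly like the following. This is only a sketch and not Galaxy code: `diagnose_output_path` is a hypothetical helper, and it reports what the account running the check can see, which may differ from the account the cluster job ran under.

```python
import os


def diagnose_output_path(path):
    """Return human-readable problems found with an expected output file.

    Hypothetical helper for illustration; not part of Galaxy.
    """
    problems = []
    directory = os.path.dirname(os.path.abspath(path))

    # Check the directory first: a job that cannot write here produces nothing.
    if not os.path.isdir(directory):
        problems.append("directory does not exist: %s" % directory)
    elif not os.access(directory, os.W_OK | os.X_OK):
        problems.append("directory is not writable: %s" % directory)

    # Then check the file itself.
    if not os.path.exists(path):
        problems.append("output file was never created: %s" % path)
    elif not os.access(path, os.R_OK):
        problems.append("output file is not readable: %s" % path)
    elif os.path.getsize(path) == 0:
        problems.append("output file is empty: %s" % path)

    return problems
```

An empty list means none of these checks found a problem. Any messages returned could be appended to the generic "Job output not returned from cluster" error.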
It was the one on the wiki page.
Ka Ming
________________________________________
From: Peter Cock [p.j.a.cock@googlemail.com]
Sent: July 29, 2011 2:42 AM
To: Ka Ming Nip
Cc: Edward Kirton; galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] "Job output not returned from cluster"
thanks for your comments, fellas. permissions would certainly cause this problem, but that's not the cause for me. most wrappers just serve to redirect stderr, so i don't think it's the wrapper script itself, but the stdout/stderr files are part of the problem.
the error message is thrown in the finish_job method when it can't open the source/dest stdout/stderr files for reading/writing. i split the try statement to add finer-grained error messages, but i had already verified that the files do exist, so it seems to be a file system issue.
i suspect it's because the storage i'm using as a staging area has flash drives between the RAM and the spinning disks, so upon close, the file buffers may get flushed out of RAM to the SSDs but not immediately be available from the SCSI drives. or maybe the (inode) metadata table hasn't finished updating yet. if so, the issue is not that the cluster is heavily utilized, but that the filesystem is. this disk is expressly for staging cluster jobs.
i'll see if adding a short sleep and retrying once upon error solves this problem... but i won't know immediately, as the problem is intermittent. that's the problem with fancy toys; they often come with fancy problems!
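A minimal sketch of the "short sleep and retry once" idea, together with per-file error messages. This is not the actual finish_job code from Galaxy: the function names, the `retries` and `delay` parameters, and the two-second default are all assumptions made for illustration.

```python
import time


def read_with_retry(path, retries=1, delay=2.0, sleep=time.sleep):
    """Read a file, pausing and retrying if it cannot be opened yet.

    Hypothetical helper; `retries` is the number of extra attempts.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            with open(path, "r") as handle:
                return handle.read()
        except (IOError, OSError) as error:
            last_error = error
            # Give the filesystem a moment to catch up before trying again.
            if attempt < retries:
                sleep(delay)
    raise last_error


def collect_job_output(stdout_path, stderr_path, retries=1, delay=2.0,
                       sleep=time.sleep):
    """Read both job files separately so each failure gets its own message."""
    results = {}
    errors = []
    for label, path in (("stdout", stdout_path), ("stderr", stderr_path)):
        try:
            results[label] = read_with_retry(path, retries, delay, sleep)
        except (IOError, OSError) as error:
            errors.append("could not read %s file %s: %s" % (label, path, error))
    return results, errors
```

Handling the two files separately means a failure names the file that could not be opened instead of reporting one generic error. A single retry only helps if the file becomes visible within the delay, so this would need testing on the affected filesystem.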
participants (3)
- Edward Kirton
- Ka Ming Nip
- Peter Cock