thanks for your comments, fellas.
most wrappers just serve to redirect stderr, so i don't think it's the wrapper script itself, but the stdout/stderr files are part of the problem.
the error message is thrown in the finish_job method when it can't open the source/dest stdout/stderr files for reading/writing. i split the try statement to add finer-grained error messages, but i already verified the files do exist, so it seems to be a file system issue.
i suspect it's because the storage i'm using as a staging area has flash drives between the RAM and spinning disks, so upon close, the file buffers may get flushed out of RAM to the SSDs but not immediately be available from the SCSI drives. or maybe the (inode) metadata table hasn't finished updating yet. if so, the problem isn't that the cluster is heavily utilized, but that the filesystem is. this disk is expressly for staging cluster jobs. i'll see if adding a short sleep and a single retry upon error solves this problem... but i won't know immediately as the problem is intermittent. that's the problem with fancy toys; they often come with fancy problems!
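
for the record, the sleep-and-retry idea looks roughly like the sketch below. this is not the actual Galaxy code; open_with_retry and its parameters (retries, delay, sleep) are names i made up for illustration, and the 2 second delay is just a guess at what "short" should mean on this storage.

```python
import time


def open_with_retry(path, mode="r", retries=1, delay=2.0, sleep=time.sleep):
    """Open path; on failure, sleep briefly and try again (hypothetical helper).

    retries is the number of extra attempts after the first failure,
    delay is the pause in seconds between attempts (an assumed value).
    """
    attempt = 0
    while True:
        try:
            return open(path, mode)
        except (IOError, OSError):
            if attempt >= retries:
                # still failing after the retry: let the original error surface
                raise
            attempt += 1
            # give the filesystem a moment to catch up before trying again
            sleep(delay)
```

the idea would be to call this in place of the plain open() on the job's stdout/stderr files, so a file that isn't visible yet gets one more chance before the job is marked as failed.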
On Fri, Jul 29, 2011 at 2:42 AM, Peter Cock
<p.j.a.cock@googlemail.com> wrote:
I also had this error message (I'm currently working out how to
connect our Galaxy to our cluster), and in at least one case it was
caused by a file permission problem - the tool appeared to run but
could not write the output files.