Hi all,

I've been running into some sporadic errors on our cluster while using the
latest development Galaxy, and the error handling has made this quite
difficult to diagnose.

From a user's perspective, the jobs seem to run, get submitted to the
cluster, and finish, and the data looks OK via the 'eye' view icon, but the
dataset is red in the history with:

    0 bytes
    An error occurred running this job: info unavailable

Furthermore, the stdout and stderr via the 'info' icon are blank.

From watching the log (and adding more diagnostic lines), what is happening
is that the job is being split and sent out to the cluster fine, and starts
running. If one of the tasks fails (and this seems to be happening due to
some sort of file system error on our cluster), Galaxy spots this and kills
the remaining tasks. That's good. The problem is it leaves no record of why
the job died.

This is my suggestion for now - it would be nice to go further and fill in
the info text shown in the history peek as well:

$ hg diff
diff -r 4de1d566e9f8 lib/galaxy/jobs/__init__.py
--- a/lib/galaxy/jobs/__init__.py       Fri Sep 21 11:02:50 2012 +0100
+++ b/lib/galaxy/jobs/__init__.py       Fri Sep 21 11:59:27 2012 +0100
@@ -1061,6 +1061,14 @@
             log.error( "stderr for job %d is greater than 32K, only first part will be logged to database" % task.id )
             task.stderr = stderr[:32768]
         task.command_line = self.command_line
+
+        if task.state == task.states.ERROR:
+            # If failed, will kill the other tasks in this job. Record this
+            # task's stdout/stderr as should be useful to explain the failure:
+            job = self.get_job()
+            job.stdout = ("(From one sub-task:)\n" + task.stdout)[:32768]
+            job.stderr = ("(From one sub-task:)\n" + task.stderr)[:32768]
+
         self.sa_session.flush()
         log.debug( 'task %d ended' % self.task_id )

Regards,

Peter
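
P.S. To go further and fill in the info text as well, I imagine something
along these lines inside the same new block could work. This is an untested
sketch, and it assumes the Job model's 'info' attribute is what ends up in
the "An error occurred running this job: ..." message - I haven't verified
that:

        if task.state == task.states.ERROR:
            # As in the patch above, copy the failing task's output onto the job:
            job = self.get_job()
            job.stdout = ("(From one sub-task:)\n" + task.stdout)[:32768]
            job.stderr = ("(From one sub-task:)\n" + task.stderr)[:32768]
            # Assumption: job.info is what the history shows in place of
            # "info unavailable", so give it a short explanation too:
            job.info = "One sub-task failed; see stderr for details"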