Improved error logging in TaskWrapper
Hi all,

I've been running into some sporadic errors on our cluster while using the latest development Galaxy, and the error handling has made this quite difficult to diagnose.

From a user perspective, the jobs seem to run - they get submitted to the cluster and finish, and the data looks OK via the 'eye' (view) icon - but the dataset is red in the history with:

0 bytes
An error occurred running this job: info unavailable

Furthermore, the stdout and stderr via the 'info' icon are blank.

From watching the log (and adding some extra diagnostic lines), what is happening is that the job is being split and sent out to the cluster fine, and starts running. If one of the tasks fails (and this seems to be happening due to some sort of file system error on our cluster), Galaxy spots this and kills the remaining tasks. That's good.

The problem is that it fails to record any explanation of why the job died. This is my suggestion for now - it would be nice to go further and fill in the info text shown in the history panel as well:

$ hg diff
diff -r 4de1d566e9f8 lib/galaxy/jobs/__init__.py
--- a/lib/galaxy/jobs/__init__.py       Fri Sep 21 11:02:50 2012 +0100
+++ b/lib/galaxy/jobs/__init__.py       Fri Sep 21 11:59:27 2012 +0100
@@ -1061,6 +1061,14 @@
             log.error( "stderr for job %d is greater than 32K, only first part will be logged to database" % task.id )
         task.stderr = stderr[:32768]
         task.command_line = self.command_line
+
+        if task.state == task.states.ERROR:
+            # If failed, will kill the other tasks in this job. Record this
+            # task's stdout/stderr as it should be useful to explain the failure:
+            job = self.get_job()
+            job.stdout = ( "(From one sub-task:)\n" + task.stdout )[:32768]
+            job.stderr = ( "(From one sub-task:)\n" + task.stderr )[:32768]
+
         self.sa_session.flush()
         log.debug( 'task %d ended' % self.task_id )

Regards,

Peter
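To illustrate the "fill in the info text" idea mentioned above, here is a rough, untested sketch of a follow-on to the patch (not part of the thread): it assumes the Galaxy model attributes of the time - job.output_datasets, dataset_assoc.dataset and dataset.info are assumed names - and would sit at the same point in TaskWrapper.finish() as the change above.

# Hypothetical follow-on sketch, inside TaskWrapper.finish()
# (assumed attribute names: job.output_datasets, dataset_assoc.dataset, dataset.info)
if task.state == task.states.ERROR:
    job = self.get_job()
    message = "One sub-task failed; its stderr was:\n" + ( task.stderr or "" )
    for dataset_assoc in job.output_datasets:
        dataset = dataset_assoc.dataset
        # This is the text the history panel would show instead of
        # "An error occurred running this job: info unavailable".
        dataset.info = message[:255]  # 255 is a guess at the column limit
    self.sa_session.flush()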
Thanks, Peter! Those are good suggestions. I'll look into it soon.

-Scott
Hi Scott,

I see you've been working on this - it looks very comprehensive:

https://bitbucket.org/galaxy/galaxy-central/changeset/3d07a7800f9a

I can't test this just now, but if I run into any issues with the new code later on, I'll be in touch.

Thanks,

Peter