Improved error logging in TaskWrapper
Hi all,

I've been running into some sporadic errors on our cluster while using the latest development Galaxy, and the error handling has made this quite difficult to diagnose.

From a user perspective, the jobs seem to run - they get submitted to the cluster and finish, and the data looks OK via the 'eye' (view) icon - but the dataset is red in the history with:

0 bytes
An error occurred running this job: info unavailable

Furthermore, the stdout and stderr via the 'info' icon are blank.

From watching the log (and adding some extra diagnostic lines), what is happening is that the job is being split and sent out to the cluster fine, and starts running. If one of the tasks fails (and this seems to be happening due to some sort of file system error on our cluster), Galaxy spots this and kills the remaining tasks. That's good.

The problem is that it fails to record any explanation of why the job died. This is my suggestion for now - it would be nice to go further and fill in the info text shown in the history panel as well:

$ hg diff
diff -r 4de1d566e9f8 lib/galaxy/jobs/__init__.py
--- a/lib/galaxy/jobs/__init__.py       Fri Sep 21 11:02:50 2012 +0100
+++ b/lib/galaxy/jobs/__init__.py       Fri Sep 21 11:59:27 2012 +0100
@@ -1061,6 +1061,14 @@
             log.error( "stderr for job %d is greater than 32K, only first part will be logged to database" % task.id )
         task.stderr = stderr[:32768]
         task.command_line = self.command_line
+
+        if task.state == task.states.ERROR:
+            # If failed, will kill the other tasks in this job. Record this
+            # task's stdout/stderr as it should be useful to explain the failure:
+            job = self.get_job()
+            job.stdout = ( "(From one sub-task:)\n" + task.stdout )[:32768]
+            job.stderr = ( "(From one sub-task:)\n" + task.stderr )[:32768]
+
         self.sa_session.flush()
         log.debug( 'task %d ended' % self.task_id )

Regards,

Peter
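To illustrate the "fill in the info text" idea mentioned above, here is a rough, untested sketch of a follow-on to the patch (not part of the thread): it assumes the Galaxy model attributes of the time - job.output_datasets, dataset_assoc.dataset and dataset.info are assumed names - and would sit at the same point in TaskWrapper.finish() as the change above.

# Hypothetical follow-on sketch, inside TaskWrapper.finish()
# (assumed attribute names: job.output_datasets, dataset_assoc.dataset, dataset.info)
if task.state == task.states.ERROR:
    job = self.get_job()
    message = "One sub-task failed; its stderr was:\n" + ( task.stderr or "" )
    for dataset_assoc in job.output_datasets:
        dataset = dataset_assoc.dataset
        # This is the text the history panel would show instead of
        # "An error occurred running this job: info unavailable".
        dataset.info = message[:255]  # 255 is a guess at the column limit
    self.sa_session.flush()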
Thanks, Peter! Those are good suggestions. I'll look into it soon.

-Scott
Hi Scott,

I see you've been working on this - it looks very comprehensive:

https://bitbucket.org/galaxy/galaxy-central/changeset/3d07a7800f9a

I can't test this just now, but if I run into any issues with the new code later on, I'll be in touch.

Thanks,

Peter