Hi Scott,
I see you've been working on this - it looks very comprehensive:
https://bitbucket.org/galaxy/galaxy-central/changeset/3d07a7800f9a
I can't test this just now, but if I run into any issues with the new
code later on, I'll be in touch.
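
For anyone skimming the thread: the essence of the patch quoted below is
simply to copy one failed sub-task's stdout/stderr up to its parent job,
truncated at the 32K limit Galaxy uses for database storage. A rough
stand-alone sketch of that idea (the Task/Job classes and the helper name
here are simplified stand-ins for illustration, not Galaxy's actual code):

```python
MAX_OUTPUT = 32768  # Galaxy truncates stdout/stderr stored in the DB to 32K


class Task:
    """Simplified stand-in for a Galaxy sub-task."""
    def __init__(self, state, stdout="", stderr=""):
        self.state = state
        self.stdout = stdout
        self.stderr = stderr


class Job:
    """Simplified stand-in for the parent job."""
    def __init__(self):
        self.stdout = None
        self.stderr = None


def record_failed_task_output(job, task):
    # If a sub-task errored, copy its output to the parent job (with a
    # prefix noting the source, and truncated to the 32K limit) so the
    # history item can explain the failure instead of "info unavailable".
    if task.state == "error":
        job.stdout = ("(From one sub-task:)\n" + task.stdout)[:MAX_OUTPUT]
        job.stderr = ("(From one sub-task:)\n" + task.stderr)[:MAX_OUTPUT]


job = Job()
task = Task("error", stderr="IOError: disk quota exceeded")
record_failed_task_output(job, task)
# job.stderr now carries the failed sub-task's error message
```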
Thanks,
Peter
On Fri, Sep 21, 2012 at 8:42 PM, Scott McManus <scottmcmanus(a)gatech.edu> wrote:
Thanks, Peter! Those are good suggestions. I'll look into it soon.
-Scott
----- Original Message -----
> Hi all,
>
> I've been running into some sporadic errors on our Cluster while
> using the latest development Galaxy, and the error handling has
> made this quite difficult to diagnose.
>
> From a user's perspective, the jobs appear to run: they get submitted to
> the cluster and finish, and the data looks OK via the 'eye' view
> icon, but the dataset is red in the history with:
>
> 0 bytes
> An error occurred running this job: info unavailable
>
> Furthermore, the stdout and stderr via the 'info' icon are blank.
>
> From watching the log (and adding more diagnostic lines), what
> is happening is that the job is being split and sent out to the cluster
> fine, and starts running. If one of the tasks fails (and this seems
> to be happening due to some sort of file system error on our
> cluster), Galaxy spots this and kills the rest of the tasks. That's
> good.
>
> The problem is that it fails to record any information about why the
> job died. This is my suggestion for now; it would be nice to go
> further and fill in the info text shown in the history peek as well:
>
> $ hg diff
> diff -r 4de1d566e9f8 lib/galaxy/jobs/__init__.py
> --- a/lib/galaxy/jobs/__init__.py Fri Sep 21 11:02:50 2012 +0100
> +++ b/lib/galaxy/jobs/__init__.py Fri Sep 21 11:59:27 2012 +0100
> @@ -1061,6 +1061,14 @@
>              log.error( "stderr for job %d is greater than 32K, only first part will be logged to database" % task.id )
>              task.stderr = stderr[:32768]
>          task.command_line = self.command_line
> +
> +        if task.state == task.states.ERROR:
> +            # If failed, will kill the other tasks in this job. Record this
> +            # task's stdout/stderr as should be useful to explain the failure:
> +            job = self.get_job()
> +            job.stdout = ( "(From one sub-task:)\n" + task.stdout )[:32768]
> +            job.stderr = ( "(From one sub-task:)\n" + task.stderr )[:32768]
> +
>          self.sa_session.flush()
>          log.debug( 'task %d ended' % self.task_id )
>
>
> Regards,
>
> Peter
>