Thanks, Peter! I'll get to it this afternoon EDT.
-Scott

----- Original Message -----
On Thu, Oct 18, 2012 at 5:19 PM, Scott McManus <scottmcmanus@gatech.edu> wrote:
Hey Peter-
Thanks - I'll look into it. If you're able to reproduce the problem easily and wouldn't mind crafting a pull request, then it would be much appreciated. Otherwise I'll put this on my to-do list to be done soon. I or someone else may want to revisit the exception handling to prevent that from happening.
Thanks!
-Scott
OK then: https://bitbucket.org/galaxy/galaxy-central/pull-request/78/avoid-stall-when...
I can explain what was happening: we had a mount problem. The Galaxy server could talk to SGE and submit jobs, but by the time the jobs came to run, the mount providing their home directory and the Galaxy file system was down, so they failed. Naturally this meant Galaxy got no output files back.
Reading the code, I see that you deliberately attempt to merge any files present (e.g. if 9 out of 10 come back). That does make sense as it could be instructive, as long as the result is flagged as an error, which doesn't seem to be happening.
I think getting zero files back from the split-jobs ought to be an error condition. In fact, failing to get all the expected sub-files back should also be an error condition (although it is still nice to do the merge so the user can see the partial output).
I think a little re-factoring might be needed to treat these explicitly as errors.
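As a rough illustration of the behaviour I mean, here is a minimal sketch. It is not Galaxy's actual code: the names merge_split_outputs and SplitJobError are hypothetical, and simple concatenation stands in for whatever merge the datatype really needs. Zero files back is a hard error; a partial set is still merged, but the caller is told it is an error.

```python
import os


class SplitJobError(Exception):
    """Hypothetical exception: no sub-job output came back at all."""


def merge_split_outputs(expected_paths, merged_path):
    """Merge whatever sub-job outputs exist and report missing ones.

    Returns (is_error, missing). Raises SplitJobError if nothing
    came back, since there is then nothing to merge.
    """
    present = [p for p in expected_paths if os.path.exists(p)]
    missing = [p for p in expected_paths if not os.path.exists(p)]

    if not present:
        raise SplitJobError(
            "none of the %d expected sub-job outputs were found"
            % len(expected_paths)
        )

    # Still do the merge so the user can inspect partial output.
    # Plain concatenation is an assumption made for this sketch.
    with open(merged_path, "wb") as out:
        for path in present:
            with open(path, "rb") as part:
                out.write(part.read())

    # Any missing sub-file means the job as a whole is flagged as failed.
    return bool(missing), missing
```

The caller would then mark the dataset as errored whenever is_error is true or SplitJobError is raised, rather than treating a successful merge as a successful job.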
Regards,
Peter