On Jun 14, 2012, at 12:48 PM, Peter Cock wrote:
In a separate example with 33 sub-tasks, there were two of these inversions, while in yet another example with 33 sub-tasks there was a trio submitted out of order. This non-deterministic behavior is a little surprising, but in itself not an immediate problem.
You're correct in that submission order shouldn't matter at all, but I'll take a look and see if I can come up with an explanation for why.
In what appears to be a separate (and more concerning) loss of order, after merging, the output file order appears randomized. I would expect the output from task_0, then task_1, ..., finally task_16. I haven't yet worked out what order I am getting, but it isn't this, and neither is it the order from the SGE job numbers (e.g. correct bar one pair switched round).
This would be happening in the merge. It looks like changeset c959d32f2405 might be the culprit for this -- it doesn't explicitly reorder by task number in the merge method, which would lead to (I'm guessing) an alphanumeric sort, where task_10 comes before task_2. I'll test and fix this.
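To illustrate the guess above, here is a minimal sketch of how an alphanumeric sort differs from a sort by task number. This is not the actual merge code from Galaxy; the task_N naming follows the example in this thread, and the helper task_number is hypothetical.

```python
import re


def task_number(name):
    """Return the trailing integer of a name such as 'task_12'.

    Hypothetical helper for illustration; not part of Galaxy.
    """
    match = re.search(r"(\d+)$", name)
    if match is None:
        raise ValueError("no task number in %r" % name)
    return int(match.group(1))


# Seventeen sub-tasks, task_0 through task_16, as in the example above.
task_dirs = ["task_%d" % i for i in range(17)]

# A plain string sort compares character by character,
# so 'task_10' lands before 'task_2'.
lexicographic = sorted(task_dirs)

# Sorting on the integer suffix restores the expected order.
numeric = sorted(task_dirs, key=task_number)
```

With a plain string sort the list starts task_0, task_1, task_10, task_11, and task_2 only appears after task_16. Sorting with an explicit numeric key avoids depending on how the names happen to compare as strings.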
[*] P.S. I would like to see an upper bound on the sleep_time in method run_job, say half an hour? Otherwise, with a group of long-running jobs, it seems Galaxy may end up waiting a very long time between checks for their completion, since it just doubles the wait at each check. I had sometimes noticed a delay between the sub-jobs finishing according to the cluster and Galaxy doing anything about merging them - this is probably why.
This sleep time should currently cap at 8 seconds.
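As a rough sketch of the behavior described, doubling the wait with an upper bound looks something like the following. This is not the code from run_job; the function names and the 1 second starting value are assumptions for illustration, and only the doubling and the 8 second cap come from this thread.

```python
def next_sleep_time(current, cap=8.0):
    """Double the wait, but never exceed the cap (in seconds)."""
    return min(current * 2, cap)


def sleep_schedule(start=1.0, cap=8.0, checks=6):
    """Return the waits used for the first `checks` polls.

    `start` is an assumed initial wait; nothing is actually slept here.
    """
    times = []
    wait = start
    for _ in range(checks):
        times.append(wait)
        wait = next_sleep_time(wait, cap)
    return times
```

With these assumed defaults, sleep_schedule() gives 1, 2, 4, 8, 8, 8 seconds, so the gap between checks stops growing once it reaches the cap. Passing a larger cap (e.g. cap=1800.0 for half an hour) would give the bound suggested in the P.S. instead.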