Loss of sub-task order with <parallelism>

Dear all,

Are there any known issues with the job-splitting code (i.e. the new <parallelism> tags in the tool wrappers) and the order of the sub-jobs? I've noticed two apparent problems here on our production Galaxy (a bit old now, 6799:40f1816d6857 from 7 March).

I added a diagnostic print statement to the jobs' stdout giving the node number and SGE job number. When viewing the combined stdout in Galaxy, the SGE job numbers should (I think) be strictly increasing. That isn't always the case - e.g. here task_7 was added to the queue before task_6:

/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_0: Running on n3 as job 27700
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_1: Running on n12 as job 27701
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_2: Running on n8 as job 27702
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_3: Running on n6 as job 27703
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_4: Running on n11 as job 27704
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_5: Running on n10 as job 27705
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_6: Running on n4 as job 27707
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_7: Running on n5 as job 27706
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_8: Running on n9 as job 27708
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_9: Running on n7 as job 27709
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_10: Running on n12 as job 27710
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_11: Running on n9 as job 27711
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_12: Running on n6 as job 27712
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_13: Running on n7 as job 27713
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_14: Running on n4 as job 27714
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_15: Running on n8 as job 27715
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_16: Running on n10 as job 27716

In a separate example with 33 sub-tasks, there were two of these inversions, while in yet another example with 33 sub-tasks there was a trio submitted out of order. This non-deterministic behavior is a little surprising, but in itself not an immediate problem.

In what appears to be a separate (and more concerning) loss of order, after merging the output file order appears randomized. I would expect the output from task_0, then task_1, ..., finally task_16. I haven't yet worked out what order I am getting, but it isn't this, and neither is it the order from the SGE job numbers (e.g. correct bar one pair switched round). Having looked at lib/galaxy/jobs/runners/tasks.py, the source of this behaviour currently eludes me [*]. Has anyone else observed anything like this before?

Regards,

Peter

[*] P.S. I would like to see an upper bound on the sleep_time in method run_job, say half an hour? Otherwise with a group of long running jobs it seems Galaxy may end up waiting a very long time between checks for their completion since it just doubles the wait at each point. I had sometimes noticed a delay between the sub-jobs finishing according to the cluster and Galaxy doing anything about merging it - this is probably why.
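As a minimal sketch of the capped polling loop being asked for in the P.S. (not Galaxy's actual run_job code - the function name, the completion check and the 30-minute cap are all just placeholders):

    import time

    def wait_for_subtasks(all_tasks_finished, max_sleep=1800):
        """Poll for sub-task completion, doubling the wait between checks
        but never sleeping longer than max_sleep seconds."""
        sleep_time = 1
        while not all_tasks_finished():
            time.sleep(sleep_time)
            # Double the wait, but clamp it so a batch of long-running
            # sub-jobs still gets re-checked at least every max_sleep seconds.
            sleep_time = min(sleep_time * 2, max_sleep)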

On Jun 14, 2012, at 12:48 PM, Peter Cock wrote:
> In a separate example with 33 sub-tasks, there were two of these inversions, while in yet another example with 33 sub-tasks there was a trio submitted out of order. This non-deterministic behavior is a little surprising, but in itself not an immediate problem.
You're correct in that submission order shouldn't matter at all, but I'll take a look and see if I can come up with an explanation for why.
> In what appears to be a separate (and more concerning) loss of order, after merging the output file order appears randomized. I would expect the output from task_0, then task_1, ..., finally task_16. I haven't yet worked out what order I am getting, but it isn't this, and neither is it the order from the SGE job numbers (e.g. correct bar one pair switched round).
This would be happening in the merge. It looks like changeset c959d32f2405 might be the culprit for this -- it doesn't explicitly reorder by task number in the merge method, which would lead to (I'm guessing) an alphanumeric sort. I'll test and fix this.
> [*] P.S. I would like to see an upper bound on the sleep_time in method run_job, say half an hour? Otherwise with a group of long running jobs it seems Galaxy may end up waiting a very long time between checks for their completion since it just doubles the wait at each point. I had sometimes noticed a delay between the sub-jobs finishing according to the cluster and Galaxy doing anything about merging it - this is probably why.
This sleep time should currently cap at 8 seconds.
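To illustrate the alphanumeric-sort guess above (a standalone sketch, not the actual merge code in tasks.py): sorting the task_N working directory names as plain strings puts task_10 before task_2, whereas sorting on the numeric suffix restores the intended order:

    # Sketch only: plain string sort vs. sorting on the task number.
    task_dirs = ["task_0", "task_1", "task_2", "task_10", "task_11"]

    print(sorted(task_dirs))
    # -> ['task_0', 'task_1', 'task_10', 'task_11', 'task_2']   (alphanumeric)

    print(sorted(task_dirs, key=lambda name: int(name.rsplit("_", 1)[1])))
    # -> ['task_0', 'task_1', 'task_2', 'task_10', 'task_11']   (by task number)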

On Thu, Jun 14, 2012 at 6:15 PM, Dannon Baker <dannonbaker@me.com> wrote:
> On Jun 14, 2012, at 12:48 PM, Peter Cock wrote:
>> In a separate example with 33 sub-tasks, there were two of these inversions, while in yet another example with 33 sub-tasks there was a trio submitted out of order. This non-deterministic behavior is a little surprising, but in itself not an immediate problem.
> You're correct in that submission order shouldn't matter at all, but I'll take a look and see if I can come up with an explanation for why.
Great.
>> In what appears to be a separate (and more concerning) loss of order, after merging the output file order appears randomized. I would expect the output from task_0, then task_1, ..., finally task_16. I haven't yet worked out what order I am getting, but it isn't this, and neither is it the order from the SGE job numbers (e.g. correct bar one pair switched round).
> This would be happening in the merge. It looks like changeset c959d32f2405 might be the culprit for this -- it doesn't explicitly reorder by task number in the merge method, which would lead to (I'm guessing) an alphanumeric sort. I'll test and fix this.
I hope you're right - I've not stuck any debugging code in to try and pinpoint this yet (maybe tomorrow).
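(If sticking debugging code in, something as simple as logging each per-task output file as it is appended would show the actual merge order - a hypothetical snippet, not taken from tasks.py:)

    import logging

    logging.basicConfig(level=logging.DEBUG)
    log = logging.getLogger("merge-debug")

    # Hypothetical stand-in for the per-task output files the merge step sees.
    output_parts = ["task_0/output.dat", "task_10/output.dat", "task_2/output.dat"]

    for part in output_parts:
        # Logging each path as it is appended makes the merge order visible in the log.
        log.debug("Appending %s to the merged output", part)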
>> [*] P.S. I would like to see an upper bound on the sleep_time in method run_job, say half an hour? Otherwise with a group of long running jobs it seems Galaxy may end up waiting a very long time between checks for their completion since it just doubles the wait at each point. I had sometimes noticed a delay between the sub-jobs finishing according to the cluster and Galaxy doing anything about merging it - this is probably why.
> This sleep time should currently cap at 8 seconds.
Doh. Yes, that makes sense.

Peter