Loss of sub-task order with <parallelism>

Dear all,

Are there any known issues with the job-splitting code (i.e. the new <parallelism> tags in the tool wrappers) and the order of the sub-jobs? I've noticed two apparent problems here on our production Galaxy (a bit old now, 6799:40f1816d6857 from 7 March).

I added a diagnostic print statement to the jobs' stdout giving the node number and SGE job number. When viewing the combined stdout in Galaxy, the SGE job numbers should (I think) be strictly increasing. That isn't always the case - e.g. here task_7 was added to the queue before task_6:

/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_0: Running on n3 as job 27700
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_1: Running on n12 as job 27701
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_2: Running on n8 as job 27702
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_3: Running on n6 as job 27703
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_4: Running on n11 as job 27704
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_5: Running on n10 as job 27705
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_6: Running on n4 as job 27707
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_7: Running on n5 as job 27706
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_8: Running on n9 as job 27708
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_9: Running on n7 as job 27709
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_10: Running on n12 as job 27710
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_11: Running on n9 as job 27711
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_12: Running on n6 as job 27712
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_13: Running on n7 as job 27713
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_14: Running on n4 as job 27714
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_15: Running on n8 as job 27715
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_16: Running on n10 as job 27716

In a separate example with 33 sub-tasks, there were two of these inversions, while in yet another example with 33 sub-tasks there was a trio submitted out of order. This non-deterministic behavior is a little surprising, but in itself not an immediate problem.

In what appears to be a separate (and more concerning) loss of order, after merging the output file order appears randomized. I would expect the output from task_0, then task_1, ..., finally task_16. I haven't yet worked out what order I am getting, but it isn't this, and neither is it the order from the SGE job numbers (e.g. correct bar one pair switched round). Having looked at lib/galaxy/jobs/runners/tasks.py, the source of this behaviour currently eludes me [*]. Has anyone else observed anything like this before?

Regards,

Peter

[*] P.S. I would like to see an upper bound on the sleep_time in method run_job, say half an hour? Otherwise with a group of long running jobs it seems Galaxy may end up waiting a very long time between checks for their completion since it just doubles the wait at each point. I had sometimes noticed a delay between the sub-jobs finishing according to the cluster and Galaxy doing anything about merging it - this is probably why.
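As a minimal sketch of the capped polling loop being asked for in the P.S. (not Galaxy's actual run_job code - the function name, the completion check and the 30-minute cap are all just placeholders):

    import time

    def wait_for_subtasks(all_tasks_finished, max_sleep=1800):
        """Poll for sub-task completion, doubling the wait between checks
        but never sleeping longer than max_sleep seconds."""
        sleep_time = 1
        while not all_tasks_finished():
            time.sleep(sleep_time)
            # Double the wait, but clamp it so a batch of long-running
            # sub-jobs still gets re-checked at least every max_sleep seconds.
            sleep_time = min(sleep_time * 2, max_sleep)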

On Jun 14, 2012, at 12:48 PM, Peter Cock wrote:
> In a separate example with 33 sub-tasks, there were two of these inversions, while in yet another example with 33 sub-tasks there was a trio submitted out of order. This non-deterministic behavior is a little surprising, but in itself not an immediate problem.
You're correct in that submission order shouldn't matter at all, but I'll take a look and see if I can come up with an explanation for why.
> In what appears to be a separate (and more concerning) loss of order, after merging the output file order appears randomized. I would expect the output from task_0, then task_1, ..., finally task_16. I haven't yet worked out what order I am getting, but it isn't this, and neither is it the order from the SGE job numbers (e.g. correct bar one pair switched round).
This would be happening in the merge. It looks like changeset c959d32f2405 might be the culprit for this -- it doesn't explicitly reorder by task number in the merge method, which would lead to (I'm guessing) an alphanumeric sort. I'll test and fix this.
> [*] P.S. I would like to see an upper bound on the sleep_time in method run_job, say half an hour? Otherwise with a group of long running jobs it seems Galaxy may end up waiting a very long time between checks for their completion since it just doubles the wait at each point. I had sometimes noticed a delay between the sub-jobs finishing according to the cluster and Galaxy doing anything about merging it - this is probably why.
This sleep time should currently cap at 8 seconds.
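To illustrate the alphanumeric-sort guess above (a standalone sketch, not the actual merge code in tasks.py): sorting the task_N working directory names as plain strings puts task_10 before task_2, whereas sorting on the numeric suffix restores the intended order:

    # Sketch only: plain string sort vs. sorting on the task number.
    task_dirs = ["task_0", "task_1", "task_2", "task_10", "task_11"]

    print(sorted(task_dirs))
    # -> ['task_0', 'task_1', 'task_10', 'task_11', 'task_2']   (alphanumeric)

    print(sorted(task_dirs, key=lambda name: int(name.rsplit("_", 1)[1])))
    # -> ['task_0', 'task_1', 'task_2', 'task_10', 'task_11']   (by task number)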

On Thu, Jun 14, 2012 at 6:15 PM, Dannon Baker <dannonbaker@me.com> wrote:
> On Jun 14, 2012, at 12:48 PM, Peter Cock wrote:
>> In a separate example with 33 sub-tasks, there were two of these inversions, while in yet another example with 33 sub-tasks there was a trio submitted out of order. This non-deterministic behavior is a little surprising, but in itself not an immediate problem.
> You're correct in that submission order shouldn't matter at all, but I'll take a look and see if I can come up with an explanation for why.
Great.
>> In what appears to be a separate (and more concerning) loss of order, after merging the output file order appears randomized. I would expect the output from task_0, then task_1, ..., finally task_16. I haven't yet worked out what order I am getting, but it isn't this, and neither is it the order from the SGE job numbers (e.g. correct bar one pair switched round).
> This would be happening in the merge. It looks like changeset c959d32f2405 might be the culprit for this -- it doesn't explicitly reorder by task number in the merge method, which would lead to (I'm guessing) an alphanumeric sort. I'll test and fix this.
I hope you're right - I've not stuck any debugging code in to try and pinpoint this yet (maybe tomorrow).
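(If sticking debugging code in, something as simple as logging each per-task output file as it is appended would show the actual merge order - a hypothetical snippet, not taken from tasks.py:)

    import logging

    logging.basicConfig(level=logging.DEBUG)
    log = logging.getLogger("merge-debug")

    # Hypothetical stand-in for the per-task output files the merge step sees.
    output_parts = ["task_0/output.dat", "task_10/output.dat", "task_2/output.dat"]

    for part in output_parts:
        # Logging each path as it is appended makes the merge order visible in the log.
        log.debug("Appending %s to the merged output", part)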
>> [*] P.S. I would like to see an upper bound on the sleep_time in method run_job, say half an hour? Otherwise with a group of long running jobs it seems Galaxy may end up waiting a very long time between checks for their completion since it just doubles the wait at each point. I had sometimes noticed a delay between the sub-jobs finishing according to the cluster and Galaxy doing anything about merging it - this is probably why.
> This sleep time should currently cap at 8 seconds.
Doh. Yes, that makes sense.

Peter