Hello all,

Currently the Galaxy experimental task splitting code allows splitting into N chunks, e.g. 8 parts, with:

<parallelism method="multi" split_mode="number_of_parts" split_size="8" ... />

Or, into chunks of at most size N (units depend on the file type, e.g. lines in a tabular file or number of sequences in FASTA/FASTQ), e.g. at most 1000 sequences:

<parallelism method="multi" split_mode="to_size" split_size="1000" ... />

As an aside, I found it confusing that the meaning of the "split_size" attribute depends on the "split_mode" (number of jobs, or size of jobs).

I would prefer to be able to set both sizes - in this case, tell Galaxy to try to use at least 8 parts, each of at most 1000 sequences. Thus in a BLAST task, initially the split would be (up to) eight ways:

8 queries => 8 jobs each with 1 query
80 queries => 8 jobs each with 10 queries
800 queries => 8 jobs each with 100 queries
8000 queries => 8 jobs each with 1000 queries

Then, once the maximum chunk size comes into play, you'd just get more jobs:

9000 queries => 9 jobs each with 1000 queries
10000 queries => 10 jobs each with 1000 queries
20000 queries => 20 jobs each with 1000 queries
etc.

The appeal of this is that it takes advantage of parallelism for both small jobs (under 100 queries) and large jobs (1000s of queries), while still allowing a maximum size to be imposed on each cluster job.

The problem is that this requires changing the XML tags, getting rid of the current two modes in favour of this combined one. Perhaps this:

<parallelism method="multi" min_jobs="8" max_size="1000" ... />

The jobs threshold isn't strictly a minimum - if you have N < 8 query sequences, you'd just get N jobs of 1 query each.

Does this sound sufficiently general? The split code is still rather experimental, so I don't expect breaking the API to be a big issue (not many people are using it).

Peter
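
P.S. To make the intended behaviour concrete, here is a rough Python sketch of how the combined mode could pick the number of jobs and the chunk size. The function name plan_split and its arguments min_jobs/max_size are just illustrative - this is not actual Galaxy code, only the arithmetic behind the figures above:

import math

def plan_split(total, min_jobs=8, max_size=1000):
    """Return (number_of_jobs, records_per_chunk) for the proposed combined mode.

    'total' is the number of records (e.g. query sequences) in the input.
    Illustrative sketch only, not actual Galaxy code.
    """
    if total <= 0:
        return 0, 0
    if total <= min_jobs:
        # Fewer records than the target job count: one record per job.
        return total, 1
    # Aim for min_jobs parts, but never exceed max_size records per part.
    chunk = min(math.ceil(total / min_jobs), max_size)
    jobs = math.ceil(total / chunk)
    return jobs, chunk

# Matches the examples above (the last chunk may be smaller):
# plan_split(8)     -> (8, 1)
# plan_split(800)   -> (8, 100)
# plan_split(9000)  -> (9, 1000)
# plan_split(20000) -> (20, 1000)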