Hello all,

Currently the Galaxy experimental task splitting code allows splitting into N chunks, e.g. 8 parts, with:

<parallelism method="multi" split_mode="number_of_parts" split_size="8" ... />

Or, into chunks of at most size N (units dependent on the file type, e.g. lines in a tabular file or number of sequences in FASTA/FASTQ), e.g. at most 1000 sequences:

<parallelism method="multi" split_mode="to_size" split_size="1000" ... />

As an aside, I found it confusing that the meaning of the "split_size" attribute depends on the "split_mode" (number of jobs, or size of jobs).

I would prefer to be able to set both sizes - in this case, tell Galaxy to try to use at least 8 parts, each of at most 1000 sequences. Thus in a BLAST task, initially the split would be (up to) eight ways:

8 queries => 8 jobs each with 1 query
80 queries => 8 jobs each with 10 queries
800 queries => 8 jobs each with 100 queries
8000 queries => 8 jobs each with 1000 queries

Then, once the maximum chunk size comes into play, you'd just get more jobs:

9000 queries => 9 jobs each with 1000 queries
10000 queries => 10 jobs each with 1000 queries
20000 queries => 20 jobs each with 1000 queries
etc.

The appeal of this is that it takes advantage of parallelism for both small jobs (under 100 queries) and large jobs (1000s of queries), while still being able to impose a maximum size on each cluster job.

The problem is that this requires changing the XML tags, getting rid of the current two modes in favour of this combined one. Perhaps this:

<parallelism method="multi" min_jobs="8" max_size="1000" ... />

The jobs threshold isn't strictly a minimum - if you have N < 8 query sequences, you'd just get N jobs of 1 query each.

Does this sound sufficiently general? The split code is still rather experimental, so I don't expect breaking the API to be a big issue (not many people are using it).

Peter
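For concreteness, here is roughly what the combined rule boils down to in Python - a minimal sketch with made-up names (plan_split and ceil_div are illustrations, not Galaxy's actual split code):

def ceil_div(a, b):
    # Integer ceiling division, avoiding floats
    return -(-a // b)

def plan_split(n_records, min_jobs=8, max_size=1000):
    # Aim for min_jobs chunks, but never more than max_size records per chunk
    if n_records <= 0:
        return 0, 0
    jobs = min(min_jobs, n_records)   # fewer records than jobs => one record each
    chunk = ceil_div(n_records, jobs)
    if chunk > max_size:              # size cap reached: add jobs instead
        chunk = max_size
        jobs = ceil_div(n_records, chunk)
    return jobs, chunk

# Reproduces the worked examples above, e.g.:
# plan_split(80)    -> (8, 10)
# plan_split(9000)  -> (9, 1000)
# plan_split(20000) -> (20, 1000)

When the counts don't divide evenly, the final chunk would simply be smaller than the others.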
+1. This is especially useful for us, with hardware-accelerated algorithms having limits on input size.

On 12-05-03 9:51 AM, "Peter Cock" <p.j.a.cock@googlemail.com> wrote:
> ...
On Thu, May 3, 2012 at 6:01 PM, Paul Gordon <gordonp@ucalgary.ca> wrote:
+1. This is especially useful for us, with hardware-accelerated algorithms having limits on input size.
I've got this working with FASTA files on our Galaxy at the moment, and touch wood it is behaving nicely. The code doesn't yet handle splitting other input file formats, which would be required before applying this to the trunk - but some feedback on whether the Galaxy team are keen on this direction would be appreciated:

https://bitbucket.org/peterjc/galaxy-central/changeset/aa98de8effd1

(I'll have to sort out the branches since this is now mixed up with BLAST database work... but that is an aside.)

Peter
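For reference, the FASTA case itself is straightforward - something along these lines, where split_fasta is an illustrative sketch rather than the code in the changeset above (it assumes a well-formed file whose first line starts with ">"):

def split_fasta(path, chunk_size, out_template="part%03d.fasta"):
    # Write chunks of at most chunk_size sequences to numbered files
    count, part, out = 0, 0, None
    for line in open(path):
        if line.startswith(">"):            # a new sequence record starts
            if count % chunk_size == 0:     # current chunk is full (or first record)
                if out:
                    out.close()
                part += 1
                out = open(out_template % part, "w")
            count += 1
        out.write(line)
    if out:
        out.close()
    return part  # number of chunk files written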