Splitting large jobs over multiple nodes/CPUs?
Hi all,

The comments on this issue suggest that the Galaxy team is/were working on splitting large jobs over multiple nodes/CPUs:
https://bitbucket.org/galaxy/galaxy-central/issue/79/split-large-jobs

Is there any relevant page on the wiki I should be aware of?

Specifically I am hoping for a general framework where one of the tool inputs can be marked as "embarrassingly parallel", meaning it can be subdivided easily (e.g. multiple sequences in FASTA or FASTQ format, multiple annotations in BED format, multiple lines in tabular format) and the outputs can all be easily combined (e.g. by concatenation in the same order as the input was split).

Thanks,

Peter
It's definitely an experimental feature at this point, and there's no wiki, but basic support for breaking jobs into tasks does exist. It needs a lot more work and could go in a few different directions to make it better, but check out the wrappers with <parallelism> defined, enable use_tasked_jobs in your universe_wsgi.ini, and restart. That's all it should take from a fresh Galaxy install to get, IIRC, at least BWA and a few other tools working. If you want a super trivial example to play with, change the tool XML for a text tool like "change case" to have <parallelism method="basic"></parallelism> and give that a shot.

If you decide to try this out, do keep in mind that this feature is not at all complete, and while there's a long list of things we still want to experiment with along these lines, suggestions (and especially contributions) are absolutely welcome.

-Dannon

On Feb 15, 2012, at 11:36 AM, Peter Cock wrote:
Hi all,
The comments on this issue suggest that the Galaxy team is/were working on splitting large jobs over multiple nodes/CPUs:
https://bitbucket.org/galaxy/galaxy-central/issue/79/split-large-jobs
Is there any relevant page on the wiki I should be aware of?
Specifically I am hoping for a general framework where one of the tool inputs can be marked as "embarrassingly parallel" meaning it can be subdivided easily (e.g. multiple sequences in FASTA or FASTQ format, multiple annotations in BED format, multiple lines in tabular format) and the outputs can all be easily combined (e.g. by concatenation in the same order as the input was split).
Thanks,
Peter
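For anyone wanting to try this, the recipe above boils down to two small edits - shown roughly below; the exact file name and placement of the tag for the "change case" tool may differ in your checkout:

    # In universe_wsgi.ini:
    use_tasked_jobs = True

    <!-- In the chosen tool's XML, just inside the <tool> element: -->
    <parallelism method="basic"></parallelism>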
Ah, was just about to ask about this as well, nice to know something is already in place (as experimental as it might be). Thanks Dannon!

chris
On Wed, Feb 15, 2012 at 5:08 PM, Dannon Baker <dannonbaker@me.com> wrote:
It's definitely an experimental feature at this point, and there's no wiki, but basic support for breaking jobs into tasks does exist. It needs a lot more work and can go in a few different directions to make it better,
Not what I was hoping to hear, but a promising start :)
but check out the wrappers with <parallelism> defined, and enable use_tasked_jobs in your universe_wsgi.ini and restart. That's all it should take from a fresh Galaxy install to get, IIRC, at least BWA and a few other tools working. If you want a super trivial example to play with, change the tool .xml for a text tool like "change case" to have <parallelism method="basic"></parallelism> and give that a shot.
Excellent - that saved me searching blindly.

$ cd tools
$ grep parallelism */*.xml
samtools/sam_bitwise_flag_filter.xml: <parallelism method="basic"></parallelism>
sr_mapping/bowtie_wrapper.xml: <parallelism method="basic"></parallelism>
sr_mapping/bwa_color_wrapper.xml: <parallelism method="basic"></parallelism>
sr_mapping/bwa_wrapper.xml: <parallelism method="basic"></parallelism>

Are those four tools being used on Galaxy Main already with this basic parallelism in place?

Looking at the code in lib/galaxy/jobs/splitters/basic.py, its comments suggest it only works on tools with one input and one output file (although that seems a bit fuzzy - you could be using BWA with a FASTA history item as the reference; would that fail?).

I also see interesting things in lib/galaxy/jobs/splitters/multi.py. Is that even more experimental? It looks like it could be used to say that BWA's read file should be split but the reference file shared.

Regarding the merging of the output, I see there is a default merge method in lib/galaxy/datatypes/data.py which just concatenates the files. I am surprised at that - it seems like a very bad idea in general; consider binary files, or XML. Why not make this the default for text and subclasses thereof only?

There is also one example where the merge method gets overridden, in lib/galaxy/datatypes/tabular.py, which avoids repeating headers when merging SAM files. That should be enough clues to implement customized merge code for other datatypes.
If you decide to try this out, do keep in mind that this feature is not at all complete, and while there's a long list of things we still want to experiment with along these lines, suggestions (and especially contributions) are absolutely welcome.
OK then, I hope to have a play with this shortly. Thanks, Peter
Are those four tools being used on Galaxy Main already with this basic parallelism in place?

Main still runs these jobs in the standard non-split fashion, and as a resource that is occasionally saturated (and thus doesn't necessarily have extra resources to parallelize to) it will probably continue doing so as long as there's significant overhead involved in splitting the files. Fancy scheduling could minimize the issue, but as it is, during heavy load you would actually have lower total throughput due to the splitting overhead.

Looking at the code in lib/galaxy/jobs/splitters/basic.py, its comments suggest it only works on tools with one input and one output file (although that seems a bit fuzzy - you could be using BWA with a FASTA history item as the reference; would that fail?).

I haven't tried it, but probably.

I also see interesting things in lib/galaxy/jobs/splitters/multi.py. Is that even more experimental? It looks like it could be used to say that BWA's read file should be split but the reference file shared.

Yes.

Regarding the merging of the output, I see there is a default merge method in lib/galaxy/datatypes/data.py which just concatenates the files. I am surprised at that - it seems like a very bad idea in general; consider binary files, or XML. Why not make this the default for text and subclasses thereof only?

I can't think of a better reasonable default behavior for "Data", though you're obviously right that each datatype subclass will need to define particular behaviors for merging files.

OK then, I hope to have a play with this shortly.

Good luck, let me know how it goes, and again - contributions are certainly welcome :)

-Dannon
On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker <dannonbaker@me.com> wrote:
Main still runs these jobs in the standard non-split fashion, and as a resource that is occasionally saturated (and thus doesn't necessarily have extra resources to parallelize to) will probably continue doing so as long as there's significant overhead involved in splitting the files. Fancy scheduling could minimize the issue, but as it is during heavy load you would actually have lower total throughput due to the splitting overhead.
Because the splitting (currently) happens on the main server?
Regarding the merging of the output, I see there is a default merge method in lib/galaxy/datatypes/data.py which just concatenates the files. I am surprised at that - it seems like a very bad idea in general; consider binary files, or XML. Why not make this the default for text and subclasses thereof only?
I can't think of a better reasonable default behavior for "Data", though you're obviously right that each datatype subclass will need to define particular behaviors for merging files.
The default should raise an error (and better yet, refuse to do the split in the first place). Zen of Python: In the face of ambiguity, refuse the temptation to guess. Peter
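As a rough sketch of what Peter is suggesting - hypothetical code, not the current Galaxy implementation - the base datatype would refuse to merge, and only text-like datatypes would inherit the simple concatenation:

    import shutil

    class Data(object):
        @staticmethod
        def merge(split_files, output_file):
            # Refuse to guess how to combine arbitrary (possibly binary) data.
            raise NotImplementedError("Merging is not supported for this datatype")

    class Text(Data):
        @staticmethod
        def merge(split_files, output_file):
            # Plain concatenation, in the same order the input was split.
            with open(output_file, "wb") as out:
                for filename in split_files:
                    with open(filename, "rb") as handle:
                        shutil.copyfileobj(handle, out)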
On Feb 16, 2012, at 5:15 AM, Peter Cock wrote:
On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker <dannonbaker@me.com> wrote:
Main still runs these jobs in the standard non-split fashion, and as a resource that is occasionally saturated (and thus doesn't necessarily have extra resources to parallelize to) will probably continue doing so as long as there's significant overhead involved in splitting the files. Fancy scheduling could minimize the issue, but as it is during heavy load you would actually have lower total throughput due to the splitting overhead.
Because the splitting (currently) happens on the main server?
No, because the splitting process is work that has to happen somewhere. Ignoring possible benefits from things that haven't been implemented yet, in a situation where your cluster is saturated with work you are unable to take advantage of the parallelism, and splitting files apart only adds more work, reducing total job throughput. The fact that splitting always happens on the head node is not ideal, and needs to be configurable. I have a fork somewhere that attempts to address this, but it needs work.
On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker <dannonbaker@me.com> wrote:
Good luck, let me know how it goes, and again - contributions are certainly welcome :)
I think I found the first bug: the split method in lib/galaxy/datatypes/sequence.py for the Sequence class assumes four lines per sequence. This would make sense as the split method of the Fastq class (after grooming to remove any line wrapping) but is a very bad idea for most sequence file formats (e.g. FASTA).

It looks like a little refactoring is needed: define a Sequence split method which raises NotImplementedError, move the current code to the Fastq class, then write something similar but allowing multiple lines per record for the Fasta class.

Does that sound reasonable? I'll do this on a new branch for review...

Peter
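For what it's worth, a skeleton of the layout being suggested - purely illustrative, with made-up method signatures rather than the real Galaxy ones:

    class Sequence(object):
        def split(self, input_file, chunk_size, output_dir):
            # No generic record structure to rely on, so refuse rather than guess.
            raise NotImplementedError("Splitting must be implemented per sequence format")

    class Fastq(Sequence):
        def split(self, input_file, chunk_size, output_dir):
            # Groomed FASTQ is exactly four lines per record, so a chunk is
            # simply chunk_size * 4 lines (essentially the existing code, moved here).
            pass

    class Fasta(Sequence):
        def split(self, input_file, chunk_size, output_dir):
            # FASTA records span a variable number of lines; a new chunk may
            # only begin at a '>' header line.
            pass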
On Thu, Feb 16, 2012 at 10:47 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker <dannonbaker@me.com> wrote:
Good luck, let me know how it goes, and again - contributions are certainly welcome :)
I think I found the first bug, method split in lib/galaxy/datatypes/sequence.py for class Sequence assumes four lines per sequence. This would make sense as the split method of the Fastq class (after grooming to remove any line wrapping) but is a very bad idea on most sequence file formats (e.g. FASTA).
It looks like a little refactoring is needed, defining a Sequence split method which raises not implemented, and moving the current code to the Fastq class, then writing something similar but allowing multiple lines per record for the Fasta class.
Does that sound reasonable? I'll do this on a new branch for review...
Refactoring lib/galaxy/datatypes/sequence.py split method here, https://bitbucket.org/peterjc/galaxy-central/changeset/762777618073 This is part of a work-in-progress "split_blast" branch to try splitting BLAST jobs, for which I will need to split FASTA files as inputs, and also merge BLAST XML output: https://bitbucket.org/peterjc/galaxy-central/src/split_blast Peter
Hi Dan,

I think I need a little more advice - what is the role of the script scripts/extract_dataset_part.py and the JSON files created when splitting FASTQ files in lib/galaxy/datatypes/sequence.py, and then used by the class' process_split_file method? Why is there no JSON file created by the base data class in lib/galaxy/datatypes/data.py, and no process_split_file method? Is the JSON thing part of a partial and unfinished rewrite of the splitter code?

On the assumption that not all splitters bother with the JSON, I am trying a little hack to scripts/extract_dataset_part.py to abort silently if there is no JSON file:
https://bitbucket.org/peterjc/galaxy-central/changeset/ebe94a2c25c3

This seems to be working with my current attempt at a FASTA splitter (not checked in yet, only partly implemented and tested).

Peter
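The hack is essentially a guard at the top of that script - roughly like this (paraphrasing; the actual changeset may differ in detail):

    import os
    import sys

    json_filename = sys.argv[1]
    if not os.path.isfile(json_filename):
        # Not every splitter writes a JSON description of the split;
        # if there isn't one, there is nothing for this script to do.
        sys.exit(0)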
On Thu, Feb 16, 2012 at 4:28 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Hi Dan,
I think I need a little more advice - what is the role of the script scripts/extract_dataset_part.py and the JSON files created when splitting FASTQ files in lib/galaxy/datatypes/sequence.py, and then used by the class' process_split_file method?
Why is there no JSON file created by the base data class in lib/galaxy/datatypes/data.py and no method process_split_file?
Is the JSON thing part of a partial and unfinished rewrite of the splitter code?
On the assumption that not all splitters bother with the JSON, I am trying a little hack to scripts/extract_dataset_part.py to abort silently if there is no JSON file: https://bitbucket.org/peterjc/galaxy-central/changeset/ebe94a2c25c3
This seems to be working with my current attempt at a FASTA splitter (not checked in yet, only partly implemented and tested).
I've checked in my FASTA splitting, which now seems to be working OK with my BLAST tests. So far this only does splitting into chunks of the requested number of sequences, rather than the option to split the whole file into a given number of pieces.
https://bitbucket.org/peterjc/galaxy-central/changeset/416c961c0da9

I also need to look at merging multiple BLAST XML outputs, but this is looking promising.

Peter
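To make the idea concrete, this is roughly what chunking a FASTA file by a requested number of sequences amounts to - hypothetical standalone helpers, not the code in the changeset:

    def fasta_records(handle):
        """Iterate over FASTA records as strings (header line plus sequence lines)."""
        record = []
        for line in handle:
            if line.startswith(">") and record:
                yield "".join(record)
                record = []
            record.append(line)
        if record:
            yield "".join(record)

    def split_fasta(input_file, records_per_chunk, name_template="part_%i.fasta"):
        """Write chunks of at most records_per_chunk FASTA records; return the filenames."""
        filenames = []
        out = None
        with open(input_file) as handle:
            for count, record in enumerate(fasta_records(handle)):
                if count % records_per_chunk == 0:
                    if out:
                        out.close()
                    filenames.append(name_template % len(filenames))
                    out = open(filenames[-1], "w")
                out.write(record)
        if out:
            out.close()
        return filenames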
Very cool, I'll check it out! The addition of the JSON files is indeed very new and was likely unfinished with respect to the base splitter.

-Dannon
On Feb 16, 2012, at 12:24 PM, Peter Cock wrote:
I've checked in my FASTA splitting, which now seems to be working OK with my BLAST tests. So far this only does splitting into chunks of the requested number of sequences, rather than the option to split the whole file into a given number of pieces. https://bitbucket.org/peterjc/galaxy-central/changeset/416c961c0da9
Cool! Seems like a perfectly fine start. I guess you could grab the # of sequences from the dataset somehow (I'm guessing that is set somehow upon import into Galaxy).
I also need to look at merging multiple BLAST XML outputs, but this is looking promising.
Peter
Yep, that's definitely one where a simple concatenation wouldn't work (though NCBI used to think so, years ago…) chris
On Thu, Feb 16, 2012 at 6:42 PM, Fields, Christopher J <cjfields@illinois.edu> wrote:
On Feb 16, 2012, at 12:24 PM, Peter Cock wrote:
I've checked in my FASTA splitting, which now seems to be working OK with my BLAST tests.
(If this was unclear, I mean checked into my branch - I don't have commit privileges to the main repository. When/if this is ready I'll ask for it to be merged in though.)
So far this only does splitting into chunks of the requested number of sequences, rather than the option to split the whole file into a given number of pieces. https://bitbucket.org/peterjc/galaxy-central/changeset/416c961c0da9
Cool! Seems like a perfectly fine start. I guess you could grab the # of sequences from the dataset somehow (I'm guessing that is set somehow upon import into Galaxy).
Yes, I should be able to get that from Galaxy's metadata if known - much like how the FASTQ splitter works. It only needs to be an estimate anyway - which is what I think Galaxy does for large files - if we get it wrong then rather than using n sub-jobs as suggested, we might use n+1 or n-1.
I also need to look at merging multiple BLAST XML outputs, but this is looking promising.
Yep, that's definitely one where a simple concatenation wouldn't work (though NCBI used to think so, years ago…)
Well, given the NCBI's historic practice of producing 'XML' output which was the concatenation of several XML files, some tools will tolerate this out of practicality - the Biopython BLAST XML parser, for example. But yes, some care is needed over the header/footer to ensure a valid XML output is created by the merge. This may also require renumbering queries... I will check.

Peter
On Thu, Feb 16, 2012 at 9:02 PM, Peter wrote:
On Thu, Feb 16, 2012 at 6:42 PM, Chris wrote:
Cool! Seems like a perfectly fine start. I guess you could grab the # of sequences from the dataset somehow (I'm guessing that is set somehow upon import into Galaxy).
Yes, I should be able to get that from Galaxy's metadata if known - much like how the FASTQ splitter works. It only needs to be an estimate anyway - which is what I think Galaxy does for large files - if we get it wrong then rather than using n sub-jobs as suggested, we might use n+1 or n-1.
Done, and it seems to be working nicely now. If we don't know the sequence count, I divide the file based on the total size in bytes - which avoids any extra IO.
https://bitbucket.org/peterjc/galaxy-central/changeset/26a0c0aa776d

Taking advantage of this I have switched the BLAST tools from saying "split the query into batches of 500 sequences" (which worked fine but only gave benefits when doing genome scale queries) to just "split the query into four parts" (which will be done based on the sequence count if known, or the file size if not). This way any multi-query BLAST will get divided and run in parallel, not just the larger jobs. This gives a nice improvement (over yesterday's progress) with small tasks like 10 query sequences against a big database like NR or NT.
https://bitbucket.org/peterjc/galaxy-central/changeset/1fb89ae798be

Peter
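In other words, the decision about chunk sizes looks something like this (a sketch of the logic only; the function name and return convention are made up, not the changeset's code):

    def plan_split(num_parts, sequence_count=None, file_size=None):
        """Return ('sequences', n) or ('bytes', n) describing the target chunk size.

        Prefer Galaxy's sequence-count metadata when available; otherwise fall
        back to an approximate split by size in bytes (the writer must still only
        start new chunks at record boundaries, e.g. FASTA '>' header lines).
        """
        if sequence_count:
            return "sequences", -(-sequence_count // num_parts)  # ceiling division
        if file_size:
            return "bytes", -(-file_size // num_parts)
        raise ValueError("Need either a sequence count or a file size")

For example, plan_split(4, sequence_count=10) gives ('sequences', 3), so ten queries would run as four sub-jobs of at most three queries each.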
On Thu, Feb 16, 2012 at 9:02 PM, Peter wrote:
On Thu, Feb 16, 2012 at 6:42 PM, Chris wrote:
On Feb 16, 2012, at 12:24 PM, Peter wrote:
I also need to look at merging multiple BLAST XML outputs, but this is looking promising.
Yep, that's definitely one where a simple concatenation wouldn't work (though NCBI used to think so, years ago…)
Well, given the NCBI's historic practice of producing 'XML' output which was the concatenation of several XML files, some tools will tolerate this out of practicality - the Biopython BLAST XML parser, for example.
But yes, some care is needed over the header/footer to ensure a valid XML output is created by the merge. This may also require renumbering queries... I will check.
Basic BLAST XML merging implemented and apparently working:
https://bitbucket.org/peterjc/galaxy-central/changeset/ebf65c0b1e26

This does not currently attempt to remap the iteration numbers or automatically assigned query names, e.g. you can have this kind of thing in the middle of the XML at a merge point:

    <Iteration_iter-num>1</Iteration_iter-num>
    <Iteration_query-ID>Query_1</Iteration_query-ID>

That isn't a problem for some tools, e.g. my code in Galaxy to convert BLAST XML to tabular, but I suspect it could cause trouble elsewhere. If anyone has specific suggestions for what to test, that would be great. If this is an issue, then the merge code needs a little more work to edit these values.

I think the FASTA split code could be reviewed for inclusion though. Dan - do you want to look at that? Would a clean branch help?

Peter
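For anyone curious, the merge essentially keeps the header from the first file, the footer from the last, and concatenates the <Iteration> blocks in between. A simplified sketch (not the changeset's code, and with no renumbering of iterations or query IDs):

    def merge_blast_xml(split_files, output_file):
        """Naive merge of BLAST XML outputs produced from one split query file."""
        with open(output_file, "w") as out:
            for i, filename in enumerate(split_files):
                with open(filename) as handle:
                    text = handle.read()
                start = text.index("<Iteration>")
                end = text.rindex("</Iteration>") + len("</Iteration>")
                if i == 0:
                    out.write(text[:start])  # header, from the first file only
                out.write(text[start:end])   # all of this file's <Iteration> blocks
            out.write(text[end:])            # footer, taken from the last file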
Awesome, I'll take a look. And, if you're able to pull it together easily enough, clean branches are always nice.

-Dannon
On Wed, Feb 22, 2012 at 7:07 PM, <dannonbaker@me.com> wrote:
Awesome, I'll take a look. And, if you're able to pull it together easily enough, clean branches are always nice.
-Dannon
It is all on one new branch, but this covers FASTA splitting (ready), splitting in the BLAST+ wrapper (ready bar the merging datatypes), and XML merging (may need more work). It has also occurred to me I may need to implement HTML merging (or even remove this as a BLAST output option - do people use it?).
https://bitbucket.org/peterjc/galaxy-central/src/split_blast

All the commits should be self-contained, allowing the FASTA splitting bits to be transplanted/cherry-picked. If you want, I'll do that on a new branch focused on FASTA splitting only. But before I do, I'd appreciate any initial comments you might have from a first inspection.

Thanks,
Peter
On Feb 16, 2012, at 4:47 AM, Peter Cock wrote:
On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker <dannonbaker@me.com> wrote:
Good luck, let me know how it goes, and again - contributions are certainly welcome :)
I think I found the first bug, method split in lib/galaxy/datatypes/sequence.py for class Sequence assumes four lines per sequence. This would make sense as the split method of the Fastq class (after grooming to remove any line wrapping) but is a very bad idea on most sequence file formats (e.g. FASTA).
It looks like a little refactoring is needed, defining a Sequence split method which raises not implemented, and moving the current code to the Fastq class, then writing something similar but allowing multiple lines per record for the Fasta class.
Does that sound reasonable? I'll do this on a new branch for review...
Peter
Makes sense from my perspective; splits have to be defined based on data type. It could be as low-level as defining a simple iterator per record, then a wrapper that allows a specific chunk-size. The split file creation could almost be abstracted completely away into a common method. As Peter implies, maybe a simple API for defining a split method would be all that is needed. Might also be useful on any merge step, 'cat'-like merges won't work for every format but would be a suitable default. chris
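Chris's suggestion - a per-record iterator plus a chunk-size wrapper, with the chunk-file bookkeeping in one common place - could look something like this generic helper (hypothetical, generalising the FASTA example earlier in the thread):

    def split_by_records(record_iter, chunk_size, make_output):
        """Write records from any iterator into chunks of at most chunk_size records.

        make_output(i) should return a writable file-like object for chunk i.
        """
        out = None
        for count, record in enumerate(record_iter):
            if count % chunk_size == 0:
                if out:
                    out.close()
                out = make_output(count // chunk_size)
            out.write(record)
        if out:
            out.close()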
On Thu, Feb 16, 2012 at 1:53 PM, Fields, Christopher J <cjfields@illinois.edu> wrote:
Makes sense from my perspective; splits have to be defined based on data type. It could be as low-level as defining a simple iterator per record, then a wrapper that allows a specific chunk-size. The split file creation could almost be abstracted completely away into a common method.
I'm trying to understand exactly how the current code creates the splits, but yes - something like that is what I would expect.
As Peter implies, maybe a simple API for defining a split method would be all that is needed. Might also be useful on any merge step, 'cat'-like merges won't work for every format but would be a suitable default.
Yes, for a lot of file types concatenation is fine. Again, like the splitting, this has to be (and is) defined at the data type level (which is a hierarchy of classes in Galaxy).

Peter
Hi Dannon,

If I may further elaborate on this issue, I would like to mention that this kind of functionality is also supported by Sun Grid Engine in the form of 'array jobs'. With this functionality you can execute a job multiple times in an independent way, differing for instance only in the parameter settings. From your description below, it seems similar to the Galaxy parallelism tag. Is there, or do you foresee, any implementation of this SGE functionality through the DRMAA interface in Galaxy? If not, has anybody achieved this through some custom coding? We would be highly interested in this.

Thanks,
Bram

On 15/02/2012 18:08, Dannon Baker wrote:
It's definitely an experimental feature at this point, and there's no wiki, but basic support for breaking jobs into tasks does exist. It needs a lot more work and can go in a few different directions to make it better, but check out the wrappers with <parallelism> defined, and enable use_tasked_jobs in your universe_wsgi.ini and restart. That's all it should take from a fresh Galaxy install to get, IIRC, at least BWA and a few other tools working. If you want a super trivial example to play with, change the tool .xml for a text tool like "change case" to have <parallelism method="basic"></parallelism> and give that a shot.
If you decide to try this out, do keep in mind that this feature is not at all complete, and while there's a long list of things we still want to experiment with along these lines, suggestions (and especially contributions) are absolutely welcome.
-Dannon
--
Bram Slabbinck, PhD
Bioinformatics & Systems Biology Division, VIB
Department of Plant Systems Biology, UGent
Technologiepark 927, 9052 Gent, BELGIUM
Email: Bram.Slabbinck@psb.ugent.be
WWW: http://bioinformatics.psb.ugent.be
On Mon, Feb 20, 2012 at 8:08 AM, Bram Slabbinck <brsla@psb.vib-ugent.be> wrote:
Hi Dannon,
If I may further elaborate on this issue, I would like to mention that this kind of functionality is also supported by the Sun Grid Engine in the form of 'array jobs'. With this functionality you can execute a job multiple times in an independent way, only differing for instance in the parameter settings. From your description below, it seems similar to the Galaxy parallelism tag. Is there or do you foresee any implementation of this SGE functionality through the drmaa interface in Galaxy? If not, is there anybody who has achieved this through some custom coding? We would be highly interested in this.
thanks Bram
I was wondering why Galaxy submits N separate jobs to SGE after splitting (identical bar their working directory). I'm not sure if all the other supported cluster back ends can do this, but basic job dependencies are possible with SGE. That means the cluster could take care of scheduling the split jobs, the N processing jobs, and the final merge job (i.e. three stages where, for example, it won't do the merge till all the N processing jobs are finished).

My hunch is Galaxy is doing a lot of this 'housekeeping' internally in order to remain flexible regarding the cluster back end.

Peter
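Purely to illustrate what an SGE array job via the drmaa Python binding looks like (this is not something Galaxy currently does; the script name and paths here are made up, and each task would read $SGE_TASK_ID to pick its slice of the input):

    import drmaa

    s = drmaa.Session()
    s.initialize()
    try:
        jt = s.createJobTemplate()
        jt.remoteCommand = "./run_one_chunk.sh"        # hypothetical per-task script
        jt.workingDirectory = "/path/to/job/working/dir"
        # Submit tasks 1..8 as a single SGE array job; each task uses
        # $SGE_TASK_ID to decide which split file to process.
        job_ids = s.runBulkJobs(jt, 1, 8, 1)
        # A merge step could then wait for all of the tasks to finish:
        s.synchronize(job_ids, drmaa.Session.TIMEOUT_WAIT_FOREVER, False)
        s.deleteJobTemplate(jt)
    finally:
        s.exit()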
Peter has it right in that we need to do this internally to ensure functionality across a range of job runners. A side benefit is that it gives us direct access to the tasks so that we can eventually do interesting things with scheduling, resubmission, feedback, etc. If the overhead looks to be a performance issue I could see having an override that would allow pushing task scheduling to the underlying cluster, but that functionality would come later.

-Dannon
participants (5)
- Bram Slabbinck
- Dannon Baker
- dannonbaker@me.com
- Fields, Christopher J
- Peter Cock