Parallelism tag and job splitter
On Tue, Oct 30, 2012 at 11:20 PM, Edward Hills <ehills666@gmail.com> wrote:
Hi Galaxy-Team,
After reading a message on this mailing list about the job splitter I began to investigate what and how this is used. Unfortunately I have been unable to find any documentation on your website for it.
Am I blind and missing it or is it yet to be properly documented?
Sorry if this turns out to be a pointless exercise, but it would be extremely useful for my Galaxy development.
Cheers, Ed
On Wed, Oct 31, 2012 at 12:35 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Hi Ed,
To enable this you need to add a <parallelism> tag to the tool's XML file, and enable the feature in universe_wsgi.ini with something like this:
use_tasked_jobs = True
local_task_queue_workers = 4
I'm not aware of any documentation, I've been mostly working from the Python source code in order to get it to work on the BLAST+ wrappers and some of my other tool wrappers. In all the cases I've used there is a single FASTA file being split, sometimes some common input files which are unchanged, and a single output file being merged.
Peter
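For illustration only, here is a rough sketch of what such a tag might look like, held in a Python string and parsed with the standard library to list its settings. The attribute names (method, split_inputs, split_mode, split_size, merge_outputs) are recalled from the BLAST+ wrappers and should be treated as assumptions to check against the Galaxy source. The tool id and the parameter names query and output1 are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of a tool XML file. The attribute names are
# assumptions based on the BLAST+ wrappers, not a documented schema.
TOOL_XML = """
<tool id="example_tool" name="Example">
    <parallelism method="multi" split_inputs="query" split_mode="to_size"
                 split_size="1000" merge_outputs="output1"></parallelism>
</tool>
"""


def parallelism_settings(tool_xml):
    """Return the attributes of the first <parallelism> tag, or {} if absent."""
    root = ET.fromstring(tool_xml)
    tag = root.find("parallelism")
    return dict(tag.attrib) if tag is not None else {}


settings = parallelism_settings(TOOL_XML)
for key in sorted(settings):
    print(key, "=", settings[key])
```

Read this way, the tag says: split the input named query into chunks of up to 1000 records each, run the tool on every chunk, and merge the per-chunk results into output1.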
On Wednesday, October 31, 2012, Edward Hills wrote:
Thanks Peter.
My next question is: I have found that VCF files don't get split properly, as the header is not included in the second file, which tools such as vcf-subset usually require. I have read the code and am happy to implement this functionality, but am not too sure where this would best be done.
I see a class Text(data), which every datatype appears to be sent to. Would it be best to implement a VCF class which is called when the datatype is VCF?
Cheers, Ed
On Wed, Oct 31, 2012 at 9:46 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
VCF is, I assume, defined as a subclass of Text, so it inherits the naive splitting implemented for text files (which doesn't know about headers).
Have a look at the SAM splitting code (under lib/galaxy/datatypes/*.py) as an example where header-aware splitting was done. You'll probably need to implement something similar.
Peter
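As a minimal sketch of what header-aware splitting means here, the function below repeats the leading '#' lines at the top of every chunk. It is not Galaxy's API: the name split_with_header and its arguments are hypothetical, and it assumes the header is simply every '#' line before the first record.

```python
def split_with_header(lines, records_per_part):
    """Split lines into parts, repeating the leading '#' header in each part."""
    if records_per_part < 1:
        raise ValueError("records_per_part must be at least 1")
    header = []
    records = []
    for line in lines:
        # Only '#' lines seen before the first record count as header.
        if line.startswith("#") and not records:
            header.append(line)
        else:
            records.append(line)
    parts = []
    for start in range(0, len(records), records_per_part):
        parts.append(header + records[start:start + records_per_part])
    return parts


vcf_lines = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "chr1\t100\t.\tA\tG\n",
    "chr1\t200\t.\tC\tT\n",
    "chr2\t300\t.\tG\tA\n",
]
parts = split_with_header(vcf_lines, 2)
for number, part in enumerate(parts, start=1):
    print("part", number, "has", len(part), "lines")
```

Every part then starts with the same two header lines, so a tool such as vcf-subset can read the second chunk as a valid file in its own right.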
Hi Peter, thanks again.
Turns out that it has been implemented, by the looks of it, in lib/galaxy/datatypes/tabular.py under class Vcf.
However, despite this, it is always the Text class in data.py that is loaded and not the proper Vcf one.
Can you point me in the direction of where the type is chosen?
Cheers, Ed
On Thu, Nov 1, 2012 at 1:48 AM, Edward Hills <ehills666@gmail.com> wrote:
Hi Peter, thanks again.
Turns out that it has been implemented by the looks of it in lib/galaxy/datatypes/tabular.py under class Vcf.
Yes, looking at the Vcf class it lacks a merge method (the Sam class earlier in the file defines its own - do something similar).
However, despite this, it is always the Text class in data.py that is loaded and not the proper Vcf one.
Python inheritance means that if the Vcf class lacks a merge method, the call goes to the parent class's method (the Tabular class, if it had one), or the grandparent class's method (the Text class). So it is falling back on the Text merge, which doesn't know about headers. (As an aside, I would like the Tabular merge to be a bit more clever about #header lines, but this isn't trivial as some tabular files contain #comment lines too.)
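The fallback can be sketched with simplified stand-ins for the classes. These are not Galaxy's real classes or method signatures: merge here just takes a list of parts (each a list of lines), and HeaderAwareVcf is a hypothetical name for a Vcf class that defines its own merge.

```python
class Text:
    @staticmethod
    def merge(parts):
        """Naive merge: concatenate every line of every part."""
        return [line for part in parts for line in part]


class Tabular(Text):
    pass  # defines no merge of its own


class Vcf(Tabular):
    pass  # no merge either, so lookups fall through to Text.merge


class HeaderAwareVcf(Tabular):
    @staticmethod
    def merge(parts):
        """Keep '#' lines from the first part only.

        Assumes every '#' line in later parts is a repeated header,
        which need not hold for tabular files with '#' comment lines.
        """
        merged = []
        for index, part in enumerate(parts):
            for line in part:
                if index > 0 and line.startswith("#"):
                    continue
                merged.append(line)
        return merged


part1 = ["##fileformat=VCFv4.1", "#CHROM\tPOS", "chr1\t100"]
part2 = ["##fileformat=VCFv4.1", "#CHROM\tPOS", "chr2\t300"]

naive = Vcf.merge([part1, part2])
aware = HeaderAwareVcf.merge([part1, part2])
print(len(naive), "lines from the inherited Text merge")
print(len(aware), "lines from the header-aware merge")
```

With the inherited merge the header lines appear twice in the output. Defining merge on the subclass is enough to take over, because Python looks up the method on the most derived class first.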
Can you point me in the direction of where the type is chosen?
Your tool's XML should specify the output format, although there could be complications if, for example, you are doing a dynamic format selection based on one of the parameters.
Peter
participants (2)
- Edward Hills
- Peter Cock