Greg,
It would be great if there were a way to expand upon the core datatypes using the ToolShed.
Would it be possible to have a separate datatype repository within the ToolShed?
Datatype name="" description="" datatype_dependencies=[] definition=<python code>
The tool config could be expanded to have a requirement for datatypes:
    <requirement type="datatype">ssmap</requirement>
Table datatype
   Column    |          Type          |                       Modifiers
-------------+------------------------+-------------------------------------------------------
 id          | integer                | not null default nextval('datatype_id_seq'::regclass)
 name        | character varying(255) |
 version     | character varying(40)  |
 description | text                   |
 definition  | text                   |
UNIQUE (name)

Table datatype_datatype_association
   Column    |  Type   |                       Modifiers
-------------+---------+-------------------------------------------------------
 id          | integer | not null default nextval('datatype_id_seq'::regclass)
 datatype_id | integer |
 requires_id | integer |
FOREIGN KEY (datatype_id) REFERENCES datatype(id)
FOREIGN KEY (requires_id) REFERENCES datatype(id)
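To make the two tables above a bit more concrete, here is a minimal sketch of how they could be populated and queried. It uses SQLite in place of PostgreSQL so that it runs standalone, and the helper names add_datatype and required_datatypes are made up for illustration; nothing here is existing ToolShed code.

```python
import sqlite3

# Assumption: SQLite stands in for the PostgreSQL tables proposed above.
SCHEMA = """
CREATE TABLE datatype (
    id INTEGER PRIMARY KEY,
    name VARCHAR(255) UNIQUE,
    version VARCHAR(40),
    description TEXT,
    definition TEXT
);
CREATE TABLE datatype_datatype_association (
    id INTEGER PRIMARY KEY,
    datatype_id INTEGER REFERENCES datatype(id),
    requires_id INTEGER REFERENCES datatype(id)
);
"""


def add_datatype(conn, name, version, description, definition, requires=()):
    """Insert a datatype row plus one association row per required datatype."""
    # Resolve every dependency first so nothing is inserted if one is missing.
    required_ids = []
    for required_name in requires:
        row = conn.execute(
            "SELECT id FROM datatype WHERE name = ?", (required_name,)
        ).fetchone()
        if row is None:
            raise ValueError("unknown datatype dependency: %s" % required_name)
        required_ids.append(row[0])
    cur = conn.execute(
        "INSERT INTO datatype (name, version, description, definition) "
        "VALUES (?, ?, ?, ?)",
        (name, version, description, definition),
    )
    datatype_id = cur.lastrowid
    for required_id in required_ids:
        conn.execute(
            "INSERT INTO datatype_datatype_association (datatype_id, requires_id) "
            "VALUES (?, ?)",
            (datatype_id, required_id),
        )
    return datatype_id


def required_datatypes(conn, name):
    """Return the names of the datatypes that `name` directly requires."""
    rows = conn.execute(
        """
        SELECT required.name
        FROM datatype AS d
        JOIN datatype_datatype_association AS a ON a.datatype_id = d.id
        JOIN datatype AS required ON required.id = a.requires_id
        WHERE d.name = ?
        ORDER BY required.name
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
add_datatype(conn, "tabular", "1.0", "Tab-delimited text", "")
add_datatype(conn, "ssmap", "1.0", "Secondary Structure Map", "", requires=["tabular"])
print(required_datatypes(conn, "ssmap"))  # ['tabular']
```

The association table only records direct requirements; resolving a full dependency chain would mean following requires_id repeatedly.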
Then for my mothur metagenomics tools I could define:
name="ssmap"
description="Secondary Structure Map"
version="1.0"
datatype_dependencies=[tabular]
definition=
    from galaxy.datatypes.tabular import Tabular

    class SecondaryStructureMap(Tabular):
        file_ext = 'ssmap'

        def __init__(self, **kwd):
            """Initialize secondary structure map datatype"""
            Tabular.__init__( self, **kwd )
            self.column_names = ['Map']

        def sniff( self, filename ):
            """
            Determines whether the file is a secondary structure map format.
            A single column with an integer value which indicates the row
            that this row maps to. Check to make sure that if
            structMap[10] = 380 then structMap[380] = 10.
            """
            ...
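The symmetry check described in the sniff docstring could be sketched as a standalone function along the following lines. This is only an illustration, not the actual sniffer: the function name is made up, and it assumes that rows are numbered from 1 and that a value of 0 marks a row that maps to nothing, neither of which is stated above.

```python
def is_symmetric_structure_map(lines):
    """Return True if every line holds one integer and the mapping is symmetric.

    Assumptions (not from the original message): rows are numbered from 1,
    and a value of 0 means the row does not map to any other row.
    """
    struct_map = [0]  # placeholder at index 0 so that rows are 1-based
    for line in lines:
        fields = line.split()
        if len(fields) != 1:
            return False  # not a single-column line
        try:
            struct_map.append(int(fields[0]))
        except ValueError:
            return False  # the column is not an integer
    size = len(struct_map) - 1
    if size == 0:
        return False  # an empty file is not a structure map
    for row in range(1, size + 1):
        partner = struct_map[row]
        if partner == 0:
            continue
        # If structMap[row] = partner then structMap[partner] must be row.
        if partner < 1 or partner > size or struct_map[partner] != row:
            return False
    return True
```

A real sniff method would likely read only the first part of the file rather than every line, in which case partners beyond the portion read could not be verified.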
Then the align.check.xml tool_config could require the 'ssmap' datatype:
<tool id="mothur_align_check" name="Align.check" version="1.19.0">
    <description>Calculate the number of potentially misaligned bases</description>
    <requirements>
        <requirement type="binary">mothur</requirement>
        <requirement type="datatype">ssmap</requirement>
    </requirements>
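As a rough sketch of how an installer might act on such a requirement, the snippet below reads the datatype requirements out of a tool config and reports any that are not yet known. Note that type="datatype" is the proposal from this message, not an existing requirement type, and the function names are hypothetical. The closing </tool> tag is added only so that the fragment parses.

```python
import xml.etree.ElementTree as ET

# The tool config fragment from above, closed so that it is well-formed XML.
TOOL_XML = """
<tool id="mothur_align_check" name="Align.check" version="1.19.0">
  <description>Calculate the number of potentially misaligned bases</description>
  <requirements>
    <requirement type="binary">mothur</requirement>
    <requirement type="datatype">ssmap</requirement>
  </requirements>
</tool>
""".strip()


def datatype_requirements(tool_xml):
    """Return the datatype names listed in <requirement type="datatype"> tags."""
    root = ET.fromstring(tool_xml)
    return [
        (req.text or "").strip()
        for req in root.findall("./requirements/requirement")
        if req.get("type") == "datatype"
    ]


def missing_datatypes(tool_xml, installed_extensions):
    """Return the required datatypes that are not in installed_extensions."""
    return [
        name
        for name in datatype_requirements(tool_xml)
        if name not in installed_extensions
    ]


print(missing_datatypes(TOOL_XML, {"tabular", "fasta"}))  # ['ssmap']
```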
John,
I've been following this message thread, and it seems to have gone in a direction that differs from your initial question about whether Galaxy could automatically edit the datatypes_conf.xml file when certain Galaxy tool shed tools are automatically installed. There are some complexities to consider in attempting this. One of them is that the work of adding support for a new datatype to Galaxy lies outside the intended function of the tool shed. If new support is added to the Galaxy code base, an entry for that new datatype should be manually added to the table at the same time. There may be benefits to enabling automatic changes to datatype entries that already exist in the file (e.g., adding a new converter for an existing datatype entry), but adding a completely new datatype to the file may not be appropriate. I'll continue to think about this. Please send additional thoughts and feedback, as doing so is always helpful.
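For reference, the kind of automatic edit under discussion might look roughly like the sketch below, which appends a datatype entry only when no entry with that extension already exists. It assumes the file has a <datatypes> root containing a <registration> element whose <datatype> children carry extension and type attributes; the function name and the ssmap module path are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed minimal layout of a datatypes_conf.xml file.
SAMPLE_CONF = """
<datatypes>
  <registration>
    <datatype extension="tabular" type="galaxy.datatypes.tabular:Tabular"/>
  </registration>
</datatypes>
""".strip()


def register_datatype(conf_xml, extension, type_path):
    """Return (xml_text, changed); an existing entry is never overwritten."""
    root = ET.fromstring(conf_xml)
    registration = root.find("registration")
    if registration is None:
        raise ValueError("no <registration> element found")
    for entry in registration.findall("datatype"):
        if entry.get("extension") == extension:
            return conf_xml, False  # already registered, leave the file alone
    ET.SubElement(
        registration, "datatype", {"extension": extension, "type": type_path}
    )
    return ET.tostring(root, encoding="unicode"), True


# Hypothetical module path for the new datatype class.
new_conf, changed = register_datatype(
    SAMPLE_CONF, "ssmap", "galaxy.datatypes.metagenomics:SecondaryStructureMap"
)
print(changed)  # True
```

This sketch only touches the configuration file; the Python class that the type attribute points to would still have to be importable by Galaxy.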
Thanks!
Greg
On Oct 5, 2011, at 11:48 PM, Duddy, John wrote:
One of the things we’re facing is the sheer size of a whole human genome at 30x coverage. An effective way to deal with that is by compressing the FASTQ files. That works for BWA and our ELAND, which can directly read a compressed FASTQ, but other tools crash when reading compressed FASTQ files. One way to address that would be to introduce a new type, for example “CompressedFastQ”, with a conversion to FASTQ defined. BWA could take both types as input. This would allow the best of both worlds – efficient storage and use by all existing tools.
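The conversion step for such a type could be quite small. The following is a minimal sketch assuming gzip compression (the compression format is not named above); the function names are made up for illustration.

```python
import gzip
import shutil


def is_gzip(path):
    """Return True if the file starts with the two-byte gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"


def compressed_fastq_to_fastq(src_path, dst_path):
    """Decompress a gzip-compressed FASTQ file into a plain FASTQ file."""
    # Stream in chunks so that a whole-genome file never has to fit in memory.
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

A tool that reads compressed input directly would accept both types, while every other tool would receive the output of the converter.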
Another example would be adding the CASAVA tools to Galaxy. Some of the statistics generation tools use custom file formats. To be able to make the use of those tools optional and configurable, they should be separate from the aligner, but that would require that Galaxy be made aware of the custom file formats – we’d have to add a datatype.
John Duddy Sr. Staff Software Engineer Illumina, Inc. 9885 Towne Centre Drive San Diego, CA 92121 Tel: 858-736-3584 E-mail: jduddy at illumina.com
From: Greg Von Kuster [mailto:greg at bx.psu.edu] Sent: Wednesday, October 05, 2011 6:25 PM To: Duddy, John Cc: galaxy-dev at lists.bx.psu.edu Subject: Re: [galaxy-dev] Tool shed and datatypes
Hello John,
The Galaxy tool shed currently is not enabled to automatically edit the datatypes_conf.xml file, although I could add this feature if the need exists. Can you elaborate on what you are looking to do regarding this?
Thanks!
On Oct 5, 2011, at 1:52 PM, Duddy, John wrote:
Can we introduce new file types via tools in the tool shed? It seems Galaxy can load them if they are in the datatypes configuration file. Does tool installation automate the editing of that file?
John Duddy Sr. Staff Software Engineer Illumina, Inc. 9885 Towne Centre Drive San Diego, CA 92121 Tel: 858-736-3584 E-mail: jduddy at illumina.com
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
Greg Von Kuster Galaxy Development Team greg at bx.psu.edu
Hi all,
Just my 2 cents.
This is a really great idea to have dynamically (down-)loadable datatypes, and a tool config tag to express a datatype dependency is right on the money. I agree with Greg in having hesitations about adding that feature though. The purpose (at least as far as I see it) of the tool shed is to allow the community to share its productivity. New tools written by one group can be used by another group that may not have adequate skill, resources, or time to create the same tool on their own. One issue this model can suffer from, however, is over-proliferation of contributions. In this case, new tools with the same, overlapping, or very similar functions might be developed independently by multiple groups who then want to contribute to the tool shed. I don't know how often this situation arises or what official contingencies are in place to manage it, but it is important to manage that situation carefully. If it occurs with any appreciable frequency, then eventually there are many clusters of tools available that do almost the same thing but not quite. This is bad for the user, bad for the maintainer, complicates communication between researchers, etc. This model can work nicely if the frequency of very similar tool submissions is small, and even better if there is some management for cleaning out broken or redundant tools.
When you allow custom datatypes to enter the picture, however, the story can become hairy much more quickly. Having a limited set of officially supplied / supported datatypes forces the contributors of new tools to use datatypes drawn from a standard set. Without that constraint, the number of datatype variants could explode. Now the concern is not only that multiple contributors may submit very similar tool variants, or that each of them might choose to create their own datatypes to optimize their methods, but also that contributors of tools which are functionally dissimilar but manipulate the same general types of data will write their tools using new datatypes that are variants of each other. Tools are essentially typed by the datatypes they accept and produce, so you won't be able to chain these tools together very easily at all. Most pairs of tools will have the "wrong" datatype, on input or output, for what a user wants to do. The general trend is then proliferation of clusters of redundant tools, clusters of redundant datatypes, and growing sparsity in the "tool graph" (think of datatypes as vertices and tools as directed [hyper]edges).
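To illustrate the "tool graph" picture, here is a small sketch that treats datatypes as vertices and tools as directed edges, then searches for a chain of tools between two datatypes. It simplifies hyperedges to tools with one input and one output, and every tool and datatype name in it is made up.

```python
from collections import deque

# (tool name, input datatype, output datatype); all names are hypothetical.
TOOLS = [
    ("align", "fasta", "align"),
    ("align_check", "align", "tabular"),
    ("summarize", "tabular", "txt"),
    ("variant_align", "fasta", "align_v2"),
]


def find_tool_chain(tools, source, target):
    """Breadth-first search for the shortest tool chain from source to target.

    Returns a list of tool names, or None if the datatypes are not connected.
    """
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        datatype, chain = queue.popleft()
        if datatype == target:
            return chain
        for name, consumes, produces in tools:
            if consumes == datatype and produces not in seen:
                seen.add(produces)
                queue.append((produces, chain + [name]))
    return None


print(find_tool_chain(TOOLS, "fasta", "txt"))
# ['align', 'align_check', 'summarize']
print(find_tool_chain(TOOLS, "align_v2", "tabular"))
# None: the variant datatype is a dead end with no tool that consumes it
```

The second call shows the sparsity problem in miniature: a variant datatype that no other tool accepts cuts its producer off from the rest of the graph.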
So, a move in the direction of supporting something like a "TypeShed" would require careful consideration and would have to include at least either a well-defined policy for managing *Shed rot and the capability to execute it, or a very slick tool / datatype versioning system with flexible control for users and some equally slick method for maintaining implicit conversions between the datatypes in a datatype cluster (ideally automatically generated). I think at least the implicit conversion part can be done, if not in a fully automated manner, then by a combination of policy and engineering. For policy, you can define, identify, or construct a canonical datatype in each cluster and require that a contributor of a variant datatype submit methods for implicit conversion to/from the canonical datatype in that cluster. One idea that could help reduce complexity is to place some additional structure on datatypes and take the canonical datatype for a cluster to be a form of the union (mathematical, not the "union" from C) of the variants in the cluster, which would simplify implicit conversions somewhat. Or, if there's some reason for this, there can also be a set of "canonical" datatypes for each cluster, so long as they are all guaranteed to be mutually implicitly convertible. For a policy to manage *Shed rot, the most direct approach is to moderate and require approval for each submission, but I could imagine that responsibility quickly overwhelming the poor team responsible. Unless I drastically overestimate the frequency with which submissions might be made (which is entirely possible), that poor team's operations could wind up looking not unlike the USPTO.
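The canonical-datatype policy can be sketched as a small registry in which each variant supplies only a converter to and from the canonical form, so that any variant-to-variant conversion is the composition of two steps. The class name, datatype names, and the naive delimited-text converters below are all invented for illustration.

```python
class DatatypeCluster:
    """Hub-and-spoke conversions: every variant converts via one canonical type."""

    def __init__(self, canonical):
        self.canonical = canonical
        # The canonical datatype converts to and from itself unchanged.
        self._to_canonical = {canonical: lambda data: data}
        self._from_canonical = {canonical: lambda data: data}

    def register(self, name, to_canonical, from_canonical):
        """Register a variant together with both of its required converters."""
        self._to_canonical[name] = to_canonical
        self._from_canonical[name] = from_canonical

    def convert(self, data, source, target):
        """Convert data from source to target by way of the canonical datatype."""
        if source not in self._to_canonical:
            raise KeyError("unknown source datatype: %s" % source)
        if target not in self._from_canonical:
            raise KeyError("unknown target datatype: %s" % target)
        return self._from_canonical[target](self._to_canonical[source](data))


def parse_delimited(text, delimiter):
    """Naive parser (no quoting rules) into a list of rows of strings."""
    return [line.split(delimiter) for line in text.splitlines()]


def format_delimited(rows, delimiter):
    """Inverse of parse_delimited."""
    return "\n".join(delimiter.join(row) for row in rows)


# Hypothetical cluster whose canonical datatype is a list of rows.
cluster = DatatypeCluster("rows")
cluster.register(
    "csv",
    lambda text: parse_delimited(text, ","),
    lambda rows: format_delimited(rows, ","),
)
cluster.register(
    "tsv",
    lambda text: parse_delimited(text, "\t"),
    lambda rows: format_delimited(rows, "\t"),
)
print(repr(cluster.convert("a,b\nc,d", "csv", "tsv")))  # 'a\tb\nc\td'
```

With n variants this needs 2n converters instead of one for every ordered pair, which is the main engineering appeal of requiring a canonical datatype per cluster.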
Anyway, my general point is that there are many non-trivial factors to consider in the question of creating a TypeShed. But, if done right, the benefits could be huge, besides the likely awesomeness of the engineering involved.
Finally, let me echo Greg again, and say to please send additional thought and feedback. What do you think about the points I raised? What else is there to consider that hasn't occurred to me yet? What would be the benefits and potential pitfalls?
Best, Eric
________________________________________ From: galaxy-dev-bounces@lists.bx.psu.edu [galaxy-dev-bounces@lists.bx.psu.edu] on behalf of Jim Johnson [jj@umn.edu] Sent: Friday, October 07, 2011 2:06 PM To: galaxy-dev@lists.bx.psu.edu Cc: Greg Von Kuster Subject: Re: [galaxy-dev] Tool shed and datatypes
I agree with the risks you cited.
There is a risk in the other direction that I think is even scarier - without the ability to add data types, tool authors may be forced to use a "typeless" system, declaring all inputs/outputs as "data" or "text". While this works, it has the same drawbacks as typeless programming languages - deferring error detection to runtime, impairing the ability to perform static analysis, inability to perform transparent type conversions - in other words, the tools have to take over responsibilities from the framework.
Like all interesting problems, I don't think there is an "obviously right" answer ;-}
John Duddy Sr. Staff Software Engineer Illumina, Inc. 9885 Towne Centre Drive San Diego, CA 92121 Tel: 858-736-3584 E-mail: jduddy@illumina.com
-----Original Message----- From: galaxy-dev-bounces@lists.bx.psu.edu [mailto:galaxy-dev-bounces@lists.bx.psu.edu] On Behalf Of Paniagua, Eric Sent: Friday, October 07, 2011 5:53 PM To: jj@umn.edu; galaxy-dev@lists.bx.psu.edu Cc: Greg Von Kuster Subject: Re: [galaxy-dev] Tool shed and datatypes
On Mon, Oct 10, 2011 at 5:09 PM, Duddy, John jduddy@illumina.com wrote:
I agree with the risks you cited.
There is a risk in the other direction that I think is even scarier - without the ability to add data types, tool authors may be forced to use a "typeless" system, declaring all inputs/outputs as "data" or "text". While this works, it has the same drawbacks as typeless programming languages - deferring error detection to runtime, impairing the ability to perform static analysis, inability to perform transparent type conversions - in other words, the tools have to take over responsibilities from the framework.
Like all interesting problems, I don't think there is an "obviously right" answer ;-}
John Duddy
Indeed. I'm going with lobbying Galaxy to include new datatypes when I need them (InterProScan XML is on my todo list, perhaps v4 and v5 as two types), but I've been able to get along with "tabular" as a tool output.
Peter
There are a number of well-defined formats that are exchanged between applications, e.g. BAM, gtf, etc. I wouldn't advocate proliferating those.
I see the need for Toolshed datatypes more for the intermediate file formats used within a suite of commands. These can be helpful in guiding a user to select appropriate inputs for successive steps in an analysis.
For example, when developing the 90-some tool wrappers for the mothur metagenomic package, I found that there are many file formats that get passed among the mothur commands. It greatly simplifies the user's experience if the outputs are typed so as to correctly filter the acceptable inputs to another command. I fear the amount of time I would spend providing user support if the outputs and inputs were generically typed.
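The filtering effect described here can be sketched in a few lines. The snippet below assumes a simple parent map between datatype extensions (ssmap deriving from tabular, as in the earlier proposal) and uses made-up dataset names; it is not how Galaxy actually implements the check.

```python
# Hypothetical datatype hierarchy: each extension maps to its parent.
PARENT = {"ssmap": "tabular", "tabular": "data", "fasta": "data", "data": None}


def is_a(ext, accepted):
    """Return True if ext equals accepted or derives from it."""
    while ext is not None:
        if ext == accepted:
            return True
        ext = PARENT.get(ext)
    return False


def selectable_inputs(history, accepted_formats):
    """Return the dataset names whose datatype matches any accepted format."""
    return [
        name
        for name, ext in history
        if any(is_a(ext, fmt) for fmt in accepted_formats)
    ]


history = [("seqs.fasta", "fasta"), ("map.ssmap", "ssmap"), ("counts.tsv", "tabular")]
print(selectable_inputs(history, ["ssmap"]))    # ['map.ssmap']
print(selectable_inputs(history, ["tabular"]))  # ['map.ssmap', 'counts.tsv']
print(selectable_inputs(history, ["data"]))     # all three datasets
```

The last call is the generically typed case: every dataset in the history is offered, and the user has to know which one is correct.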
I'm also seeing a similar need as I am creating tool wrappers for the GMAP/GSNAP mapping commands. While input to GSNAP and GMAP can be fastq and output can be in SAM format, some of the more interesting use cases involve creating additional map stores, where specific datatypes would guide the user in setting the tool parameters correctly.
JJ
James E Johnson Minnesota Supercomputing Institute, University of Minnesota
On 10/10/11 11:09 AM, Duddy, John wrote:
I agree with the risks you cited.
There is a risk in the other direction that I think is even scarier - without the ability to add data types, tool authors may be forced to use a "typeless" system, declaring all inputs/outputs as "data" or "text". While this works, it has the same drawbacks as typeless programming languages - deferring error detection to runtime, impairing the ability to perform static analysis, inability to perform transparent type conversions - in other words, the tools have to take over responsibilities from the framework.
Like all interesting problems, I don't think there is an "obviously right" answer ;-}
John Duddy Sr. Staff Software Engineer Illumina, Inc. 9885 Towne Centre Drive San Diego, CA 92121 Tel: 858-736-3584 E-mail: jduddy@illumina.com
-----Original Message----- From: galaxy-dev-bounces@lists.bx.psu.edu [mailto:galaxy-dev-bounces@lists.bx.psu.edu] On Behalf Of Paniagua, Eric Sent: Friday, October 07, 2011 5:53 PM To: jj@umn.edu; galaxy-dev@lists.bx.psu.edu Cc: Greg Von Kuster Subject: Re: [galaxy-dev] Tool shed and datatypes
Hi all,
Just my 2 cents.
This is a really great idea to have dynamically (down-)loadable datatypes, and a tool config tag to express a datatype dependency is right on the money. I agree with Greg in having hesitations about adding that feature though. The purpose (at least as far I see it) of the tool shed is to allow the community to share its productivity. New tools written by one group can be used by another group that may not have adequate skill, resources, or time to create the same tool on their own. One issue this model can suffer from, however, is over-proliferation of contributions. In this case, new tools with the same, overlapping, or very similar functions might be developed independently by multiple groups who then want to contribute to the tool shed. I don't know how often this situation arises or what official contingencies are in place to manage them, but it is important to manage that situation carefully. If it occurs with any appreciable frequency, then eventually there are many clusters of tools available that do almost the same thing but not quite. This is bad for the user, bad for the maintainer, complicates communication between researchers, etc. This model can work nicely if the frequency of very simliar tool submissions is small, and even better if there is some management for cleaning out broken or redundant tools.
When you allow custom datatypes to enter the picture, however, the story can become hairy much more quickly. Having a limited set of officially supplied / supported datatypes forces the contributors of new tools to use datatypes drawn from a standard set. Without that constraint, the number of datatype variants could explode. Now the concern is not only that multiple contributors may submit very similar tool variants, or that each of them might choose to create their own datatypes to optimize their methods, but also that contributors of tools which are functionally dissimilar but manipulate the same general types of data will write their tools using new datatypes that are variants of each other. Tools are essentially typed by the datatypes they accept and produce, so you won't be able to chain these tools together very easily at all. Most pairs of tools will have the "wrong" datatype, on input or output, for what a user wants to do. The general trend is then proliferation of clusters of redundant tools, clusters of redundant datatypes, and growing sparsity in the "tool graph" (think of datatypes as vertices and tools as directed [hyper]edges).
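The "tool graph" picture can be sketched in a few lines. This is only an illustration under simplifying assumptions: each tool is a plain directed edge with one input and one output datatype (not a hyperedge), and the tool and datatype names are hypothetical.

```python
from collections import deque

# Hypothetical tools: name -> (input datatype, output datatype).
TOOLS = {
    "align": ("fastq", "sam"),
    "sam_to_bam": ("sam", "bam"),
    "call_variants": ("bam", "vcf"),
    "custom_caller": ("mybam", "myvcf"),  # variant datatype that no other tool produces
}


def reachable(start, tools):
    """Breadth-first search over datatypes, following tools as directed edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for source, target in tools.values():
            if source == current and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


from_fastq = reachable("fastq", TOOLS)
```

Here `custom_caller` is installed but can never be chained after `align`, because nothing produces `mybam`. Each such variant datatype adds vertices to the graph without adding usable paths, which is the sparsity described above.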
So, a move in the direction of supporting something like a "TypeShed" would require careful consideration and consist of at least either a well-defined policy for managing *Shed rot and the capability to execute it, or a very slick tool / datatype versioning system with flexible control for users and an equally slick method for maintaining implicit conversions between the datatypes in a datatype cluster (ideally automatically generated). I think at least the implicit conversion part can be done, if not in a fully automated manner, then by a combination of policy and engineering. For policy, you can define, identify, or construct a canonical datatype in each cluster and require that a contributor of a variant datatype submit methods for implicit conversion to/from the canonical datatype in that cluster. One idea that could help reduce complexity is to potentially place some additional structure on datatypes and take the canonical datatype for a cluster to be a form of the union (mathematical, not the "union" from C) of the variants in the cluster, which would simplify implicit conversions somewhat. Or, if there's some reason for this, there can also be a set of "canonical" datatypes for each cluster, so long as they are all guaranteed to be mutually implicitly convertible. For a policy to manage *Shed rot, the most direct approach is to moderate and require approval for each submission, but I could imagine that responsibility quickly overwhelming the poor team responsible. Unless I drastically overestimate the frequency with which submissions might be made (which is entirely possible), that poor team's operations could wind up looking not unlike the USPTO.
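The canonical-datatype policy amounts to a hub-and-spoke conversion scheme. The following is a minimal sketch of that idea, not an existing Galaxy mechanism; the `Cluster` class and the toy interval formats are assumptions made up for the example.

```python
class Cluster:
    """Hypothetical datatype cluster with one canonical form acting as a conversion hub."""

    def __init__(self, canonical):
        self.canonical = canonical
        self._to_canonical = {canonical: lambda value: value}
        self._from_canonical = {canonical: lambda value: value}

    def register(self, name, to_canonical, from_canonical):
        """A contributor of a variant supplies both directions, as the policy would require."""
        self._to_canonical[name] = to_canonical
        self._from_canonical[name] = from_canonical

    def convert(self, value, source, target):
        """Convert between any two registered variants by way of the canonical form."""
        return self._from_canonical[target](self._to_canonical[source](value))


# Toy cluster: intervals stored canonically as 0-based, half-open (start, end) tuples.
intervals = Cluster("zero_based")
intervals.register(
    "one_based",  # 1-based, closed coordinates
    to_canonical=lambda iv: (iv[0] - 1, iv[1]),
    from_canonical=lambda iv: (iv[0] + 1, iv[1]),
)
intervals.register(
    "start_length",  # 0-based start plus length
    to_canonical=lambda iv: (iv[0], iv[0] + iv[1]),
    from_canonical=lambda iv: (iv[0], iv[1] - iv[0]),
)

converted = intervals.convert((11, 20), "one_based", "start_length")
round_trip = intervals.convert(converted, "start_length", "one_based")
```

The appeal of the hub is the converter count: n variants need 2n converters to and from the canonical form, rather than n(n-1) pairwise converters.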
Anyway, my general point is that there are many non-trivial factors to consider in the question of creating a TypeShed. But, if done right, the benefits could be huge, besides the likely awesomeness of the engineering involved.
Finally, let me echo Greg again, and say to please send additional thought and feedback. What do you think about the points I raised? What else is there to consider that hasn't occurred to me yet? What would be the benefits and potential pitfalls?
Best, Eric
From: galaxy-dev-bounces@lists.bx.psu.edu [galaxy-dev-bounces@lists.bx.psu.edu] on behalf of Jim Johnson [jj@umn.edu] Sent: Friday, October 07, 2011 2:06 PM To: galaxy-dev@lists.bx.psu.edu Cc: Greg Von Kuster Subject: Re: [galaxy-dev] Tool shed and datatypes
Greg,
It would be great if there were a way to expand upon the core datatypes using the ToolShed.
Would it be possible to have a separate datatype repository within the ToolShed?
Datatype
    name=""
    description=""
    datatype_dependencies=[]
    definition=<python code>
The tool config could be expanded to have a requirement for datatypes:

    <requirement type="datatype">ssmap</requirement>
Table datatype

   Column    |          Type          |                       Modifiers
-------------+------------------------+-------------------------------------------------------
 id          | integer                | not null default nextval('datatype_id_seq'::regclass)
 name        | character varying(255) |
 version     | character varying(40)  |
 description | text                   |
 definition  | text                   |

UNIQUE (name)

Table datatype_datatype_association

   Column    |  Type   |                       Modifiers
-------------+---------+-------------------------------------------------------
 id          | integer | not null default nextval('datatype_id_seq'::regclass)
 datatype_id | integer |
 requires_id | integer |

FOREIGN KEY (datatype_id) REFERENCES datatype(id)
FOREIGN KEY (requires_id) REFERENCES datatype(id)
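As a rough illustration of how these two tables could drive installation order, here is a sketch that uses an in-memory SQLite database in place of the PostgreSQL schema shown above. The `load_order` helper and the sample rows are hypothetical, and the sketch assumes the dependency data contains no cycles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE datatype (
        id INTEGER PRIMARY KEY,
        name VARCHAR(255) UNIQUE,
        version VARCHAR(40),
        description TEXT,
        definition TEXT
    );
    CREATE TABLE datatype_datatype_association (
        id INTEGER PRIMARY KEY,
        datatype_id INTEGER REFERENCES datatype(id),
        requires_id INTEGER REFERENCES datatype(id)
    );
    INSERT INTO datatype (id, name, version) VALUES (1, 'tabular', '1.0');
    INSERT INTO datatype (id, name, version) VALUES (2, 'ssmap', '1.0');
    INSERT INTO datatype_datatype_association (datatype_id, requires_id) VALUES (2, 1);
    """
)


def load_order(conn, name):
    """Return datatype names with every dependency listed before its dependent."""
    order = []

    def visit(datatype_name):
        if datatype_name in order:
            return
        rows = conn.execute(
            """
            SELECT required.name
            FROM datatype AS d
            JOIN datatype_datatype_association AS a ON a.datatype_id = d.id
            JOIN datatype AS required ON required.id = a.requires_id
            WHERE d.name = ?
            """,
            (datatype_name,),
        ).fetchall()
        for (required_name,) in rows:
            visit(required_name)  # assumes no dependency cycles
        order.append(datatype_name)

    visit(name)
    return order


ssmap_order = load_order(conn, "ssmap")
```

Requesting `ssmap` yields `tabular` first, so a parent class would be loaded before any subclass that imports it.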
Then for my mothur metagenomics tools I could define:
name="ssmap"
description="Secondary Structure Map"
version="1.0"
datatype_dependencies=[tabular]
definition=

    from galaxy.datatypes.tabular import Tabular

    class SecondaryStructureMap(Tabular):
        file_ext = 'ssmap'

        def __init__(self, **kwd):
            """Initialize secondary structure map datatype"""
            Tabular.__init__( self, **kwd )
            self.column_names = ['Map']

        def sniff( self, filename ):
            """
            Determines whether the file is a secondary structure map format.
            A single column with an integer value which indicates the row that this row maps to.
            Check to make sure that if structMap[10] = 380 then structMap[380] = 10.
            """
            ...
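The symmetry rule in the docstring can be sketched as a standalone check. This is not the mothur_toolsuite implementation; it takes an iterable of lines rather than a filename so it runs without Galaxy, and it assumes 1-based row numbers with 0 meaning "unpaired", neither of which is stated in the message above.

```python
def looks_like_ssmap(lines):
    """Return True if lines form a symmetric single-column integer map.

    Assumptions: rows are numbered from 1, and a value of 0 marks an unpaired row.
    """
    values = []
    for line in lines:
        fields = line.split()
        if len(fields) != 1:
            return False
        try:
            value = int(fields[0])
        except ValueError:
            return False
        if value < 0:
            return False
        values.append(value)
    if not values:
        return False
    for row, partner in enumerate(values, start=1):
        if partner == 0:
            continue
        # The partner row must exist and must point back at this row.
        if partner > len(values) or values[partner - 1] != row:
            return False
    return True
```

A real sniffer would probably inspect only the first part of a large file, in which case partners beyond the sampled region could not be verified and the check would need to be relaxed accordingly.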
Then the align.check.xml tool_config could require the 'ssmap' datatype:
<tool id="mothur_align_check" name="Align.check" version="1.19.0">
  <description>Calculate the number of potentially misaligned bases</description>
  <requirements>
    <requirement type="binary">mothur</requirement>
    <requirement type="datatype">ssmap</requirement>
  </requirements>
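A sketch of how an installer might read such a requirement follows. Note that `type="datatype"` is the syntax proposed in this message, not something Galaxy is known to support; the helper names and the `installed` set are hypothetical, and a closing `</tool>` tag is added so the fragment parses.

```python
import xml.etree.ElementTree as ET

TOOL_XML = """
<tool id="mothur_align_check" name="Align.check" version="1.19.0">
  <description>Calculate the number of potentially misaligned bases</description>
  <requirements>
    <requirement type="binary">mothur</requirement>
    <requirement type="datatype">ssmap</requirement>
  </requirements>
</tool>
"""


def required_datatypes(tool_xml):
    """Collect the text of every <requirement type="datatype"> element."""
    root = ET.fromstring(tool_xml)
    return [
        req.text.strip()
        for req in root.findall("./requirements/requirement[@type='datatype']")
        if req.text
    ]


def missing_datatypes(tool_xml, installed):
    """Datatypes the tool asks for that the (hypothetical) local registry lacks."""
    return [name for name in required_datatypes(tool_xml) if name not in installed]


needed = required_datatypes(TOOL_XML)
missing = missing_datatypes(TOOL_XML, installed={"tabular", "fastq"})
```

An installer could use the `missing` list to fetch datatype definitions before registering the tool, or to refuse installation with a clear message.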
John,
I've been following this message thread, and it seems it's gone in a direction that differs from your initial question about the possibility for Galaxy to handle automatic editing of the datatypes_conf.xml file when certain Galaxy tool shed tools are automatically installed. There are some complexities to consider in attempting this. One of the issues to consider is that the work for adding support for a new datatype to Galaxy lies outside of the intended function of the tool shed. If new support is added to the Galaxy code base, an entry for that new datatype should be manually added to the table at the same time. There may be benefits to enabling automatic changes to datatype entries that already exist in the file (e.g., adding a new converter for an existing datatype entry), but perhaps adding a completely new datatype to the file may not be appropriate. I'll continue to think about this. Please send additional thoughts and feedback, as doing so is always helpful.
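For reference, the mechanical part of such an edit is small. The sketch below assumes a simplified datatypes_conf.xml layout (a `registration` element holding `datatype` entries with `extension` and `type` attributes); the `add_datatype` helper is hypothetical, and the `SecondaryStructureMap` class path is an illustrative guess rather than a confirmed name.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified shape of a datatypes_conf.xml file.
CONF = """
<datatypes>
  <registration>
    <datatype extension="tabular" type="galaxy.datatypes.tabular:Tabular"/>
  </registration>
</datatypes>
"""


def add_datatype(conf_xml, extension, type_path):
    """Return updated XML text; the entry is added only if the extension is new."""
    root = ET.fromstring(conf_xml)
    registration = root.find("registration")
    existing = {dt.get("extension") for dt in registration.findall("datatype")}
    if extension not in existing:
        ET.SubElement(registration, "datatype", extension=extension, type=type_path)
    return ET.tostring(root, encoding="unicode")


# Hypothetical class path, used only to exercise the helper.
SSMAP_TYPE = "galaxy.datatypes.metagenomics:SecondaryStructureMap"
updated = add_datatype(CONF, "ssmap", SSMAP_TYPE)
unchanged = add_datatype(updated, "ssmap", SSMAP_TYPE)
```

The harder questions raised above are about policy rather than mechanics: whether a tool install should be allowed to add a brand-new entry at all, and how to handle two repositories claiming the same extension.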
Thanks!
Greg
On Oct 5, 2011, at 11:48 PM, Duddy, John wrote:
One of the things we're facing is the sheer size of a whole human genome at 30x coverage. An effective way to deal with that is by compressing the FASTQ files. That works for BWA and our ELAND, which can directly read a compressed FASTQ, but other tools crash when reading compressed FASTQ files. One way to address that would be to introduce a new type, for example "CompressedFastQ", with a conversion to FASTQ defined. BWA could take both types as input. This would allow the best of both worlds - efficient storage and use by all existing tools.
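The conversion half of this idea is straightforward to sketch. The example below is not a Galaxy converter; it assumes gzip compression (the message does not say which codec is used) and the common four-lines-per-record FASTQ layout, and the helper names are hypothetical.

```python
import gzip
import os
import tempfile


def open_fastq(path):
    """Open a FASTQ file for text reading, whether or not it is gzip-compressed."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_reads(path):
    """Count records, assuming four lines per FASTQ record."""
    with open_fastq(path) as handle:
        return sum(1 for _ in handle) // 4


RECORD = "@read1\nACGT\n+\nIIII\n"

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "reads.fastq")
    gz_path = os.path.join(tmp, "reads.fastq.gz")
    with open(plain_path, "wt") as out:
        out.write(RECORD * 3)
    with gzip.open(gz_path, "wt") as out:
        out.write(RECORD * 3)
    plain_count = count_reads(plain_path)
    gz_count = count_reads(gz_path)
```

A tool that can read compressed input directly would declare both types as accepted, while the framework would apply a decompressing step like `open_fastq` for tools that only understand plain FASTQ.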
Another example would be adding the CASAVA tools to Galaxy. Some of the statistics generation tools use custom file formats. To be able to make the use of those tools optional and configurable, they should be separate from the aligner, but that would require that Galaxy be made aware of the custom file formats - we'd have to add a datatype.
John Duddy Sr. Staff Software Engineer Illumina, Inc. 9885 Towne Centre Drive San Diego, CA 92121 Tel: 858-736-3584 E-mail: jduddy at illumina.com
From: Greg Von Kuster [mailto:greg at bx.psu.edu] Sent: Wednesday, October 05, 2011 6:25 PM To: Duddy, John Cc: galaxy-dev at lists.bx.psu.edu Subject: Re: [galaxy-dev] Tool shed and datatypes
Hello John,
The Galaxy tool shed currently is not enabled to automatically edit the datatypes_conf.xml file, although I could add this feature if the need exists. Can you elaborate on what you are looking to do regarding this?
Thanks!
On Oct 5, 2011, at 1:52 PM, Duddy, John wrote:
Can we introduce new file types via tools in the tool shed? It seems Galaxy can load them if they are in the datatypes configuration file. Does tool installation automate the editing of that file?
John Duddy Sr. Staff Software Engineer Illumina, Inc. 9885 Towne Centre Drive San Diego, CA 92121 Tel: 858-736-3584 E-mail: jduddy at illumina.com
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
Greg Von Kuster Galaxy Development Team greg at bx.psu.edu
Peter has the right idea here - we will add support for appropriate data types to the Galaxy distribution. Of course, the key word here is "appropriate", but any industry-standard data format should fall under this category.
On Oct 10, 2011, at 12:46 PM, Peter Cock wrote:
On Mon, Oct 10, 2011 at 5:09 PM, Duddy, John <jduddy@illumina.com> wrote:
I agree with the risks you cited.
There is a risk in the other direction that I think is even scarier - without the ability to add data types, tool authors may be forced to use a "typeless" system, declaring all inputs/outputs as "data" or "text". While this works, it has the same drawbacks as typeless programming languages - deferring error detection to runtime, impairing the ability to perform static analysis, inability to perform transparent type conversions - in other words, the tools have to take over responsibilities from the framework.
Like all interesting problems, I don't think there is an "obviously right" answer ;-}
John Duddy
Indeed. I'm going with lobbying the Galaxy team to include new datatypes when I need them (InterProScan XML is on my todo list, perhaps v4 and v5 as two types), but I've been able to get along with "tabular" as a tool output.
Peter
Greg Von Kuster Galaxy Development Team greg@bx.psu.edu
Hello Jim,
On Oct 10, 2011, at 1:01 PM, Jim Johnson wrote:
There are a number of well-defined formats that are exchanged between applications, e.g. BAM, gtf, etc. I wouldn't advocate proliferating those.
I see the need for Toolshed datatypes more for the intermediate file formats used within a suite of commands. These can be helpful in guiding a user to select appropriate inputs for successive steps in an analysis.
For example, when developing the 90 some tool wrappers for the mothur metagenomic package, there are many file formats that get passed among the mothur commands. It greatly simplifies the user's experience if the outputs are typed so as to correctly filter the acceptable inputs to another command. I fear the amount of time I would spend providing user support if the outputs and inputs were generically typed.
An approach for simplifying this is to include one or more exported Galaxy workflows in the tool shed repository along with the tools. The workflows cannot currently be automatically imported into Galaxy, but they can be manually imported, providing the user an idea of the steps in the analyses for which the tools are intended. Additional features related to Galaxy workflows included in Galaxy tool shed repositories will be available in future Galaxy releases.
I'm also seeing a similar need as I am creating tool wrappers for the GMAP/GSNAP mapping commands. While input to GSNAP and GMAP can be fastq and output in SAM format, some of the more interesting use cases involve creating additional map stores, where specific datatypes would guide the user in setting the tool parameters correctly.
JJ
James E Johnson Minnesota Supercomputing Institute, University of Minnesota
Greg Von Kuster Galaxy Development Team greg@bx.psu.edu
We've digested this topic a bit here at Galaxy Central, and agree that at some point ( maybe soon for very basic functionality ) we need to provide support for new data types in tool shed repositories. It would be very helpful ( and significantly speed up the development process ) if the community could provide at least 2 different tools that use data types not included in the Galaxy distribution ( sending me a tarball that includes all the tool dependencies, including the new data type class would be ideal ). When I get them I'll add this new feature set to my development list.
Thanks everyone for all the input on this!
Greg Von Kuster
Greg Von Kuster Galaxy Development Team greg@bx.psu.edu
Greg,
The mothur_toolsuite in the ToolShed contains a file with added datatypes for metagenomics (used by mothur and some by qiime):
    mothur_toolsuite/mothur/lib/galaxy/datatypes/metagenomics.py
The README has info on how I incorporated mothur into our local galaxy server.
I'm also working on GMAP/GSNAP ( http://research-pub.gene.com/gmap/ ). So far I've created a GmapDB class, analogous to the ngsindex.BowtieIndex class, but with more metadata. I'm also adding an IntervalIndexTree class for indexing maps of splice junctions, introns, and SNPs. I'll send you this as soon as I've got it working.
Thanks,
JJ
On 10/17/11 1:06 PM, Greg Von Kuster wrote:
We've digested this topic a bit here at Galaxy Central, and agree that at some point ( maybe soon for very basic functionality ) we need to provide support for new data types in tool shed repositories. It would be very helpful ( and significantly speed up the development process ) if the community could provide at least 2 different tools that use data types not included in the Galaxy distribution ( sending me a tarball that includes all the tool dependencies, including the new data type class would be ideal ). When I get them I'll add this new feature set to my development list.
Thanks everyone for all the input on this!
Greg Von Kuster
On Oct 7, 2011, at 2:05 PM, Jim Johnson wrote:
Greg,
It would be great if there were a way to expand upon the core datatypes using the ToolShed.
Would it be possible to have a separate datatype repository within the ToolShed?
Datatype
    name=""
    description=""
    datatype_dependencies=[]
    definition=<python code>
The tool config could be expanded to have a requirement for datatypes:
    <requirement type="datatype">ssmap</requirement>
Table datatype
   Column    |          Type          | Modifiers
-------------+------------------------+-------------------------------------------------------
 id          | integer                | not null default nextval('datatype_id_seq'::regclass)
 name        | character varying(255) |
 version     | character varying(40)  |
 description | text                   |
 definition  | text                   |
UNIQUE (name)
Table datatype_datatype_association
   Column    |  Type   | Modifiers
-------------+---------+-------------------------------------------------------
 id          | integer | not null default nextval('datatype_id_seq'::regclass)
 datatype_id | integer |
 requires_id | integer |
FOREIGN KEY (datatype_id) REFERENCES datatype(id)
FOREIGN KEY (requires_id) REFERENCES datatype(id)
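To make the dependency idea behind these two tables concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than the PostgreSQL schema shown above. The sample rows, the version strings, and the helper name `requires` are assumptions made up for illustration; nothing here is existing Tool Shed code.

```python
import sqlite3

# In-memory stand-in for the proposed tables. SQLite types replace the
# PostgreSQL ones, and the sequence default becomes an integer primary key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE datatype (
    id INTEGER PRIMARY KEY,
    name VARCHAR(255) UNIQUE,
    version VARCHAR(40),
    description TEXT,
    definition TEXT
);
CREATE TABLE datatype_datatype_association (
    id INTEGER PRIMARY KEY,
    datatype_id INTEGER REFERENCES datatype(id),
    requires_id INTEGER REFERENCES datatype(id)
);
""")

# Hypothetical sample rows: ssmap depends on tabular.
conn.execute("INSERT INTO datatype (id, name, version) VALUES (1, 'tabular', '1.0')")
conn.execute("INSERT INTO datatype (id, name, version) VALUES (2, 'ssmap', '1.0')")
conn.execute(
    "INSERT INTO datatype_datatype_association (datatype_id, requires_id) VALUES (2, 1)"
)


def requires(conn, name):
    """Return names of all datatypes that `name` depends on, directly or indirectly."""
    rows = conn.execute(
        """
        WITH RECURSIVE deps(id) AS (
            SELECT a.requires_id
              FROM datatype_datatype_association a
              JOIN datatype d ON d.id = a.datatype_id
             WHERE d.name = ?
            UNION
            SELECT a.requires_id
              FROM datatype_datatype_association a
              JOIN deps ON a.datatype_id = deps.id
        )
        SELECT name FROM datatype WHERE id IN (SELECT id FROM deps) ORDER BY name
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


print(requires(conn, "ssmap"))
```

An installer could walk the association table this way to find every datatype that must be present before a given one is loaded.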
Then for my mothur metagenomics tools I could define:
name="ssmap"
description="Secondary Structure Map"
version="1.0"
datatype_dependencies=[tabular]
definition=
    from galaxy.datatypes.tabular import Tabular

    class SecondaryStructureMap(Tabular):
        file_ext = 'ssmap'

        def __init__(self, **kwd):
            """Initialize secondary structure map datatype"""
            Tabular.__init__( self, **kwd )
            self.column_names = ['Map']

        def sniff( self, filename ):
            """
            Determines whether the file is a secondary structure map format.
            A single column with an integer value which indicates the row that this row maps to.
            Check to make sure that if structMap[10] = 380 then structMap[380] = 10.
            """
            ...
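The sniff body is elided in the message, so the following is only a sketch of the symmetry check its docstring describes, written as a standalone function rather than as a Galaxy datatype method. It assumes rows are numbered from 1 and that a value of 0 marks a row that maps to nothing; neither assumption is stated in the original, and the function name is hypothetical.

```python
def looks_like_ssmap(lines):
    """Return True if `lines` form a symmetric single-column integer map.

    Assumptions (not from the original message): rows are numbered from 1,
    and a value of 0 marks a row that maps to nothing.
    """
    struct_map = [0]  # index 0 is a placeholder so that rows are 1-based
    for line in lines:
        fields = line.split()
        # Each row must hold exactly one non-negative integer.
        if len(fields) != 1 or not fields[0].isdecimal():
            return False
        struct_map.append(int(fields[0]))
    if len(struct_map) == 1:
        return False  # no rows at all
    for row in range(1, len(struct_map)):
        target = struct_map[row]
        if target == 0:
            continue  # unmapped row (assumed convention)
        # If row maps to target, target must exist and map back to row.
        if target >= len(struct_map) or struct_map[target] != row:
            return False
    return True


print(looks_like_ssmap(["2", "1", "0"]))
print(looks_like_ssmap(["2", "3", "1"]))
```

A real sniffer would probably read only the first part of the file, in which case a target beyond the lines read could not be treated as a failure.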
Then the align.check.xml tool_config could require the 'ssmap' datatype:
    <tool id="mothur_align_check" name="Align.check" version="1.19.0">
      <description>Calculate the number of potentially misaligned bases</description>
      <requirements>
        <requirement type="binary">mothur</requirement>
        <requirement type="datatype">ssmap</requirement>
      </requirements>
John,
I've been following this message thread, and it seems it's gone in a direction that differs from your initial question about whether Galaxy could automatically edit the datatypes_conf.xml file when certain Galaxy tool shed tools are automatically installed. There are some complexities to consider in attempting this. One issue is that the work of adding support for a new datatype to Galaxy lies outside the intended function of the tool shed. If new support is added to the Galaxy code base, an entry for that new datatype should be manually added to the table at the same time. There may be benefits to enabling automatic changes to datatype entries that already exist in the file (e.g., adding a new converter for an existing datatype entry), but adding a completely new datatype to the file may not be appropriate. I'll continue to think about this. Please send additional thoughts and feedback, as doing so is always helpful.
Thanks!
Greg
On Oct 5, 2011, at 11:48 PM, Duddy, John wrote:
One of the things we’re facing is the sheer size of a whole human genome at 30x coverage. An effective way to deal with that is by compressing the FASTQ files. That works for BWA and our ELAND, which can directly read a compressed FASTQ, but other tools crash when reading compressed FASTQ files. One way to address that would be to introduce a new type, for example “CompressedFastQ”, with a conversion to FASTQ defined. BWA could take both types as input. This would allow the best of both worlds – efficient storage and use by all existing tools.
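As a rough illustration of accepting both forms, the sketch below opens a FASTQ file whether or not it is gzip-compressed, by checking the two-byte gzip magic number. It is not a Galaxy datatype or converter; the function names are hypothetical, and gzip is assumed as the compression format since the message does not name one.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_fastq(path):
    """Open a FASTQ file for text reading, compressed (gzip) or not."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path):
    """Count records, assuming the common four-lines-per-record FASTQ layout."""
    with open_fastq(path) as handle:
        return sum(1 for _ in handle) // 4
```

A tool wrapper built this way would not care which of the two types it was handed, while tools that cannot read compressed input would rely on the conversion to plain FASTQ.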
Another example would be adding the CASAVA tools to Galaxy. Some of the statistics generation tools use custom file formats. To be able to make the use of those tools optional and configurable, they should be separate from the aligner, but that would require that Galaxy be made aware of the custom file formats – we’d have to add a datatype.
John Duddy Sr. Staff Software Engineer Illumina, Inc. 9885 Towne Centre Drive San Diego, CA 92121 Tel: 858-736-3584 E-mail: jduddy at illumina.com
From: Greg Von Kuster [mailto:greg at bx.psu.edu]
Sent: Wednesday, October 05, 2011 6:25 PM
To: Duddy, John
Cc: galaxy-dev at lists.bx.psu.edu
Subject: Re: [galaxy-dev] Tool shed and datatypes
Hello John,
The Galaxy tool shed currently is not enabled to automatically edit the datatypes_conf.xml file, although I could add this feature if the need exists. Can you elaborate on what you are looking to do regarding this?
Thanks!
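For context, automatic editing of datatypes_conf.xml could look something like the sketch below. It assumes the file has a <registration> section holding <datatype extension="..." type="module:Class"/> entries; the function name and the example type string are hypothetical, and this is not an existing Galaxy or tool shed feature.

```python
import xml.etree.ElementTree as ET


def register_datatype(conf_path, extension, type_spec):
    """Add a <datatype> entry to the <registration> section if it is missing.

    Returns True if the file was changed, False if the extension was
    already registered.
    """
    tree = ET.parse(conf_path)
    registration = tree.getroot().find("registration")
    if registration is None:
        raise ValueError("no <registration> element in %s" % conf_path)
    for elem in registration.findall("datatype"):
        if elem.get("extension") == extension:
            return False  # leave existing entries untouched
    ET.SubElement(registration, "datatype", extension=extension, type=type_spec)
    tree.write(conf_path)
    return True
```

It might be called as register_datatype("datatypes_conf.xml", "ssmap", "galaxy.datatypes.metagenomics:SecondaryStructureMap"), where the module path is a guess. Note that ElementTree's default parser discards XML comments, so a real implementation would need more care to preserve a hand-edited file.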
On Oct 5, 2011, at 1:52 PM, Duddy, John wrote:
Can we introduce new file types via tools in the tool shed? It seems Galaxy can load them if they are in the datatypes configuration file. Does tool installation automate the editing of that file?
John Duddy Sr. Staff Software Engineer Illumina, Inc. 9885 Towne Centre Drive San Diego, CA 92121 Tel: 858-736-3584 E-mail: jduddy at illumina.com
Jim,
Sounds great - this will be very helpful!
Greg
On Oct 18, 2011, at 11:03 AM, Jim Johnson wrote:
Greg,
The mothur_toolsuite in the ToolShed contains a file with added datatypes for metagenomics (used by mothur and some by qiime):
    mothur_toolsuite/mothur/lib/galaxy/datatypes/metagenomics.py
The README has info on how I incorporated mothur into our local Galaxy server.
I'm also working on GMAP/GSNAP ( http://research-pub.gene.com/gmap/ ). So far I've created a GmapDB class, analogous to the ngsindex.BowtieIndex class, but with more metadata. I'm also adding an IntervalIndexTree class for indexing maps of splice junctions, introns, and SNPs. I'll send you this as soon as I've got it working.
Thanks,
JJ
Greg,
I put the gmap tool suite in the Galaxy Tool Shed; let me know if there is more I should do.
It has 5 galaxy tools:
    GMAP - Genomic Mapping and Alignment Program for mRNA and EST sequences
    GSNAP - Genomic Short-read Nucleotide Alignment Program
    GMAP Build - a database genome index for GMAP and GSNAP ( calls: gmap_build, iit_store, snpindex, cmetindex, atoiindex )
    GMAP SNP Index - build index files for known SNPs (calls: iit_store, snpindex)
    GMAP IIT - Create a map store for known genes or SNPs (calls: iit_store)
It uses these added datatypes:
    % grep -E '(^class | file_ext)' lib/galaxy/datatypes/gmap.py
    class GmapDB( Text ):
        file_ext = 'gmapdb'
    class GmapSnpIndex( Text ):
        file_ext = 'gmapsnpindex'
    class IntervalIndexTree( Text ):
        file_ext = 'iit'
    class SpliceSitesIntervalIndexTree( IntervalIndexTree ):
        file_ext = 'splicesites.iit'
    class IntronsIntervalIndexTree( IntervalIndexTree ):
        file_ext = 'introns.iit'
    class SNPsIntervalIndexTree( IntervalIndexTree ):
        file_ext = 'snps.iit'
    class IntervalAnnotation( Text ):
        file_ext = 'gmap_annotation'
    class SpliceSiteAnnotation(IntervalAnnotation):
        file_ext = 'gmap_splicesites'
    class IntronAnnotation(IntervalAnnotation):
        file_ext = 'gmap_introns'
    class SNPAnnotation(IntervalAnnotation):
        file_ext = 'gmap_snps'
I added a requirement tag for the datatypes to the tool-configs:
    % grep 'requirement.*datatype' *.xml
    gmap_build.xml: <requirement type="datatype">gmapdb</requirement>
    gmap_build.xml: <requirement type="datatype">gmap_snps</requirement>
    gmap.xml: <requirement type="datatype">gmapdb</requirement>
    gmap.xml: <requirement type="datatype">gmap_annotation</requirement>
    gmap.xml: <requirement type="datatype">gmap_splicesites</requirement>
    gmap.xml: <requirement type="datatype">gmap_introns</requirement>
    gmap.xml: <requirement type="datatype">gmap_snps</requirement>
    gsnap.xml: <requirement type="datatype">gmapdb</requirement>
    gsnap.xml: <requirement type="datatype">gmapsnpindex</requirement>
    gsnap.xml: <requirement type="datatype">splicesites.iit</requirement>
    gsnap.xml: <requirement type="datatype">introns.iit</requirement>
    iit_store.xml: <requirement type="datatype">gmap_annotation</requirement>
    iit_store.xml: <requirement type="datatype">gmap_snps</requirement>
    iit_store.xml: <requirement type="datatype">iit</requirement>
    iit_store.xml: <requirement type="datatype">splicesites.iit</requirement>
    iit_store.xml: <requirement type="datatype">introns.iit</requirement>
    iit_store.xml: <requirement type="datatype">snps.iit</requirement>
    snpindex.xml: <requirement type="datatype">gmapsnpindex</requirement>
    snpindex.xml: <requirement type="datatype">gmapdb</requirement>
    snpindex.xml: <requirement type="datatype">gmap_snps</requirement>
    snpindex.xml: <requirement type="datatype">snps.iit</requirement>
Thanks,
JJ
On 10/18/11 10:18 AM, Greg Von Kuster wrote:
Jim,
Sounds great - this will be very helpful!
Greg
Excerpts from Jim Johnson's message of 2011-10-21 17:13:02 +0000:
> I put the gmap tool suite in the galaxy Tool Shed, let me know if there is more I should do.
Awesome!
> I added a requirement tag for the datatypes to the tool-configs:
> % grep 'requirement.*datatype' *.xml
> gmap_build.xml: <requirement type="datatype">gmapdb</requirement>
Requirement tags for datatypes are an interesting idea, but I'm wondering if this is something we should require? It seems like all this information is implicit -- a tool requires a datatype if it has an input or output parameter that references that type. Is there other information that should go in the requirement tag?
Thanks Jim,
This is on my development plan, but it may take a few days for me to get heavily into it. I'll get back to you hopefully some time next week.
Greg
On Oct 21, 2011, at 1:13 PM, Jim Johnson wrote:
Greg,
I put the gmap tool suite in the galaxy Tool Shed, let me know if there is more I should do.
It has 5 galaxy tools: GMAP - Genomic Mapping and Alignment Program for mRNA and EST sequences GSNAP - Genomic Short-read Nucleotide Alignment Program GMAP Build - a database genome index for GMAP and GSNAP ( calls: gmap_build, iit_store, snpindex, cmetindex, atoiindex ) GMAP SNP Index - build index files for known SNPs (calls: iit_store, snpindex) GMAP IIT - Create a map store for known genes or SNPs (calls: iit_store)
It uses these added datatypes: % grep -E '(^class | file_ext)' lib/galaxy/datatypes/gmap.py class GmapDB( Text ): file_ext = 'gmapdb' class GmapSnpIndex( Text ): file_ext = 'gmapsnpindex' class IntervalIndexTree( Text ): file_ext = 'iit' class SpliceSitesIntervalIndexTree( IntervalIndexTree ): file_ext = 'splicesites.iit' class IntronsIntervalIndexTree( IntervalIndexTree ): file_ext = 'introns.iit' class SNPsIntervalIndexTree( IntervalIndexTree ): file_ext = 'snps.iit' class IntervalAnnotation( Text ): file_ext = 'gmap_annotation' class SpliceSiteAnnotation(IntervalAnnotation): file_ext = 'gmap_splicesites' class IntronAnnotation(IntervalAnnotation): file_ext = 'gmap_introns' class SNPAnnotation(IntervalAnnotation): file_ext = 'gmap_snps'
I added a requirement tag for the datatypes to the tool-configs: % grep 'requirement.*datatype' *.xml gmap_build.xml: <requirement type="datatype">gmapdb</requirement> gmap_build.xml: <requirement type="datatype">gmap_snps</requirement> gmap.xml: <requirement type="datatype">gmapdb</requirement> gmap.xml: <requirement type="datatype">gmap_annotation</requirement> gmap.xml: <requirement type="datatype">gmap_splicesites</requirement> gmap.xml: <requirement type="datatype">gmap_introns</requirement> gmap.xml: <requirement type="datatype">gmap_snps</requirement> gsnap.xml: <requirement type="datatype">gmapdb</requirement> gsnap.xml: <requirement type="datatype">gmapsnpindex</requirement> gsnap.xml: <requirement type="datatype">splicesites.iit</requirement> gsnap.xml: <requirement type="datatype">introns.iit</requirement> iit_store.xml: <requirement type="datatype">gmap_annotation</requirement> iit_store.xml: <requirement type="datatype">gmap_snps</requirement> iit_store.xml: <requirement type="datatype">iit</requirement> iit_store.xml: <requirement type="datatype">splicesites.iit</requirement> iit_store.xml: <requirement type="datatype">introns.iit</requirement> iit_store.xml: <requirement type="datatype">snps.iit</requirement> snpindex.xml: <requirement type="datatype">gmapsnpindex</requirement> snpindex.xml: <requirement type="datatype">gmapdb</requirement> snpindex.xml: <requirement type="datatype">gmap_snps</requirement> snpindex.xml: <requirement type="datatype">snps.iit</requirement>
Thanks,
JJ
On 10/18/11 10:18 AM, Greg Von Kuster wrote:
Jim,
Sounds great - this will be very helpful!
Greg
On Oct 18, 2011, at 11:03 AM, Jim Johnson wrote:
Greg,
The mothur_toolsuite in the ToolShed contains a file with added datatypes for metagenomics (used by mothur and some by qiime): mothur_toolsuite/mothur/lib/galaxy/datatypes/metagenomics.py The README has info on how I incorporated mothur into our local galaxy server.
I'm also working on GMAP/GSNAP ( http://research-pub.gene.com/gmap/ ) So far I've created a GmapDB class, analogous to the ngsindex.BowtieIndex class, but with more metadata. I'm also adding a IntervalIndexTree class for indexing maps of splice junctions, introns, and SNPs. I'll send you this as soon as I've got it working.
Thanks,
JJ
On 10/17/11 1:06 PM, Greg Von Kuster wrote:
We've digested this topic a bit here at Galaxy Central, and agree that at some point ( maybe soon for very basic functionality ) we need to provide support for new data types in tool shed repositories. It would be very helpful ( and significantly speed up the development process ) if the community could provide at least 2 different tools that use data types not included in the Galaxy distribution ( sending me a tarball that includes all the tool dependencies, including the new data type class would be ideal ). When I get them I'll add this new feature set to my development list.
Thanks everyone for all the input on this!
Greg Von Kuster
On Oct 7, 2011, at 2:05 PM, Jim Johnson wrote:
Greg,
It would be great if there were a way to expand upon the core datatypes using the ToolShed.
Would it be possible to have a separate datatype repository within the ToolShed?
Datatype name="" description="" datatype_dependencies=[] definition=<python code>
The tool config could be expanded to have requirement for datatypes. <requirement type="datatype">ssmap</requirement>
Table datatype Column | Type | Modifiers -------------+-----------------------------+--------------------------------------------------- id | integer | not null default nextval('datatype_id_seq'::regclass) name | character varying(255) | version | character varying(40) | description | text | definition | text | UNIQUE (name)
Table datatype_datatype_association Column | Type | Modifiers -------------+-----------------------------+--------------------------------------------------- id | integer | not null default nextval('datatype_id_seq'::regclass) datatype_id | integer | requires_id | integer | FOREIGN KEY (datatype_id) REFERENCES datatype(id) FOREIGN KEY (requires_id) REFERENCES datatype(id)
Then for my mothur metagenomics tools I could define:
name="ssmap" description="Secondary Structure Map" version="1.0" datatype_dependencies=[tabular] definition= from galaxy.datatypes.tabular import Tabular class SecondaryStructureMap(Tabular): file_ext = 'ssmap' def __init__(self, **kwd): """Initialize secondary structure map datatype""" Tabular.__init__( self, **kwd ) self.column_names = ['Map']
def sniff( self, filename ): """ Determines whether the file is a secondary structure map format A single column with an integer value which indicates the row that this row maps to. check you make sure is structMap[10] = 380 then structMap[380] = 10. """ ...
Then the align.check.xml tool_config could require the 'ssmap' datatype:
<tool id="mothur_align_check" name="Align.check" version="1.19.0"> <description>Calculate the number of potentially misaligned bases</description> <requirements> <requirement type="binary">mothur</requirement> <requirement type="datatype">ssmap</requirement> </requirements>
John,
I've been following this message thread, and it seems it's gone in a direction that differs from your initial question about the possibility for Galaxy to handle automatic editing of the datatypes_conf.xml file when certain Galaxy tool shed tools are automatically installed. There are some complexities to consider in attempting this. One of the issues to consider is that the work for adding support for a new datatype to Galaxy lies outside of the intended function of the tool shed. If new support is added to the Galaxy code base, an entry for that new datatype should be manually added to the table at the same time. There may be benefits to enabling automatic changes to datatype entries that already exist in the file (e.g., adding a new converter for an existing datatype entry), but perhaps adding a completely new datatype to the file may not be appropriate. I'll continue to think about this - send additional thought and feedback, as doing so is always helpful
Thanks!
Greg
On Oct 5, 2011, at 11:48 PM, Duddy, John wrote:
> One of the things we’re facing is the sheer size of a whole human genome at 30x coverage. An effective way to deal with that is by compressing the FASTQ files. That works for BWA and our ELAND, which can directly read a compressed FASTQ, but other tools crash when reading compressed FASTQ files. One way to address that would be to introduce a new type, for example “CompressedFastQ”, with a conversion to FASTQ defined. BWA could take both types as input. This would allow the best of both worlds – efficient storage and use by all existing tools.
>
> Another example would be adding the CASAVA tools to Galaxy. Some of the statistics generation tools use custom file formats. To be able to make the use of those tools optional and configurable, they should be separate from the aligner, but that would require that Galaxy be made aware of the custom file formats – we’d have to add a datatype.
>
> John Duddy
> Sr. Staff Software Engineer
> Illumina, Inc.
> 9885 Towne Centre Drive
> San Diego, CA 92121
> Tel: 858-736-3584
> E-mail: jduddy at illumina.com
>
> From: Greg Von Kuster [mailto:greg at bx.psu.edu]
> Sent: Wednesday, October 05, 2011 6:25 PM
> To: Duddy, John
> Cc: galaxy-dev at lists.bx.psu.edu
> Subject: Re: [galaxy-dev] Tool shed and datatypes
>
> Hello John,
>
> The Galaxy tool shed currently is not enabled to automatically edit the datatypes_conf.xml file, although I could add this feature if the need exists. Can you elaborate on what you are looking to do regarding this?
>
> Thanks!
>
> On Oct 5, 2011, at 1:52 PM, Duddy, John wrote:
>
> Can we introduce new file types via tools in the tool shed? It seems Galaxy can load them if they are in the datatypes configuration file. Does tool installation automate the editing of that file?
>
> John Duddy
> Sr. Staff Software Engineer
> Illumina, Inc.
> 9885 Towne Centre Drive
> San Diego, CA 92121
> Tel: 858-736-3584
> E-mail: jduddy at illumina.com
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client. To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>
> http://lists.bx.psu.edu/
>
> Greg Von Kuster
> Galaxy Development Team
> greg at bx.psu.edu
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
Greg Von Kuster Galaxy Development Team greg@bx.psu.edu
On 10/21/11 12:29 PM, James Taylor wrote:
Excerpts from Jim Johnson's message of 2011-10-21 17:13:02 +0000:
I put the gmap tool suite in the galaxy Tool Shed, let me know if there is more I should do.
Awesome!
I added a requirement tag for the datatypes to the tool-configs:
% grep 'requirement.*datatype' *.xml
gmap_build.xml:<requirement type="datatype">gmapdb</requirement>
Requirement tags for datatypes are an interesting idea, but I'm wondering if this is something we should require? It seems like all this information is implicit -- a tool requires a datatype if it has an input or output parameter that references that type. Is there other information that should go in the requirement tag?
It is certainly correct that the tag would be redundant; the tool config parser could identify the list of datatype formats.
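As a rough illustration of how a parser could recover that list, here is a minimal sketch. It assumes the datatypes a tool uses appear as comma-separated values in the format attribute of its param and data elements; the tool XML below and the function name formats_used_by_tool are made up for the example and are not Galaxy's actual parser.

```python
import xml.etree.ElementTree as ET


def formats_used_by_tool(tool_xml):
    """Collect the datatype extensions named in format attributes.

    Assumption: datatypes are referenced only through the "format"
    attribute of <param> (inputs) and <data> (outputs) elements.
    """
    root = ET.fromstring(tool_xml)
    formats = set()
    for element in root.iter():
        if element.tag in ("param", "data"):
            value = element.get("format")
            if value:
                # A format attribute may list several extensions.
                formats.update(
                    part.strip() for part in value.split(",") if part.strip()
                )
    return sorted(formats)


# Hypothetical tool config, invented for this example.
EXAMPLE_TOOL = """
<tool id="example_gmap_build" name="Example">
  <inputs>
    <param name="refs" type="data" format="fasta"/>
    <param name="snps" type="data" format="gmap_snps"/>
  </inputs>
  <outputs>
    <data name="output" format="gmapdb"/>
  </outputs>
</tool>
"""

print(formats_used_by_tool(EXAMPLE_TOOL))
```

Comparing the result against the extensions registered in the central distribution would give the set of extra datatypes a tool depends on, without any requirement tag.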
I was just trying to think of some way to indicate that additional datatypes were required above those in the central distribution. My goal would be to have the installation of tools from the Tool Shed also be able to install the extra datatypes that those tools require.
Having datatypes specified separately in the Tool Shed from tools would hopefully promote less redundancy of datatypes and better interoperability among developers' tools. For example, the metagenomics applications mothur and qiime have many specific formats that are internal to their tools, but also a few that might be used to migrate data between those applications. We'd need a way to avoid name clashes, perhaps adopting a namespace pattern for the file_ext attribute.
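To make the name-clash concern concrete, here is a small sketch that detects extensions claimed by more than one repository and shows one possible namespacing scheme. The repository names, the extension lists, and the "repository.extension" convention are all hypothetical; nothing here reflects an existing Tool Shed feature.

```python
from collections import defaultdict


def find_extension_clashes(repositories):
    """Map each extension declared by more than one repository to its owners."""
    owners = defaultdict(set)
    for repo_name, extensions in repositories.items():
        for extension in extensions:
            owners[extension].add(repo_name)
    return {
        extension: sorted(repos)
        for extension, repos in owners.items()
        if len(repos) > 1
    }


def namespaced(repo_name, extension):
    """One possible (hypothetical) namespace pattern for file_ext."""
    return "%s.%s" % (repo_name, extension)


# Made-up repositories and extensions, for illustration only.
EXAMPLE_REPOSITORIES = {
    "suite_a": ["otu", "dist", "names"],
    "suite_b": ["otu", "map"],
}

clashes = find_extension_clashes(EXAMPLE_REPOSITORIES)
print(clashes)
print([namespaced(repo, "otu") for repo in clashes.get("otu", [])])
```

A prefix like this keeps the clash visible in the extension itself, at the cost of making shared interchange formats harder to recognize across suites.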
Hello Jim,
I've implemented support for proprietary datatypes that use class modules included in tool shed repositories. To see how this works, you'll need at least change set revision 6479:4d131422777f, which is currently available only from our central repo at https://bitbucket.org/galaxy/galaxy-central.
I've documented the way this works in the following 2 sections of the tool shed wiki. In the second section, I've taken the liberty of using your gmap tool repository as an example. I hope you don't mind. I've written the document section assuming that your gmap repository includes the 2 changes I've described below.
http://wiki.g2.bx.psu.edu/Tool%20Shed#Including_proprietary_data_types_that_... http://wiki.g2.bx.psu.edu/Tool%20Shed#Including_proprietary_data_types_that_...
There are 2 categories of datatypes that are currently supported:
1. data types that subclass from the datatype classes included in the Galaxy distribution - these require no code files that define proprietary datatype classes to be included in the tool shed repository, and are documented in the first wiki section listed above.
2. datatypes that use proprietary classes defined in code files included in the tool shed repository - documented in the second wiki section listed above. Your gmap tool suite falls into this category.
If you make the following changes to your gmap tool suite, your proprietary data types will automatically load into a local Galaxy instance when the Galaxy admin installs your tool suite to that instance. The data types will be loaded at the time of installation as well as whenever the Galaxy server is stopped / restarted. I'll send you a separate message detailing the changes you'll need to make to your mothur tool suite.
CHANGE 1
----------------
Add a file named datatypes_conf.xml to your repository. This is the approach I'm using to support proprietary datatypes included in tool shed repositories instead of your proposed addition of datatypes in the tool config's <requirements> tag set. The datatypes_conf.xml file can be located anywhere in the repository, but the obvious location for your gmap repository is your ~/tool-data directory.
This file should contain the following datatype definitions.
<?xml version="1.0"?>
<datatypes>
    <datatype_files>
        <datatype_file name="gmap.py"/>
    </datatype_files>
    <registration>
        <datatype extension="gmapdb" type="galaxy.datatypes.gmap:GmapDB" display_in_upload="False"/>
        <datatype extension="gmapsnpindex" type="galaxy.datatypes.gmap:GmapSnpIndex" display_in_upload="False"/>
        <datatype extension="iit" type="galaxy.datatypes.gmap:IntervalIndexTree" display_in_upload="True"/>
        <datatype extension="splicesites.iit" type="galaxy.datatypes.gmap:SpliceSitesIntervalIndexTree" display_in_upload="True"/>
        <datatype extension="introns.iit" type="galaxy.datatypes.gmap:IntronsIntervalIndexTree" display_in_upload="True"/>
        <datatype extension="snps.iit" type="galaxy.datatypes.gmap:SNPsIntervalIndexTree" display_in_upload="True"/>
        <datatype extension="gmap_annotation" type="galaxy.datatypes.gmap:IntervalAnnotation" display_in_upload="False"/>
        <datatype extension="gmap_splicesites" type="galaxy.datatypes.gmap:SpliceSiteAnnotation" display_in_upload="True"/>
        <datatype extension="gmap_introns" type="galaxy.datatypes.gmap:IntronAnnotation" display_in_upload="True"/>
        <datatype extension="gmap_snps" type="galaxy.datatypes.gmap:SNPAnnotation" display_in_upload="True"/>
    </registration>
    <sniffers>
        <sniffer type="galaxy.datatypes.gmap:IntervalAnnotation"/>
        <sniffer type="galaxy.datatypes.gmap:SpliceSiteAnnotation"/>
        <sniffer type="galaxy.datatypes.gmap:IntronAnnotation"/>
        <sniffer type="galaxy.datatypes.gmap:SNPAnnotation"/>
    </sniffers>
</datatypes>
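For anyone who wants to sanity-check such a file before uploading it, here is a minimal sketch that reads the registration section and prints the extension-to-class mapping. It relies only on the element and attribute names shown in the file above; the function name registered_datatypes and the shortened example config are assumptions for illustration and not the code Galaxy uses to load the file.

```python
import xml.etree.ElementTree as ET


def registered_datatypes(conf_xml):
    """Return {extension: (module path, class name)} from a datatypes config."""
    root = ET.fromstring(conf_xml)
    mapping = {}
    for datatype in root.findall("./registration/datatype"):
        # The type attribute has the form "module.path:ClassName".
        module_path, _, class_name = datatype.get("type", "").partition(":")
        mapping[datatype.get("extension")] = (module_path, class_name)
    return mapping


# Shortened copy of the gmap configuration, kept small for the example.
EXAMPLE_CONF = """<datatypes>
    <datatype_files>
        <datatype_file name="gmap.py"/>
    </datatype_files>
    <registration>
        <datatype extension="gmapdb" type="galaxy.datatypes.gmap:GmapDB" display_in_upload="False"/>
        <datatype extension="iit" type="galaxy.datatypes.gmap:IntervalIndexTree" display_in_upload="True"/>
    </registration>
</datatypes>"""

for extension, (module_path, class_name) in sorted(registered_datatypes(EXAMPLE_CONF).items()):
    print(extension, module_path, class_name)
```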
I noticed that your README in your current gmap repository on the main Galaxy tool shed includes the following datatype definitions, but they refer to classes that are not included in your repository so I've eliminated them from the above datatypes_conf.xml file. You may need to add the classes to your current gmap.py datatypes class file and add them to the above datatypes_conf.xml file if your tools actually require them.
<datatype extension="tally.iit" type="galaxy.datatypes.gmap:TallyIntervalIndexTree" display_in_upload="True"/>
<datatype extension="gsnap_tally" type="galaxy.datatypes.gmap:TallyAnnotation" display_in_upload="True"/>
<datatype extension="gsnap" type="galaxy.datatypes.gmap:GsnapResult" display_in_upload="True"/>
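A check like the one described above (config entries that point at classes missing from the module) can be sketched as follows. This is only an illustration: the function name missing_classes, the tiny module source, and the two-entry config are invented for the example, and the check looks at class definitions in one module's source text rather than importing anything.

```python
import ast
import xml.etree.ElementTree as ET


def missing_classes(conf_xml, module_source, module_name):
    """List classes referenced for module_name that its source does not define."""
    defined = {
        node.name
        for node in ast.walk(ast.parse(module_source))
        if isinstance(node, ast.ClassDef)
    }
    missing = []
    for element in ET.fromstring(conf_xml).iter():
        if element.tag not in ("datatype", "sniffer"):
            continue
        referenced_module, _, class_name = (element.get("type") or "").partition(":")
        if (
            referenced_module == module_name
            and class_name not in defined
            and class_name not in missing
        ):
            missing.append(class_name)
    return missing


# Made-up stand-in for a datatypes module; only the class names matter.
EXAMPLE_MODULE = """
class GmapDB(Text):
    file_ext = 'gmapdb'
"""

EXAMPLE_CONF = """<datatypes>
    <registration>
        <datatype extension="gmapdb" type="galaxy.datatypes.gmap:GmapDB"/>
        <datatype extension="tally.iit" type="galaxy.datatypes.gmap:TallyIntervalIndexTree"/>
    </registration>
</datatypes>"""

print(missing_classes(EXAMPLE_CONF, EXAMPLE_MODULE, "galaxy.datatypes.gmap"))
```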
CHANGE 2
----------------
Modules that include proprietary datatype class definitions cannot use relative import references for imported modules. Imports must be defined as absolute from the galaxy subdirectory inside the Galaxy root's lib subdirectory. So for your ~/lib/galaxy/datatypes/gmap.py datatypes module in your gmap repository, the following changes are necessary.
Your current imports look like this:
import logging
import os,os.path,re
import data
from data import Text
from galaxy import util
from metadata import MetadataElement
But they need to be changed to this - note the elimination of relative imports:
import logging
import os,os.path,re
import galaxy.datatypes.data
from galaxy.datatypes.data import Text
from galaxy import util
from galaxy.datatypes.metadata import MetadataElement
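To find such imports in a larger module, a small script along these lines may help. It is a sketch under one assumption: the caller supplies the names of the modules that sit next to the datatypes module in galaxy.datatypes (here data, metadata and sniff, taken from the imports discussed in this thread). The function name implicit_relative_imports is made up.

```python
import ast


def implicit_relative_imports(source, sibling_modules):
    """Return (line number, module) pairs for imports that name a sibling
    module directly instead of its absolute galaxy.datatypes path."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            # Only the first dotted component decides where the lookup starts.
            if name.split(".")[0] in sibling_modules:
                hits.append((node.lineno, name))
    return sorted(hits)


# The import block quoted above, one statement per line.
EXAMPLE_SOURCE = "\n".join([
    "import logging",
    "import os,os.path,re",
    "import data",
    "from data import Text",
    "from galaxy import util",
    "from metadata import MetadataElement",
])

# Assumed sibling modules of a datatypes module.
SIBLINGS = {"data", "metadata", "sniff"}

for line_number, module in implicit_relative_imports(EXAMPLE_SOURCE, SIBLINGS):
    print("line %d: import of %r should be galaxy.datatypes.%s" % (line_number, module, module))
```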
Thanks very much for helping out with this, and please let me know if you bump into any problems.
Greg Von Kuster
On Oct 21, 2011, at 1:13 PM, Jim Johnson wrote:
Greg,
I put the gmap tool suite in the galaxy Tool Shed, let me know if there is more I should do.
It has 5 galaxy tools:
GMAP - Genomic Mapping and Alignment Program for mRNA and EST sequences
GSNAP - Genomic Short-read Nucleotide Alignment Program
GMAP Build - a database genome index for GMAP and GSNAP ( calls: gmap_build, iit_store, snpindex, cmetindex, atoiindex )
GMAP SNP Index - build index files for known SNPs (calls: iit_store, snpindex)
GMAP IIT - Create a map store for known genes or SNPs (calls: iit_store)
It uses these added datatypes:
% grep -E '(^class | file_ext)' lib/galaxy/datatypes/gmap.py
class GmapDB( Text ):
    file_ext = 'gmapdb'
class GmapSnpIndex( Text ):
    file_ext = 'gmapsnpindex'
class IntervalIndexTree( Text ):
    file_ext = 'iit'
class SpliceSitesIntervalIndexTree( IntervalIndexTree ):
    file_ext = 'splicesites.iit'
class IntronsIntervalIndexTree( IntervalIndexTree ):
    file_ext = 'introns.iit'
class SNPsIntervalIndexTree( IntervalIndexTree ):
    file_ext = 'snps.iit'
class IntervalAnnotation( Text ):
    file_ext = 'gmap_annotation'
class SpliceSiteAnnotation(IntervalAnnotation):
    file_ext = 'gmap_splicesites'
class IntronAnnotation(IntervalAnnotation):
    file_ext = 'gmap_introns'
class SNPAnnotation(IntervalAnnotation):
    file_ext = 'gmap_snps'
I added a requirement tag for the datatypes to the tool-configs:
% grep 'requirement.*datatype' *.xml
gmap_build.xml: <requirement type="datatype">gmapdb</requirement>
gmap_build.xml: <requirement type="datatype">gmap_snps</requirement>
gmap.xml: <requirement type="datatype">gmapdb</requirement>
gmap.xml: <requirement type="datatype">gmap_annotation</requirement>
gmap.xml: <requirement type="datatype">gmap_splicesites</requirement>
gmap.xml: <requirement type="datatype">gmap_introns</requirement>
gmap.xml: <requirement type="datatype">gmap_snps</requirement>
gsnap.xml: <requirement type="datatype">gmapdb</requirement>
gsnap.xml: <requirement type="datatype">gmapsnpindex</requirement>
gsnap.xml: <requirement type="datatype">splicesites.iit</requirement>
gsnap.xml: <requirement type="datatype">introns.iit</requirement>
iit_store.xml: <requirement type="datatype">gmap_annotation</requirement>
iit_store.xml: <requirement type="datatype">gmap_snps</requirement>
iit_store.xml: <requirement type="datatype">iit</requirement>
iit_store.xml: <requirement type="datatype">splicesites.iit</requirement>
iit_store.xml: <requirement type="datatype">introns.iit</requirement>
iit_store.xml: <requirement type="datatype">snps.iit</requirement>
snpindex.xml: <requirement type="datatype">gmapsnpindex</requirement>
snpindex.xml: <requirement type="datatype">gmapdb</requirement>
snpindex.xml: <requirement type="datatype">gmap_snps</requirement>
snpindex.xml: <requirement type="datatype">snps.iit</requirement>
Thanks,
JJ
Hi Jim,
Here are the changes you'll need to make to your mothur tool suite.
CHANGE 1
----------------
Add the following datatypes_conf.xml file to your repository.
<?xml version="1.0"?>
<datatypes>
    <datatype_files>
        <datatype_file name="metagenomics.py"/>
    </datatype_files>
    <registration>
        <datatype extension="otu" type="galaxy.datatypes.metagenomics:Otu" display_in_upload="true"/>
        <datatype extension="list" type="galaxy.datatypes.metagenomics:OtuList" display_in_upload="true"/>
        <datatype extension="sabund" type="galaxy.datatypes.metagenomics:Sabund" display_in_upload="true"/>
        <datatype extension="rabund" type="galaxy.datatypes.metagenomics:Rabund" display_in_upload="true"/>
        <datatype extension="shared" type="galaxy.datatypes.metagenomics:SharedRabund" display_in_upload="true"/>
        <datatype extension="relabund" type="galaxy.datatypes.metagenomics:RelAbund" display_in_upload="true"/>
        <datatype extension="names" type="galaxy.datatypes.metagenomics:Names" display_in_upload="true"/>
        <datatype extension="design" type="galaxy.datatypes.metagenomics:Design" display_in_upload="true"/>
        <datatype extension="summary" type="galaxy.datatypes.metagenomics:Summary" display_in_upload="true"/>
        <datatype extension="groups" type="galaxy.datatypes.metagenomics:Group" display_in_upload="true"/>
        <datatype extension="oligos" type="galaxy.datatypes.metagenomics:Oligos" display_in_upload="true"/>
        <datatype extension="align" type="galaxy.datatypes.metagenomics:SequenceAlignment" display_in_upload="true"/>
        <datatype extension="accnos" type="galaxy.datatypes.metagenomics:AccNos" display_in_upload="true"/>
        <datatype extension="map" type="galaxy.datatypes.metagenomics:SecondaryStructureMap" display_in_upload="true"/>
        <datatype extension="align.check" type="galaxy.datatypes.metagenomics:AlignCheck" display_in_upload="true"/>
        <datatype extension="align.report" type="galaxy.datatypes.metagenomics:AlignReport" display_in_upload="true"/>
        <datatype extension="filter" type="galaxy.datatypes.metagenomics:LaneMask" display_in_upload="true"/>
        <datatype extension="dist" type="galaxy.datatypes.metagenomics:DistanceMatrix" display_in_upload="true"/>
        <datatype extension="pair.dist" type="galaxy.datatypes.metagenomics:PairwiseDistanceMatrix" display_in_upload="true"/>
        <datatype extension="square.dist" type="galaxy.datatypes.metagenomics:SquareDistanceMatrix" display_in_upload="true"/>
        <datatype extension="lower.dist" type="galaxy.datatypes.metagenomics:LowerTriangleDistanceMatrix" display_in_upload="true"/>
        <datatype extension="ref.taxonomy" type="galaxy.datatypes.metagenomics:RefTaxonomy" display_in_upload="true">
            <converter file="ref_to_seq_taxonomy_converter.xml" target_datatype="seq.taxonomy"/>
        </datatype>
        <datatype extension="seq.taxonomy" type="galaxy.datatypes.metagenomics:SequenceTaxonomy" display_in_upload="true"/>
        <datatype extension="rdp.taxonomy" type="galaxy.datatypes.metagenomics:RDPSequenceTaxonomy" display_in_upload="true"/>
        <datatype extension="cons.taxonomy" type="galaxy.datatypes.metagenomics:ConsensusTaxonomy" display_in_upload="true"/>
        <datatype extension="tax.summary" type="galaxy.datatypes.metagenomics:TaxonomySummary" display_in_upload="true"/>
        <datatype extension="freq" type="galaxy.datatypes.metagenomics:Frequency" display_in_upload="true"/>
        <datatype extension="quan" type="galaxy.datatypes.metagenomics:Quantile" display_in_upload="true"/>
        <datatype extension="filtered.quan" type="galaxy.datatypes.metagenomics:FilteredQuantile" display_in_upload="true"/>
        <datatype extension="masked.quan" type="galaxy.datatypes.metagenomics:MaskedQuantile" display_in_upload="true"/>
        <datatype extension="filtered.masked.quan" type="galaxy.datatypes.metagenomics:FilteredMaskedQuantile" display_in_upload="true"/>
        <datatype extension="axes" type="galaxy.datatypes.metagenomics:Axes" display_in_upload="true"/>
        <datatype extension="sff.flow" type="galaxy.datatypes.metagenomics:SffFlow" display_in_upload="true"/>
    </registration>
</datatypes>
I'm probably not correctly handling the converter for your ref.taxonomy data type - I've not been able to find the ref_to_seq_taxonomy_converter.xml file. Can you pass it along to me so I can see if I have some debugging to do?
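A quick way to spot a missing converter file before uploading is sketched below. It assumes a converter counts as present if a file with the same base name exists anywhere under the repository root, which may not match how Galaxy actually resolves the file attribute; the function name missing_converter_files and the temporary directory standing in for a repository are made up for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def missing_converter_files(conf_xml, repo_root):
    """List converter files named in the config but absent under repo_root."""
    present = set()
    for _dirpath, _dirnames, filenames in os.walk(repo_root):
        present.update(filenames)
    missing = []
    for converter in ET.fromstring(conf_xml).iter("converter"):
        file_name = converter.get("file")
        if file_name and os.path.basename(file_name) not in present:
            missing.append(file_name)
    return missing


# Only the ref.taxonomy entry from the configuration is needed here.
EXAMPLE_CONF = """<datatypes>
    <registration>
        <datatype extension="ref.taxonomy" type="galaxy.datatypes.metagenomics:RefTaxonomy" display_in_upload="true">
            <converter file="ref_to_seq_taxonomy_converter.xml" target_datatype="seq.taxonomy"/>
        </datatype>
    </registration>
</datatypes>"""

# A temporary directory stands in for a checked-out repository.
with tempfile.TemporaryDirectory() as repo_root:
    before = missing_converter_files(EXAMPLE_CONF, repo_root)
    open(os.path.join(repo_root, "ref_to_seq_taxonomy_converter.xml"), "w").close()
    after = missing_converter_files(EXAMPLE_CONF, repo_root)

print(before, after)
```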
Also, I've eliminated the following entry from your README in the above file because the Newick class is not included in your metagenomics.py class module. It seems you may have included the Newick class in your local copy of ~/lib/galaxy/datatypes/data.py. If your tools use this class, it should be added to either your metagenomics.py class file or another class file in your repository, and the value of the "type" attribute in the following should be changed accordingly.
<datatype extension="tre" type="galaxy.datatypes.data:Newick" display_in_upload="true"/>
CHANGE 2
---------------

The following relative imports in your metagenomics.py class module:

import data
from sniff import *

need to look like this:

from galaxy.datatypes import data
from galaxy.datatypes.sniff import *
CHANGE 3
---------------
You can optionally choose to remove your suite_config.xml file from your repository as it is no longer used in any way.
Thanks!
Greg Von Kuster
On Oct 18, 2011, at 11:03 AM, Jim Johnson wrote:
Greg,
The mothur_toolsuite in the ToolShed contains a file with added datatypes for metagenomics (used by mothur and some by qiime):
mothur_toolsuite/mothur/lib/galaxy/datatypes/metagenomics.py
The README has info on how I incorporated mothur into our local galaxy server.
I'm also working on GMAP/GSNAP ( http://research-pub.gene.com/gmap/ ). So far I've created a GmapDB class, analogous to the ngsindex.BowtieIndex class, but with more metadata. I'm also adding an IntervalIndexTree class for indexing maps of splice junctions, introns, and SNPs. I'll send you this as soon as I've got it working.
Thanks,
JJ
galaxy-dev@lists.galaxyproject.org