There are a number of well-defined formats that are exchanged between
applications (e.g. BAM, GTF), and I wouldn't advocate proliferating those.
I see the need for Toolshed datatypes more for the intermediate file formats used within
a suite of commands. These can be helpful in guiding a user to select appropriate inputs
for successive steps in an analysis.
For example, in developing the 90-some tool wrappers for the mothur metagenomics
package, I found that many file formats get passed among the mothur commands. It
greatly simplifies the user's experience if the outputs are typed so as to correctly
filter the acceptable inputs to another command. I fear the amount of time I would spend
providing user support if the outputs and inputs were generically typed.
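To illustrate how typed outputs narrow the choices offered for a downstream input, here is a minimal sketch in plain Python. It does not use Galaxy's actual API: the classes, the `accepts` helper, and the example history are all hypothetical, and the format names are only for illustration.

```python
# Hypothetical sketch, not Galaxy's API. Datatypes are modelled as Python
# classes so that "is an acceptable input" becomes a subclass check.

class Data:
    """Generic, untyped dataset."""


class Tabular(Data):
    """Tab-separated text."""


class SecondaryStructureMap(Tabular):
    """Illustrative specific type passed between commands in a suite."""


class DistanceMatrix(Data):
    """Another illustrative specific type."""


def accepts(input_type, history):
    """Return names of history items whose datatype is input_type or a subclass."""
    return [name for name, dtype in history if issubclass(dtype, input_type)]


history = [
    ("reads.tab", Tabular),
    ("silva.ssmap", SecondaryStructureMap),
    ("sample.dist", DistanceMatrix),
]

# An input typed as SecondaryStructureMap is offered a single candidate,
# while an input typed generically as Data is offered every dataset.
print(accepts(SecondaryStructureMap, history))
print(accepts(Data, history))
```

With generic typing the user has to know which of the datasets is the right one; with specific typing the selection list does that work.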
An approach for simplifying this is to include one or more exported Galaxy workflows in
the tool shed repository along with the tools. The workflows cannot currently be
automatically imported into Galaxy, but they can be manually imported, providing the user
an idea of the steps in the analyses for which the tools are intended. Additional
features related to Galaxy workflows included in Galaxy tool shed repositories will be
available in future Galaxy releases.
I'm also seeing a similar need as I create tool wrappers for the
GMAP/GSNAP mapping commands. While GSNAP and GMAP can take FASTQ as input and write SAM
as output, some of the more interesting use cases involve creating additional map stores,
where specific datatypes would guide the user in setting the tool parameters correctly.
JJ
James E Johnson
Minnesota Supercomputing Institute, University of Minnesota
On 10/10/11 11:09 AM, Duddy, John wrote:
> I agree with the risks you cited.
>
> There is a risk in the other direction that I think is even scarier - without the
> ability to add data types, tool authors may be forced to use a "typeless"
> system, declaring all inputs/outputs as "data" or "text". While this
> works, it has the same drawbacks as typeless programming languages - deferring error
> detection to runtime, impairing the ability to perform static analysis, inability to
> perform transparent type conversions - in other words, the tools have to take over
> responsibilities from the framework.
>
> Like all interesting problems, I don't think there is an "obviously
> right" answer ;-}
>
> John Duddy
> Sr. Staff Software Engineer
> Illumina, Inc.
> 9885 Towne Centre Drive
> San Diego, CA 92121
> Tel: 858-736-3584
> E-mail: jduddy(a)illumina.com
>
>
> -----Original Message-----
> From: galaxy-dev-bounces(a)lists.bx.psu.edu [mailto:galaxy-dev-bounces@lists.bx.psu.edu] On Behalf Of Paniagua, Eric
> Sent: Friday, October 07, 2011 5:53 PM
> To: jj(a)umn.edu; galaxy-dev(a)lists.bx.psu.edu
> Cc: Greg Von Kuster
> Subject: Re: [galaxy-dev] Tool shed and datatypes
>
> Hi all,
>
> Just my 2 cents.
>
> This is a really great idea to have dynamically (down-)loadable datatypes, and a tool
> config tag to express a datatype dependency is right on the money. I agree with Greg in
> having hesitations about adding that feature though. The purpose (at least as far as I see
> it) of the tool shed is to allow the community to share its productivity. New tools
> written by one group can be used by another group that may not have adequate skill,
> resources, or time to create the same tool on their own. One issue this model can suffer
> from, however, is over-proliferation of contributions. In this case, new tools with the
> same, overlapping, or very similar functions might be developed independently by multiple
> groups who then want to contribute to the tool shed. I don't know how often this
> situation arises or what official contingencies are in place to manage it, but it is
> important to manage that situation carefully. If it occurs with any appreciable
> frequency, then eventually there are many clusters of tools available that do almost the
> same thing but not quite. This is bad for the user, bad for the maintainer, complicates
> communication between researchers, etc. This model can work nicely if the frequency of
> very similar tool submissions is small, and even better if there is some management for
> cleaning out broken or redundant tools.
>
> When you allow custom datatypes to enter the picture, however, the story can become
> hairy much more quickly. Having a limited set of officially supplied / supported
> datatypes forces the contributors of new tools to use datatypes drawn from a standard set.
> Without that constraint, the number of datatype variants could explode. Now the concern
> is not only that multiple contributors may submit very similar tool variants, or that each
> of them might choose to create their own datatypes to optimize their methods, but also
> that contributors of tools which are functionally dissimilar but manipulate the same
> general types of data will write their tools using new datatypes that are variants of each
> other. Tools are essentially typed by the datatypes they accept and produce, so you
> won't be able to chain these tools together very easily at all. Most pairs of tools
> will have the "wrong" datatype, on input or output, for what a user wants to do.
> The general trend is then proliferation of clusters of redundant tools, clusters of
> redundant datatypes, and growing sparsity in the "tool graph" (think of datatypes as
> vertices and tools as directed [hyper]edges).
>
> So, a move in the direction of supporting something like a "TypeShed" would
> require careful consideration and would need at least either a well-defined policy for
> managing *Shed rot and the capability to execute it, or a very slick tool / datatype
> versioning system with flexible control for users and some also very slick method for
> maintaining implicit conversions between the datatypes in a datatype cluster (ideally
> automatically generated). I think at least the implicit conversion part can be done, if
> not in a fully automated manner, then by a combination of policy and engineering. For
> policy, you can define, identify, or construct a canonical datatype in each cluster and
> require that a contributor of a variant datatype submit methods for implicit conversion
> to/from the canonical datatype in that cluster. One idea that could help reduce
> complexity is to place some additional structure on datatypes and take the canonical
> datatype for a cluster to be a form of the union (mathematical, not the "union" from C)
> of the variants in the cluster, which would simplify implicit conversions somewhat. Or,
> if there's some reason for this, there can also be a set of "canonical" datatypes for
> each cluster, so long as they are all guaranteed to be mutually implicitly convertible.
> For a policy to manage *Shed rot, the most direct approach is to moderate and require
> approval for each submission, but I could imagine that responsibility quickly
> overwhelming the poor team responsible. Unless I drastically overestimate the frequency
> with which submissions might be made (which is entirely possible), that poor team's
> operations could wind up looking not unlike the USPTO.
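The "convert through a canonical datatype" policy described above can be sketched roughly as follows. This is plain Python with made-up variant names and record fields, not anything from Galaxy; it only assumes that each variant supplies one converter to the canonical form and one converter back.

```python
# Hypothetical sketch of conversion through a canonical datatype.
# Records are plain dicts; the variant names and fields are invented.

CANONICAL = "canonical"

# Each variant contributes exactly two converters: to and from the canonical form.
to_canonical = {}
from_canonical = {}


def register(variant, to_fn, from_fn):
    """Register a variant datatype together with its pair of converters."""
    to_canonical[variant] = to_fn
    from_canonical[variant] = from_fn


def convert(record, src, dst):
    """Convert between any two registered variants via the canonical form."""
    if src == dst:
        return record
    canonical = record if src == CANONICAL else to_canonical[src](record)
    return canonical if dst == CANONICAL else from_canonical[dst](canonical)


# The canonical record holds the union of the fields used by the variants.
register(
    "name_only",
    lambda r: {"name": r["name"], "score": None},
    lambda c: {"name": c["name"]},
)
register(
    "name_score",
    lambda r: {"name": r["name"], "score": r["score"]},
    lambda c: {"name": c["name"], "score": c["score"]},
)

print(convert({"name": "seq1", "score": 7}, "name_score", "name_only"))
```

With n variants in a cluster this needs 2n converters rather than one for every ordered pair, which is the reduction in complexity the union idea is after. Fields missing from a variant come back as `None` here, so round trips through a narrower variant lose information.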
>
> Anyway, my general point is that there are many non-trivial factors to consider in
> the question of creating a TypeShed. But, if done right, the benefits could be huge,
> besides the likely awesomeness of the engineering involved.
>
> Finally, let me echo Greg again, and say to please send additional thought and
> feedback. What do you think about the points I raised? What else is there to consider
> that hasn't occurred to me yet? What would be the benefits and potential pitfalls?
>
> Best,
> Eric
>
> ________________________________________
> From: galaxy-dev-bounces(a)lists.bx.psu.edu [galaxy-dev-bounces(a)lists.bx.psu.edu] on behalf of Jim Johnson [jj(a)umn.edu]
> Sent: Friday, October 07, 2011 2:06 PM
> To: galaxy-dev(a)lists.bx.psu.edu
> Cc: Greg Von Kuster
> Subject: Re: [galaxy-dev] Tool shed and datatypes
>
> Greg,
>
> It would be great if there were a way to expand upon the core datatypes using the
> ToolShed.
>
> Would it be possible to have a separate datatype repository within the ToolShed?
>
> Datatype
> name=""
> description=""
> datatype_dependencies=[]
> definition=<python code>
>
>
> The tool config could be expanded to have requirement for datatypes.
> <requirement type="datatype">ssmap</requirement>
>
>
>
>
> Table datatype
>    Column    |          Type          | Modifiers
> -------------+------------------------+------------------------------------------------------
>  id          | integer                | not null default nextval('datatype_id_seq'::regclass)
>  name        | character varying(255) |
>  version     | character varying(40)  |
>  description | text                   |
>  definition  | text                   |
>     UNIQUE (name)
>
> Table datatype_datatype_association
>    Column    |  Type   | Modifiers
> -------------+---------+------------------------------------------------------
>  id          | integer | not null default nextval('datatype_id_seq'::regclass)
>  datatype_id | integer |
>  requires_id | integer |
>     FOREIGN KEY (datatype_id) REFERENCES datatype(id)
>     FOREIGN KEY (requires_id) REFERENCES datatype(id)
>
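One thing the association table makes possible is working out the order in which datatypes must be loaded. A minimal sketch, assuming the rows have already been read into `(datatype, requires)` name pairs; the `load_order` function is hypothetical and uses only the Python standard library (`graphlib`, Python 3.9+):

```python
# Hypothetical sketch: compute a load order from (datatype, requires) pairs,
# mirroring rows of the proposed datatype_datatype_association table.
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def load_order(associations, names):
    """Return datatype names ordered so each comes after everything it requires."""
    # Start with every known datatype so that types without dependencies appear too.
    sorter = TopologicalSorter({name: set() for name in names})
    for datatype, requires in associations:
        sorter.add(datatype, requires)
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return list(sorter.static_order())


order = load_order(
    associations=[("ssmap", "tabular")],
    names=["tabular", "ssmap", "fastq"],
)
print(order)
```

Here `tabular` is guaranteed to precede `ssmap`; the position of unrelated types such as `fastq` is not significant.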
>
> Then for my mothur metagenomics tools I could define:
>
> name="ssmap" description="Secondary Structure Map" version="1.0" datatype_dependencies=[tabular]
> definition=
>     from galaxy.datatypes.tabular import Tabular
>
>     class SecondaryStructureMap(Tabular):
>         file_ext = 'ssmap'
>
>         def __init__(self, **kwd):
>             """Initialize secondary structure map datatype"""
>             Tabular.__init__(self, **kwd)
>             self.column_names = ['Map']
>
>         def sniff(self, filename):
>             """
>             Determines whether the file is in secondary structure map format:
>             a single column with an integer value that indicates the row that this row maps to.
>             Check that the mapping is symmetric: if structMap[10] = 380, then structMap[380] = 10.
>             """
>             ...
>
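The symmetry check described in the `sniff` docstring could look roughly like the sketch below. It is a standalone illustration, not the body of the method above: the function names are hypothetical, rows are assumed to be numbered from 1, and a value of 0 is assumed to mark a row that maps to nothing.

```python
def is_symmetric_map(values):
    """Check that values[i] = j implies values[j] = i, using 1-based row numbers.

    Assumption: 0 marks a row that maps to nothing.
    """
    n = len(values)
    for row, target in enumerate(values, start=1):
        if target == 0:
            continue
        if not 1 <= target <= n:
            return False  # points outside the file
        if values[target - 1] != row:
            return False  # partner row does not point back
    return True


def looks_like_ssmap(filename):
    """Read one integer per non-blank line and apply the symmetry check."""
    try:
        with open(filename) as handle:
            values = [int(line.strip()) for line in handle if line.strip()]
    except (OSError, ValueError):
        return False
    return bool(values) and is_symmetric_map(values)


print(is_symmetric_map([2, 1, 0, 5, 4]))
```

This reads the whole file, which is fine for a sketch; a real sniffer would more likely have to decide from a limited prefix, where a partner row may not yet have been seen.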
>
>
>
> Then the align.check.xml tool_config could require the 'ssmap' datatype:
>
> <tool id="mothur_align_check" name="Align.check" version="1.19.0">
> <description>Calculate the number of potentially misaligned bases</description>
> <requirements>
> <requirement type="binary">mothur</requirement>
> <requirement type="datatype">ssmap</requirement>
> </requirements>
>
>
>
>
>
>
>
>
>
>> John,
>>
>> I've been following this message thread, and it seems it's gone in a
>> direction that differs from your initial question about the possibility for Galaxy to
>> handle automatic editing of the datatypes_conf.xml file when certain Galaxy tool shed
>> tools are automatically installed. There are some complexities to consider in attempting
>> this. One of the issues to consider is that the work for adding support for a new
>> datatype to Galaxy lies outside of the intended function of the tool shed. If new support
>> is added to the Galaxy code base, an entry for that new datatype should be manually added
>> to the table at the same time. There may be benefits to enabling automatic changes to
>> datatype entries that already exist in the file (e.g., adding a new converter for an
>> existing datatype entry), but perhaps adding a completely new datatype to the file may not
>> be appropriate. I'll continue to think about this - send additional thoughts and
>> feedback, as doing so is always helpful.
>>
>> Thanks!
>>
>> Greg
>>
>>
>> On Oct 5, 2011, at 11:48 PM, Duddy, John wrote:
>>
>>> One of the things we're facing is the sheer size of a whole human genome
>>> at 30x coverage. An effective way to deal with that is by compressing the FASTQ files.
>>> That works for BWA and our ELAND, which can directly read a compressed FASTQ, but other
>>> tools crash when reading compressed FASTQ files. One way to address that would be to
>>> introduce a new type, for example "CompressedFastQ", with a conversion to FASTQ
>>> defined. BWA could take both types as input. This would allow the best of both worlds -
>>> efficient storage and use by all existing tools.
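As a rough illustration of the "CompressedFastQ with a conversion to FASTQ" idea, the sketch below shows only the conversion step, in standalone Python. It assumes the compression is gzip (the message above does not say which format is used), and the function names are hypothetical rather than part of Galaxy.

```python
# Hypothetical sketch of a CompressedFastQ -> FASTQ converter.
# Assumption: the compressed files are gzip files.
import gzip
import shutil


def is_gzipped(path):
    """Return True if the file starts with the two-byte gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"


def compressed_fastq_to_fastq(src_path, dst_path):
    """Write an uncompressed copy of a gzip-compressed FASTQ file."""
    # copyfileobj streams in chunks, so large files are not held in memory.
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

A tool that reads compressed input directly would accept both types, while any other tool would receive the output of the converter.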
>>>
>>> Another example would be adding the CASAVA tools to Galaxy. Some of the
>>> statistics generation tools use custom file formats. To be able to make the use of those
>>> tools optional and configurable, they should be separate from the aligner, but that would
>>> require that Galaxy be made aware of the custom file formats - we'd have to add a
>>> datatype.
>>>
>>> John Duddy
>>> Sr. Staff Software Engineer
>>> Illumina, Inc.
>>> 9885 Towne Centre Drive
>>> San Diego, CA 92121
>>> Tel: 858-736-3584
>>> E-mail: jduddy at illumina.com
>>>
>>> From: Greg Von Kuster [mailto:greg at bx.psu.edu]
>>> Sent: Wednesday, October 05, 2011 6:25 PM
>>> To: Duddy, John
>>> Cc: galaxy-dev at lists.bx.psu.edu
>>> Subject: Re: [galaxy-dev] Tool shed and datatypes
>>>
>>> Hello John,
>>>
>>> The Galaxy tool shed currently is not enabled to automatically edit the
>>> datatypes_conf.xml file, although I could add this feature if the need exists. Can you
>>> elaborate on what you are looking to do regarding this?
>>>
>>> Thanks!
>>>
>>>
>>> On Oct 5, 2011, at 1:52 PM, Duddy, John wrote:
>>>
>>>
>>> Can we introduce new file types via tools in the tool shed? It seems Galaxy
>>> can load them if they are in the datatypes configuration file. Does tool installation
>>> automate the editing of that file?
>>>
>>>
>>> John Duddy
>>> Sr. Staff Software Engineer
>>> Illumina, Inc.
>>> 9885 Towne Centre Drive
>>> San Diego, CA 92121
>>> Tel: 858-736-3584
>>> E-mail: jduddy at illumina.com
>>>
>>> ___________________________________________________________
>>> Please keep all replies on the list by using "reply all"
>>> in your mail client. To manage your subscriptions to this
>>> and other Galaxy lists, please use the interface at:
>>>
>>> http://lists.bx.psu.edu/
>>>
>>> Greg Von Kuster
>>> Galaxy Development Team
>>> greg at bx.psu.edu
>>>