Hi John,

The general question, I think, is whether reproducibility is important.  If it is, then we should not introduce new behavior that adversely impacts it.  There are undoubtedly scenarios where reproducibility is not currently absolutely guaranteed, but if reproducibility is one of the desired features, those areas of weakness should be corrected (as time and resources allow) when they are discovered.

Please see my inline comments too.

On Jul 18, 2014, at 11:59 AM, John Chilton <jmchilton@gmail.com> wrote:

Does the current implementation really handle datatypes in a reproducible manner? If I have a repo which in revision 1 defines foo1 as a text subtype, foo2 as a tabular type and foo3 as a new type in foo.py, and then in revision 2 foo1 is defined as a binary subtype, foo2 and foo3 disappear, and foo4 is a new type in foo.py (which no longer defines foo3), how could you possibly resolve that in a "reproducible" manner?

So you have:

repo_a revision 1:
foo1 datatype as text subtype
foo2 datatype as tabular
foo3 new datatype in foo.py

repo_a revision 2:
foo1 datatype as binary subtype
foo4 new type in foo.py

I would say that this is an example of a "bad practice" on the part of the repository owner, but, of course, this scenario can certainly occur.  In this case, the current implementation creates 2 separate installable revisions of repo_a which are loaded into the datatypes registry in a specific order.  If repo_a revision 1 was installed first, then it will always be loaded first, and the foo1 and foo4 datatypes contained in repo_a revision 2 will not be loaded because they are currently considered conflicting datatypes.  So currently, reproducibility is ensured, but the versions of foo1 and foo4 in revision 2 cannot be used.  This may not be ideal, but in order to allow both versions to be used, more than the datatype extensions will be needed in order to differentiate datatypes (i.e., some namespaced identifier similar to the Tool Shed's guid for tools).
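
To make the load-order behavior above concrete, here is a minimal sketch in Python.  It is not Galaxy's actual datatypes registry: the FirstWinsRegistry class, its load method, the "repo_a:1" style revision labels and the namespaced_key helper are all hypothetical.  It also assumes one possible reading of why foo4 conflicts, namely that a module file (foo.py) already supplied by an earlier-loaded revision cannot be supplied again by a later one.

```python
# Hypothetical sketch; names and conflict rules are assumptions, not Galaxy's code.
class FirstWinsRegistry:
    def __init__(self):
        self.extensions = {}  # extension -> (revision, definition)
        self.modules = {}     # module file name -> revision that supplied it
        self.skipped = []     # (revision, extension) pairs that were not loaded

    def load(self, revision, datatypes):
        """Load one installed revision; earlier-loaded definitions always win."""
        for extension, definition, module in datatypes:
            module_owner = self.modules.get(module) if module else None
            # Conflict if the extension is already registered, or (assumption)
            # if the module file was already supplied by a different revision.
            if extension in self.extensions or module_owner not in (None, revision):
                self.skipped.append((revision, extension))
                continue
            self.extensions[extension] = (revision, definition)
            if module:
                self.modules[module] = revision


# (extension, definition, module file or None), mirroring the repo_a example.
rev1 = [("foo1", "text subtype", None),
        ("foo2", "tabular", None),
        ("foo3", "new datatype", "foo.py")]
rev2 = [("foo1", "binary subtype", None),
        ("foo4", "new datatype", "foo.py")]
installed = [("repo_a:1", rev1), ("repo_a:2", rev2)]  # installation order

registry = FirstWinsRegistry()
for revision, datatypes in installed:
    registry.load(revision, datatypes)


def namespaced_key(revision, extension):
    """Hypothetical guid-like identifier: the revision plus the extension."""
    return f"{revision}/{extension}"


# With a namespaced key, no two definitions collide, so all of them can coexist.
namespaced = {
    namespaced_key(revision, extension): definition
    for revision, datatypes in installed
    for extension, definition, _module in datatypes
}

print(sorted(registry.extensions))  # extensions that were loaded
print(registry.skipped)             # revision 2 entries that were skipped
print(sorted(namespaced))           # every definition, keyed by namespaced id
```

In this sketch the result depends only on installation order, which is what keeps a single instance reproducible, and the namespaced dictionary shows why an identifier beyond the bare extension would be needed for both versions of foo1 to be usable.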


Some of your tools are going to expect foo1 to be one thing - others something else. You are only going to place 1 copy of foo.py on the PYTHONPATH, right (or at least Python will only load one)? Is it going to define foo3 or foo4? In addition to lacking reproducibility within one instance - if you are somehow trying to preserve all the datatypes a repository has ever defined, I feel like after a long stream of such updates the behavior of the datatypes is going to vary from one installation to another that installed different repository versions. Hence - reproducibility across instances is subtly broken as well?

None of this is a solution of course - this problem strikes me as being very difficult. 

That said - I think correctness and reproducibility across instances is more important than reproducibility within the same instance over time - so for that reason I think there only being one installable revision of datatypes might be a big step forward relative to the status quo. Intuitively - if we are not namespacing/versioning datatypes - there should only be one definition and it should be the most recently installed one, right?

It would also resolve this problem (https://trello.com/c/oTq2Kewd), where unsniffable binary datatypes are treated as sniffable if there was ever an installed version that was some sniffable datatype.

-John

On Jul 17, 2014 12:35 PM, "Greg Von Kuster" <greg@bx.psu.edu> wrote:
This would be easy to implement, but could adversely affect reproducibility.  If a repository containing datatypes always had only a single installable revision (i.e., the changelog tip), then any datatypes defined in an earlier changeset revision that are removed in a later changeset revision would no longer be available.
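
As a small illustration of that trade-off, the Python sketch below compares what a new installation could obtain when every revision is installable versus when only the changelog tip is.  The revisions list simply mirrors the repo_a example used elsewhere in this thread, and the available function is a hypothetical helper, not a Tool Shed API.

```python
# Hypothetical data mirroring the repo_a example; not read from a real Tool Shed.
revisions = [
    ("1", {"foo1", "foo2", "foo3"}),
    ("2", {"foo1", "foo4"}),  # changelog tip
]


def available(revisions, tip_only):
    """Return the datatype extensions a new installation could obtain."""
    if tip_only:
        # Single installable revision: only the tip's datatypes exist.
        return set(revisions[-1][1])
    # Every revision installable: the union of all datatypes ever defined.
    return set().union(*(names for _, names in revisions))


lost = available(revisions, tip_only=False) - available(revisions, tip_only=True)
print(sorted(lost))  # prints ['foo2', 'foo3']
```

Any history or workflow that relied on the extensions in lost could not be reproduced on an instance that installed the repository after the tip moved.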

Greg

On Jul 17, 2014, at 1:30 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

> On Thu, Jul 17, 2014 at 6:10 PM, Björn Grüning
> <bjoern.gruening@gmail.com> wrote:
>>
>> ... but the problem will stay the same ... one [datatype definition] repository
>> can have multiple versions ...
>>
>
> I like your idea that like tool dependency definitions, this should be a special
> repository type on the ToolShed:
>
> Earlier, Björn Grüning <bjoern.gruening@gmail.com> wrote:
>>
>> Imho datatypes should be handled like "Tool dependency definitions".
>> There should be only one "installable revision".
>>
>
> This is something Greg will have to comment on - there may be
> ramifications I'm not seeing.
>
> Peter
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>  http://lists.bx.psu.edu/
>
> To search Galaxy mailing lists use the unified search at:
>  http://galaxyproject.org/search/mailinglists/
>
