All,

I've been seeing some examples of tool_dependencies.xml come across the list, and I'm wondering if there are ways that it can be simplified. When we were first defining these features, we talked about having high-level recipes for certain types of installs. This could greatly simplify things. For example, can this:

<tool_dependency>
    <package name="requests" version="1.2.3">
        <install version="1.0">
            <actions>
                <action type="download_by_url">http://pypi.python.org/packages/source/r/requests/requests-1.2.3.tar.gz</action>
                <action type="make_directory">$INSTALL_DIR/lib/python</action>
                <action type="shell_command">export PYTHONPATH=$PYTHONPATH:$INSTALL_DIR/lib/python && python setup.py install --home $INSTALL_DIR --install-scripts $INSTALL_DIR/bin</action>
                <action type="set_environment">
                    <environment_variable name="PYTHONPATH" action="append_to">$INSTALL_DIR/lib/python</environment_variable>
                    <environment_variable name="PATH" action="prepend_to">$INSTALL_DIR/bin</environment_variable>
                </action>
            </actions>
        </install>
        <readme>
        </readme>
    </package>
</tool_dependency>

be simplified to:

<tool_dependency>
    <package name="requests" version="1.2.3">
        <install recipe="python_package_setuptools"
                 url="http://pypi.python.org/packages/source/r/requests/requests-1.2.3.tar.gz" />
    </package>
</tool_dependency>

The assumptions are: when the install version is not provided, it is 1.0 (we've always maintained compatibility in the past for config files, so hopefully this never changes), and when installing a Python package the install directories and environment variables that need to be set are always the same.

Similar recipes could be:

autoconf: default to configure; make; make install; allow providing configuration options
make_install: just make; make install; allow providing make options
python_virtualenv
ruby_rbenv
r_package
...

Basically, most of the time the steps to install a particular package are boilerplate, so this would remove a ton of duplication in the recipe files. Also, a likely less popular proposal would be to go one step further, tool_dependencies.yaml:

recipe: python_package_setuptools
name: requests
version: 1.2.3
url: http://pypi.python.org/packages/source/r/requests/requests-${version}.tar.gz

-- jt
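To make the recipe idea a bit more concrete, here is a sketch of what an autoconf-style package might look like under this proposal. The recipe name, the configure_options attribute, and the package name and URL are illustrative assumptions, not existing Tool Shed syntax:

<tool_dependency>
    <package name="examplepkg" version="2.0">
        <install recipe="autoconf"
                 url="http://example.org/downloads/examplepkg-2.0.tar.gz"
                 configure_options="--with-feature-x" />
    </package>
</tool_dependency>

The recipe itself would presumably expand to the usual boilerplate: download and unpack the tarball, run ./configure --prefix=$INSTALL_DIR with any supplied options, then make and make install, and set PATH via set_environment as in the longer example above.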
James, et al.,

I think it is interesting that there was pushback on providing infrastructure (tool actions) for obtaining CBL from GitHub and performing installs based on it, because it was not in the tool shed and therefore less reproducible, but the team believes infrastructure should be put in place to support PyPI. I understand there are any number of distinctions that could be made here - perhaps you have made the calculation that PyPI is more stable than GitHub (either in terms of immutability or funding), or perhaps the setuptools mechanism is more general and could potentially support grabbing these tarballs from the tool shed (or a tool shed adjacent object store).

I think we all value reproducibility here, but we make different calculations on what is reproducible. In terms of implementing the ideas James has laid out, or similar things I have proposed, it might be beneficial to have some final answers on what external resources are allowed - both for obtaining a Galaxy IUC gold star and for the tool shed providing infrastructure to support their usage. I don't know if this takes the form of the IUC voting or James and/or Greg issuing a proclamation, but it would be good to get firm answers on these two questions for the following sites: rubygems, pypi, github, bitbucket, cpan, cran, sourceforge, and google code. It would also be great to have a process in place for deciding these questions for future repositories.

Thanks,
-John
On Mon, Aug 26, 2013 at 11:48 AM, John Chilton <chilton@msi.umn.edu> wrote:
> I think it is interesting that there was pushback on providing infrastructure (tool actions) for obtaining CBL from GitHub and performing installs based on it, because it was not in the tool shed and therefore less reproducible, but the team believes infrastructure should be put in place to support PyPI.

Well, first, I'm not sure what "the team" believes; I'm stating what I believe and engaging in a discussion with "the community". At some point this should evolve into what we are actually going to do and be codified in a spec as a Trello card, which is even then not set in stone.

Second, I'm not suggesting we depend on PyPI. The nice thing about the second format I proposed on galaxy-dev is that we can easily parse out the URL and archive that file. Then someday we could provide a fallback repository where if the PyPI URL no longer works we still have it stored.

> I think we all value reproducibility here, but we make different calculations on what is reproducible. In terms of implementing the ideas James has laid out, or similar things I have proposed, it might be beneficial to have some final answers on what external resources are allowed - both for obtaining a Galaxy IUC gold star and for the tool shed providing infrastructure to support their usage.

My focus is ensuring that we can archive things that pass through the toolshed. Tarballs from *anywhere* are easy enough to deal with. External version control repositories are a bit more challenging, especially when you are pulling just a particular file out, so that's where things got a little hinky for me.

Since we don't have the archival mechanism in place yet anyway, this is more a philosophical discussion and setting the right precedent.

And yes, keeping an archive of all the software in the world is a scary prospect, though compared to the amount of data we currently keep for people it is a blip. And I'm not sure how else we can really achieve the level of reproducibility we desire.
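As a concrete illustration of the archiving described above, here is a minimal sketch in Python, assuming the tool_dependencies.yaml format proposed at the start of the thread and a hypothetical local archive directory; none of these names or paths come from actual Tool Shed code:

import os
import shutil
import urllib2  # Python 2, as Galaxy used at the time; urllib.request in Python 3

import yaml  # PyYAML


ARCHIVE_DIR = "/srv/toolshed/archive"  # hypothetical fallback store


def fetch_with_fallback(dependency_yaml):
    """Return a local copy of the tarball named in a tool_dependencies.yaml.

    The archived copy is used if one already exists; otherwise the file is
    fetched from the upstream URL (e.g. PyPI) and stored, so the install
    keeps working even if the upstream URL later disappears.
    """
    with open(dependency_yaml) as fh:
        dep = yaml.safe_load(fh)
    # The proposed format interpolates ${version} into the url.
    url = dep["url"].replace("${version}", dep["version"])
    filename = url.rsplit("/", 1)[-1]
    archive_path = os.path.join(ARCHIVE_DIR, dep["name"], dep["version"], filename)
    if os.path.exists(archive_path):
        return archive_path
    archive_dir = os.path.dirname(archive_path)
    if not os.path.isdir(archive_dir):
        os.makedirs(archive_dir)
    remote = urllib2.urlopen(url)
    try:
        with open(archive_path, "wb") as out:
            shutil.copyfileobj(remote, out)
    finally:
        remote.close()
    return archive_path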
On Aug 26, 2013, at 11:59 AM, James Taylor wrote:
> On Mon, Aug 26, 2013 at 11:48 AM, John Chilton <chilton@msi.umn.edu> wrote:
>
>> I think it is interesting that there was pushback on providing infrastructure (tool actions) for obtaining CBL from GitHub and performing installs based on it, because it was not in the tool shed and therefore less reproducible, but the team believes infrastructure should be put in place to support PyPI.
>
> Well, first, I'm not sure what "the team" believes; I'm stating what I believe and engaging in a discussion with "the community". At some point this should evolve into what we are actually going to do and be codified in a spec as a Trello card, which is even then not set in stone.
>
> Second, I'm not suggesting we depend on PyPI. The nice thing about the second format I proposed on galaxy-dev is that we can easily parse out the URL and archive that file. Then someday we could provide a fallback repository where if the PyPI URL no longer works we still have it stored.
I concur here; the experience and lessons learned by long-established package and dependency managers can provide some useful guidance for us going forward. APT has long relied on a model of archiving upstream source (as well as distro-generated binary (dpkg) packages), cataloging changes as a set of patches, and maintaining an understanding of installed files, even those meant to be user-edited. I think there is a strong advantage for us in doing this as well.
>> I think we all value reproducibility here, but we make different calculations on what is reproducible. In terms of implementing the ideas James has laid out, or similar things I have proposed, it might be beneficial to have some final answers on what external resources are allowed - both for obtaining a Galaxy IUC gold star and for the tool shed providing infrastructure to support their usage.
>
> My focus is ensuring that we can archive things that pass through the toolshed. Tarballs from *anywhere* are easy enough to deal with. External version control repositories are a bit more challenging, especially when you are pulling just a particular file out, so that's where things got a little hinky for me.
>
> Since we don't have the archival mechanism in place yet anyway, this is more a philosophical discussion and setting the right precedent.
>
> And yes, keeping an archive of all the software in the world is a scary prospect, though compared to the amount of data we currently keep for people it is a blip. And I'm not sure how else we can really achieve the level of reproducibility we desire.
One additional step that will assist with long-term archival is generating static metadata and allowing the packaging and dependency systems to work outside of the Galaxy and Tool Shed applications. A package metadata catalog and package format that provide descriptions of packages on a generic webserver, installable without a running Galaxy instance, are components that I believe are fairly important.

As for user-edited files, the env.sh files, which are generated at install time and then essentially untracked afterward, scare me a bit. I think it'd be useful for the packaging system to have a tighter concept of environment management.

These are just my opinions, of course, and are going to be very APT/dpkg-biased simply due to my experience with and favor for Debian-based distros and dependency/package management, but I think there are useful concepts in this (and other systems) that we can draw from.

Along those lines, one more idea I had thrown out a while ago was coming up with a way to incorporate (or at least automatically process, so that we can convert to our format) the build definitions for other systems like MacPorts, BSD ports/pkgsrc, dpkg, rpm, etc., so that we can leverage the existing rules for building across our target platforms that have already been worked out by other package maintainers with more time. I think this aligns pretty well with Brad's thinking with CloudBioLinux, the difference in implementation being that we require multiple installable versions and platform independence.

I am a bit worried that as we go down the "repackage (almost) all dependencies" path (which I do think is the right path), we also run the risk of most of our packages being out of date. That's almost a guaranteed outcome when even the huge packaging projects (Debian, Ubuntu, etc.) are rife with out-of-date packages. So being able to incorporate upstream build definitions may help us package dependencies quickly.

--nate
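For concreteness, the env.sh written for the requests example at the top of the thread would be roughly of this shape; the exact path layout and formatting vary by Galaxy version, and the path below is just a placeholder:

PYTHONPATH=$PYTHONPATH:/path/to/tool_dependencies/requests/1.2.3/lib/python; export PYTHONPATH
PATH=/path/to/tool_dependencies/requests/1.2.3/bin:$PATH; export PATH

Once written, nothing records how the file was produced or whether an admin has since edited it by hand, which is the gap that tighter environment management (or a static metadata catalog describing the package) would close.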
Before I went on that tangent, I should have said I of course agree with 100% of what James said in the original e-mail on this thread. For what it is worth, I believe the higher-level constructs he outlined are essential to the long-term adoption of the tool shed.

On Tue, Aug 27, 2013 at 1:59 PM, Nate Coraor <nate@bx.psu.edu> wrote:
> Along those lines, one more idea I had thrown out a while ago was coming up with a way to incorporate (or at least automatically process, so that we can convert to our format) the build definitions for other systems like MacPorts, BSD ports/pkgsrc, dpkg, rpm, etc., so that we can leverage the existing rules for building across our target platforms that have already been worked out by other package maintainers with more time. I think this aligns pretty well with Brad's thinking with CloudBioLinux, the difference in implementation being that we require multiple installable versions and platform independence.
The CloudBioLinux Galaxy tool stuff used by Galaxy-P and CloudMan, and integrated into tool shed installs with pull request 207, is platform independent (or as platform independent as the tool shed) and allows multiple installable versions.
Hi James, thanks for your thoughts on abstraction of common tasks. For most of these things we now have patches in Bitbucket.
> Similar recipes could be:
>
> autoconf: default to configure; make; make install, allow providing configuration options
https://bitbucket.org/galaxy/galaxy-central/pull-request/218/implementation-...
> make_install: just make; make install; allow providing make options
https://bitbucket.org/galaxy/galaxy-central/pull-request/217/implementation-...
> python_virtualenv
Is that not supposed to work with the existing 'setup_virtualenv' action?
> ruby_rbenv
From John: https://bitbucket.org/galaxy/galaxy-central/pull-request/207/john-chiltons-a...
> r_package
https://bitbucket.org/galaxy/galaxy-central/pull-request/219/implementation-...

Cheers,
Bjoern
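For reference, the requests example from the start of the thread written against the 'setup_virtualenv' action mentioned above might look roughly like this, assuming that action accepts pip-style requirement lines in its body (worth confirming against the current action documentation):

<tool_dependency>
    <package name="requests" version="1.2.3">
        <install version="1.0">
            <actions>
                <action type="setup_virtualenv">requests==1.2.3</action>
            </actions>
        </install>
    </package>
</tool_dependency>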
Participants (4): Bjoern Gruening, James Taylor, John Chilton, Nate Coraor