On Aug 26, 2013, at 11:59 AM, James Taylor wrote:
On Mon, Aug 26, 2013 at 11:48 AM, John Chilton <chilton@msi.umn.edu> wrote:
I think it is interesting that there was pushback on providing infrastructure (tool actions) for obtaining CloudBioLinux (CBL) from GitHub and performing installs based on it, on the grounds that it was not in the tool shed and therefore less reproducible, yet the team believes infrastructure should be put in place to support PyPI.
Well, first, I'm not sure what "the team" believes; I'm stating what I believe and engaging in a discussion with "the community". At some point this should evolve into what we are actually going to do and be codified in a spec as a Trello card, and even then it is not set in stone.
Second, I'm not suggesting we depend on PyPI. The nice thing about the second format I proposed on galaxy-dev is that we can easily parse out the URL and archive that file. Then someday we could provide a fallback repository, so that if the PyPI URL no longer works we still have the file stored.
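To make that concrete, here is a rough sketch of what parsing and archiving could look like. The <action type="download_by_url"> shape, the archive layout, and the paths are assumptions for illustration, not necessarily the format proposed on galaxy-dev:

    import hashlib
    import os
    import urllib2
    from xml.etree import ElementTree

    ARCHIVE_ROOT = "/srv/toolshed/archive"  # hypothetical fallback repository

    def archive_download_actions(dependency_xml_path):
        """Fetch every download_by_url action's target and store it by hash."""
        tree = ElementTree.parse(dependency_xml_path)
        for action in tree.iter("action"):
            if action.get("type") != "download_by_url":
                continue
            url = action.text.strip()
            data = urllib2.urlopen(url).read()
            # Key the archive by content hash so the same tarball, whether
            # fetched from PyPI or anywhere else, is stored exactly once.
            digest = hashlib.sha256(data).hexdigest()
            target_dir = os.path.join(ARCHIVE_ROOT, digest)
            if not os.path.isdir(target_dir):
                os.makedirs(target_dir)
            with open(os.path.join(target_dir, os.path.basename(url)), "wb") as fh:
                fh.write(data)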
I concur here; the experience and lessons learned by long-established package and dependency managers can provide some useful guidance for us going forward. APT has long relied on a model of archiving upstream source (as well as distro-generated binary (dpkg) packages), cataloging changes as a set of patches, and maintaining an understanding of installed files, even those meant to be user-edited. I think there is a strong advantage for us in doing this as well.
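For anyone who hasn't looked closely at how APT/dpkg does it: the pristine upstream tarball is kept verbatim, local changes live as an ordered series of patches, and the package enumerates every file it installs, flagging user-editable ones ("conffiles"). A toy record mirroring that layout for our archive might look like the following; every field name below is invented for illustration:

    # Illustrative only: an APT-style archival record. Upstream source is
    # stored untouched, our changes are discrete patches applied at build
    # time, and installed files are enumerated so the installer knows what
    # it owns, including files the user is expected to edit.
    package_record = {
        "name": "samtools",
        "upstream_version": "0.1.19",
        "upstream_tarball": "samtools-0.1.19.tar.bz2",  # archived verbatim
        "patches": [
            # applied in order on top of the pristine source
            "0001-fix-makefile-prefix.patch",
            "0002-link-against-shared-zlib.patch",
        ],
        "installed_files": {
            "bin/samtools": "sha256-of-file-contents",
            "env.sh": None,  # user-editable, tracked but not checksummed
        },
    }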
I think we all value reproducibility here, but we make different calculations about what is reproducible. In terms of implementing the ideas James has laid out, or the similar things I have proposed, it might be beneficial to have some final answers on which external resources are allowed, both for obtaining a Galaxy IUC gold star and for the tool shed providing infrastructure to support their usage.
My focus is ensuring that we can archive things that pass through the toolshed. Tarballs from *anywhere* are easy enough to deal with. External version control repositories are a bit more challenging, especially when you are pulling just a particular file out, so that's where things got a little hinky for me.
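One way to tame the single-file case is to refuse floating refs entirely: pin to a full commit SHA and key the archive on (repo, commit, path), so the fetch is reproducible and the archived copy can transparently stand in for GitHub later. A minimal sketch, where the function and policy are assumptions rather than existing tool shed code:

    import urllib2

    def fetch_github_file(owner, repo, commit, path):
        """Fetch a single file pinned to an immutable commit SHA."""
        if len(commit) != 40:
            raise ValueError("pin to a full commit SHA, not a branch or tag")
        # GitHub serves raw file contents at a stable per-commit URL.
        url = "https://raw.github.com/%s/%s/%s/%s" % (owner, repo, commit, path)
        data = urllib2.urlopen(url).read()
        # (owner, repo, commit, path) is immutable, so it doubles as the
        # archive key for a fallback copy.
        return (owner, repo, commit, path), data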
Since we don't have the archival mechanism in place yet anyway, this is more of a philosophical discussion about setting the right precedent.
And yes, keeping an archive of all the software in the world is a scary prospect, though compared to the amount of data we currently keep for people it is a blip. And I'm not sure how else we can really achieve the level of reproducibility we desire.
One additional step that will assist with long-term archival is generating static metadata and allowing the packaging and dependency systems to work outside of the Galaxy and Tool Shed applications. A package metadata catalog and a package format that provide descriptions of packages on a generic webserver, installable without a running Galaxy instance, are components that I believe are fairly important.

As for user-edited files, the env.sh files, which are generated at install time and then essentially untracked afterward, scare me a bit. I think it'd be useful for the packaging system to have a tighter concept of environment management.

These are just my opinions, of course, and are going to be very APT/dpkg-biased simply due to my experience with and favor for Debian-based distros and their dependency/package management, but I think there are useful concepts in this system (and others) that we can draw from. Along those lines, one more idea I had thrown out a while ago was coming up with a way to incorporate (or at least automatically process, so that we can convert to our format) the build definitions for other systems like MacPorts, BSD ports/pkgsrc, dpkg, rpm, etc., so that we can leverage the existing rules for building across our target platforms that have already been worked out by other package maintainers with more time. I think this aligns pretty well with Brad's thinking with CloudBioLinux, the difference in implementation being that we require multiple installable versions and platform independence.

I am a bit worried that as we go down the "repackage (almost) all dependencies" path (which I do think is the right path), we also run the risk of most of our packages being out of date. That's almost a guaranteed outcome when even the huge packaging projects (Debian, Ubuntu, etc.) are rife with out-of-date packages. So being able to incorporate upstream build definitions may help us package dependencies quickly.

--nate
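P.S. To make the static-catalog idea concrete, roughly what I'm imagining is a plain JSON document on any webserver, resolvable by a standalone client with no Galaxy or Tool Shed process involved. The URL and field names below are invented:

    import json
    import urllib2

    CATALOG_URL = "http://depot.example.org/catalog.json"  # hypothetical

    def resolve(name, version):
        """Look up a package in the static catalog; no Galaxy required."""
        catalog = json.load(urllib2.urlopen(CATALOG_URL))
        entry = catalog[name][version]
        # Each entry points at an archived tarball plus its checksum, so
        # even a plain shell script could install from the catalog.
        return entry["url"], entry["sha256"]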