Better packaging for toolshed binaries
Dear Galaxians,

This email is about difficulties with the current approach for installing tool dependency binaries from the Galaxy Toolshed, and what might be done to improve the situation. It comes down to this: packaging software to run on different systems is tricky. It is a problem that has been solved by various Linux distributions with their packaging systems (RPM, deb, etc.), and package archives. The Galaxy Toolshed is trying to solve this problem again, but so far it doesn't work very well. There must be something better we can do.

Since gaining a better understanding from the Galaxy Community Conference of what the Toolshed is trying to do (versioned tools, reproducibility), I have been working on switching over from locally installed tools to Toolshed versions. However, it has not gone well, and I think I am about to revert to my previous approach. Here's the problem: building software from source on any system requires certain tweaks to the build process which are dependent on the target platform. An example is the NCBI BLAST+ suite, which failed to build on my (EL6) system, because it couldn't run /usr/bin/touch. That's pretty dumb, and pretty simple to solve in isolation - it needs to be running /bin/touch instead.

But the general point is this: it's not feasible (i.e. too much work, too hard) to produce build scripts to build software from source that work on any platform, even the common ones. Packaging source code for a given platform is a non-trivial task. The RPM and deb packagers are doing a good job here. It's a significant amount of work. I know that, as I've been packaging bioinformatics software as binary RPMs for EL6 for 18 months or so now, and have done nearly 300 packages.

What do we want? Simply to be able to install a given version of some software, and all its dependencies, with a single click, or a single command, and have it Just Work (tm). It's the dependencies that make this hard. Things get installed in different ways on different systems. Does your platform need #include <bam.h>, or #include <bam/bam.h>? If the former, then you'll have to patch tophat, say, (in a trivial way) before building it. I think this is simply too hard to do by embedding some commands and conditionals in Toolshed XML build files.

It seems to me that a number of people out there are currently having some issues installing tool dependencies from the Toolshed, because things are not building as expected. I think it's much easier for just one person to troubleshoot why things go wrong when they are packaging the software for a given platform, rather than for each end user (Galaxy admin) to wonder why a tool failed to install.

So, what to do? My starting point is that I have packaged a large amount of bioinformatics software for EL6, which is freely available at http://rpm.agresearch.co.nz/. I'm after some Galaxy tool wrappers for the tools that we use here at AgResearch, which can simply make use of packages installed from this repo.

Is there any interest in exploring the merits or otherwise of this approach in the Galaxy community?

cheers,
Simon

Simon Guest
Senior UNIX Technical Consultant
AgResearch, New Zealand

=======================================================================
Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately.
=======================================================================
On Thu, Aug 29, 2013 at 5:45 AM, Guest, Simon <Simon.Guest@agresearch.co.nz> wrote:
Dear Galaxians,
This email is about difficulties with the current approach for installing tool dependency binaries from the Galaxy Toolshed, and what might be done to improve the situation. It comes down to this: packaging software to run on different systems is tricky. It is a problem that has been solved by various Linux distributions with their packaging systems (RPM, deb, etc.), and package archives. The Galaxy Toolshed is trying to solve this problem again, but so far it doesn't work very well. There must be something better we can do.
I agree with you, and as more people try to package their tools and the dependencies, I think more will too :(
Since gaining a better understanding from the Galaxy Community Conference of what the Toolshed is trying to do (versioned tools, reproducibility), I have been working on switching over from locally installed tools to Toolshed versions. However, it has not gone well, and I think I am about to revert to my previous approach. Here's the problem: building software from source on any system requires certain tweaks to the build process which are dependent on the target platform. An example is the NCBI BLAST+ suite, which failed to build on my (EL6) system, because it couldn't run /usr/bin/touch. That's pretty dumb, and pretty simple to solve in isolation - it needs to be running /bin/touch instead.
Can we continue this specific example here?
http://lists.bx.psu.edu/pipermail/galaxy-dev/2013-August/015890.html
...
http://lists.bx.psu.edu/pipermail/galaxy-dev/2013-August/016287.html

Short answer: yes, I know. A new install XML process is being used on the Test Tool Shed which fixes this (but breaks in a not yet understood way on the Galaxy team's test cluster); it is awaiting release to the main Tool Shed.
But the general point is this: it's not feasible (i.e. too much work, too hard) to produce build scripts to build software from source that work on any platform, even the common ones. Packaging source code for a given platform is a non-trivial task. The RPM and deb packagers are doing a good job here. It's a significant amount of work. I know that, as I've been packaging bioinformatics software as binary RPMs for EL6 for 18 months or so now, and have done nearly 300 packages.
What do we want? Simply to be able to install a given version of some software, and all its dependencies, with a single click, or a single command, and have it Just Work (tm). It's the dependencies that make this hard. Things get installed in different ways on different systems. Does your platform need #include <bam.h>, or #include <bam/bam.h>? If the former, then you'll have to patch tophat, say, (in a trivial way) before building it. I think this is simply too hard to do by embedding some commands and conditionals in Toolshed XML build files.
Indeed - "nice" tools being packaged will have something like a ./configure script to take care of that, but not all :(
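As a rough illustration of the kind of check a ./configure script performs for the bam.h case above, the sketch below looks through a list of include directories and reports which #include form would resolve. The function name and its arguments are hypothetical, and a real configure script would normally try compiling a small test program rather than only checking that the file exists.

```python
from pathlib import Path
from typing import Iterable, Optional, Sequence


def find_include_form(include_dirs: Iterable[str],
                      candidates: Sequence[str] = ("bam.h", "bam/bam.h")
                      ) -> Optional[str]:
    """Return the first #include form whose header exists in any include dir.

    Candidates are tried in order, so "bam.h" wins if both layouts exist.
    Returns None when no candidate is found anywhere.
    """
    dirs = [Path(d) for d in include_dirs]
    for candidate in candidates:
        if any((d / candidate).is_file() for d in dirs):
            return f"#include <{candidate}>"
    return None
```

A build recipe could call this once per platform and patch the source only when the result differs from what the upstream code expects.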
It seems to me that a number of people out there are currently having some issues installing tool dependencies from the Toolshed, because things are not building as expected. I think it's much easier for just one person to troubleshoot why things go wrong when they are packaging the software for a given platform, rather than for each end user (Galaxy admin) to wonder why a tool failed to install.
So, what to do? My starting point is that I have packaged a large amount of bioinformatics software for EL6, which is freely available at http://rpm.agresearch.co.nz/. I'm after some Galaxy tool wrappers for the tools that we use here at AgResearch, which can simply make use of packages installed from this repo.
Is there any interest in exploring the merits or otherwise of this approach in the Galaxy community?
There is a similar but probably larger set of Debian packages available via Debian-Med and Bio-Linux too. The catch here is: can you install arbitrary versions of a tool in parallel? And I think the answer, sadly, is no.

The idea of standard recipe templates (e.g. typical Python install) James outlined here might help:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2013-August/016273.html

Peter
On Thu, Aug 29, 2013 at 3:36 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
There is a similar but probably larger set of Debian packages available via Debian-Med and Bio-Linux too. The catch here is: can you install arbitrary versions of a tool in parallel? And I think the answer, sadly, is no.
This is the crucial concern for us. The standard OS packaging approaches (RPM and DEB) support this only very poorly. This is something we absolutely need. There are other package managers that do a better job (I'm quite fond of Homebrew on OS X, and Nix also looks nice), but they would add more dependencies.
The idea of standard recipe templates (e.g. typical Python install) James outlined here might help: http://lists.bx.psu.edu/pipermail/galaxy-dev/2013-August/016273.html
I (as always) think a lot can be solved through abstraction. I'm envisioning a very high-level description of what it takes to install a package, with different adapters that take that description and install it for a given OS.

--
James Taylor, Associate Professor, Biology/CS, Emory University
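One way to picture the abstraction described above is a small, OS-independent package description plus one adapter per platform. This is only a sketch under stated assumptions: the class, the adapter functions and the platform keys are all hypothetical, and the generated commands assume that a repository for the platform actually carries a package with exactly that name and version (Debian version strings, for instance, usually include a revision suffix).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PackageSpec:
    """High-level, OS-independent description of what to install."""
    name: str
    version: str


# An adapter turns a PackageSpec into the commands for one platform.
Adapter = Callable[[PackageSpec], List[str]]


def rpm_adapter(spec: PackageSpec) -> List[str]:
    # Assumes a yum repository provides <name>-<version>.
    return [f"yum install -y {spec.name}-{spec.version}"]


def deb_adapter(spec: PackageSpec) -> List[str]:
    # Assumes an apt repository provides <name> at exactly <version>.
    return [f"apt-get install -y {spec.name}={spec.version}"]


# Hypothetical platform keys; a real system would detect the platform.
ADAPTERS: Dict[str, Adapter] = {"el6": rpm_adapter, "debian": deb_adapter}


def install_commands(spec: PackageSpec, platform: str) -> List[str]:
    """Return the install commands for spec on the given platform."""
    try:
        adapter = ADAPTERS[platform]
    except KeyError:
        raise ValueError(f"no adapter for platform {platform!r}") from None
    return adapter(spec)
```

The point of the design is that the description stays the same while the per-platform knowledge lives in the adapters, so supporting another OS means adding an adapter rather than editing every recipe.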
There are possibilities here, similar to things I've already been doing in my RPM packaging. If you want to install multiple versions side by side, when you (or more likely, me) are making the packages, you just make the version number part of the package name, and install it out of the way somewhere (e.g. /usr/libexec/tophat-2.0.9, rather than /usr/bin). Then, the package can provide a versioned environment module as per http://modules.sourceforge.net/. There could be a non-versioned environment module which just gives you the latest and greatest version. So:

$ module load tophat/2.0.9
# now that version is on the path

# start again ...
$ module load tophat
# the latest and greatest tophat becomes available

We've been using this to provide multiple versions of small tools, but also bigger things like a version of Python more recent than the system one. (Software Collections may be better for the latter though - https://access.redhat.com/site/documentation/en-US/Red_Hat_Developer_Toolset...)

I'm willing to explore the feasibility of overhauling the AgResearch RPM repo to support multiple versions of packages in this or a similar way if there's interest. There's clearly value in being able to select what version of a tool you run, if it can be done in a way that doesn't encumber those who just want to run a recent good version.

Is there interest in this approach? (Note: I'm not committing to doing it just yet.)

cheers,
Simon
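The selection rule described above (an explicit version if one is requested, otherwise the latest) can be sketched as follows. This is not how Environment Modules is implemented; it is a hypothetical helper that works on directory names of the form <tool>-<version>, like the /usr/libexec/tophat-2.0.9 example, and it assumes purely numeric dotted versions.

```python
from typing import Iterable, Optional, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '2.0.9' into (2, 0, 9) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve(tool: str, installed: Iterable[str],
            version: Optional[str] = None) -> str:
    """Pick an install directory name such as 'tophat-2.0.9'.

    installed: directory names found under the install prefix.
    version:   exact version wanted, or None for the latest one.
    Assumes every '<tool>-...' entry has a numeric dotted version.
    """
    prefix = tool + "-"
    versions = [name[len(prefix):] for name in installed
                if name.startswith(prefix)]
    if version is not None:
        if version not in versions:
            raise LookupError(f"{tool} {version} is not installed")
        return prefix + version
    if not versions:
        raise LookupError(f"no versions of {tool} are installed")
    return prefix + max(versions, key=version_key)
```

Comparing versions as integer tuples rather than strings matters here: with plain string comparison a version ending in .9 would sort above one ending in .10.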
On Thu, Aug 29, 2013 at 4:17 PM, Guest, Simon <Simon.Guest@agresearch.co.nz> wrote:

There are possibilities here, similar to things I've already been doing in my RPM packaging. If you want to install multiple versions side by side, when you (or more likely, me) are making the packages, you just make the version number part of the package name, and install it out of the way somewhere (e.g. /usr/libexec/tophat-2.0.9, rather than /usr/bin). Then, the package can provide a versioned environment module as per http://modules.sourceforge.net/. There could be a non-versioned environment module which just gives you the latest and greatest version. So:

This is the same thing that the tool shed does. From Greg:

"Following best practices, repositories of type Tool dependency definition are named something like package_<name>_<version> (e.g., package_amos_3_1_0, package_ape_3_0, package_atlas_3_10, etc) and are contained in the Tool Dependency Packages category in the Tool Shed. The name of the repository contains the package name as well as the version because the contents of the repository must contain only the recipe for installing that specific version of that package. If a new version (say 3.1) of the ape package is introduced some time in the future, then a new repository named package_ape_3_1 should be created to contain the recipe for installing that version."

Tool dependency definition repositories may only have one installable revision.

The Tool Shed has some advantages over OS packages, but I do not understand why some people count the handling of multiple versions among them.
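The naming convention in Greg's description can be written down as a one-line rule. The helper below is hypothetical and is not part of any Tool Shed API; the rule (dots in the version become underscores) is inferred only from the three examples quoted above.

```python
def tool_dependency_repo_name(name: str, version: str) -> str:
    """Build a repository name like 'package_ape_3_0' from ('ape', '3.0').

    Inferred from the quoted examples only: 'package_' + name + '_' +
    version, with dots in the version replaced by underscores.
    """
    return f"package_{name}_{version.replace('.', '_')}"
```

Because the version is part of the name, each version of a package gets its own repository, which is the same idea as putting the version in an RPM package name.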
participants (4)
- Guest, Simon
- James Taylor
- John Chilton
- Peter Cock