Towards Galaxy Linux (not) ? [was [RFC] Storing of tarballs and patches for tool_dependencies to enable reproducibility]
Hi Bjoern,

I can see man years of effort being spent on solving this problem within Galaxy. I was going to title this email "Danger, Will Robinson", but I didn't want to be disrespectful. I think the path being embarked upon, tool dependency packaging, tool versioning, reproducibility, and long term archive of source tarballs is going to lead inevitably to creation of a new Linux distribution, which I guess will be called Galaxy Linux.

The packaging and archival you are talking about is exactly the service provided by a Linux distribution. There's well established infrastructure to handle this, and years of experience have gone into solving the problems well. Surely the number of Linux distributions in the world now exceeds 100, but I don't see that the world will become a better place if we increase that number by one more.

We at AgResearch can't be alone in having to pick a Linux distribution to run from the short list supported by our hardware vendor. I can't see Galaxy Linux being on that list anytime soon. So we have to make Galaxy run on the particular distribution we have here. For us that's CentOS 6.

Now, I see scary mention of platform independence as a goal for Galaxy packaging, which I interpret as "will run on any Linux distribution". I think that's essentially infeasible. All you can do is write install scripts which you hope are portable (by following as many best practices as you know about), and then work patiently with users on strange platforms, to adapt each install script to work on that platform also. I think this is not a good use of anyone's time.

How many Linux distributions do the Galaxy community actually care about today? The RHEL family is surely important, as is Ubuntu LTS. Anything else? I'd be quite interested to understand this, as it provides a context for the discussion, and ensures we're not just solving a hypothetical problem.
I'm just starting work on a native packaging infrastructure for Galaxy, that will enable tool dependencies to use defined versions of natively installed packages. That frees me up to make my packages work nicely on the RHEL family. It looks like the RPMs themselves (including SRPMs obviously) will be hosted by the CentOS project before too long. Once they're there, they can easily be archived forever. Anyone else on that platform is welcome to use the same infrastructure. Then, all we really need is someone to handle the packaging effort for the other major Linux distributions (a small number, I hope), and the problem is essentially solved. Getting the Bio-Linux team interested in multi-version packaging would be a great next step.

I'll be posting here when I have progress to report on my native packaging effort.

cheers,
Simon

=======================================================================
Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately.
=======================================================================
On Wed, Sep 18, 2013 at 2:24 AM, Guest, Simon <Simon.Guest@agresearch.co.nz> wrote:
Hi Bjoern,
I can see man years of effort being spent on solving this problem within Galaxy. I was going to title this email "Danger, Will Robinson", but I didn't want to be disrespectful. I think the path being embarked upon, tool dependency packaging, tool versioning, reproducibility, and long term archive of source tarballs is going to lead inevitably to creation of a new Linux distribution, which I guess will be called Galaxy Linux.
It is potentially broader than that - some people are trying to cover Mac OS X as well, and there are already Galaxy installations which send jobs to Windows machines but that isn't something the Tool Shed currently tackles or aims to tackle (as far as I know).
The packaging and archival you are talking about is exactly the service provided by a Linux distribution. There's well established infrastructure to handle this, and years of experience have gone into solving the problems well. Surely the number of Linux distributions in the world now exceeds 100, but I don't see that the world will become a better place if we increase that number by one more.
Of course, but that isn't really what the Galaxy team want either.
We at AgResearch can't be alone in having to pick a Linux distribution to run from the short list supported by our hardware vendor. I can't see Galaxy Linux being on that list anytime soon. So we have to make Galaxy run on the particular distribution we have here. For us that's CentOS 6.
We are also using CentOS, which for a while was dictated by our IT department, but I think things are more flexible now. Given most (non-cloud) Galaxy installations will be connected to pre-existing clusters, rarely will the Galaxy administrators be in a position to dictate which flavour of Linux the cluster or grid should run. That is, Galaxy can't pick one Linux distribution as the only supported platform.
Now, I see scary mention of platform independence as a goal for Galaxy packaging, which I interpret as "will run on any Linux distribution". I think that's essentially infeasible. All you can do is write install scripts which you hope are portable (by following as many best practices as you know about), and then work patiently with users on strange platforms, to adapt each install script to work on that platform also. I think this is not a good use of anyone's time.
In general I agree it is an open-ended problem, and I have spent more of my time than I expected on this. However, in many cases it is quite feasible: either the authors of the tool being wrapped for Galaxy already provide neutral Linux binaries which should work on any recent distribution, or the tool uses a standard configure/make system for compiling with only 'core' header files needed.
How many Linux distributions do the Galaxy community actually care about today? The RHEL family is surely important, as is Ubuntu LTS. Anything else? I'd be quite interested to understand this, as it provides a context for the discussion, and ensures we're not just solving a hypothetical problem.
If you broaden that to the RHEL family (which includes CentOS) and the Debian family (which includes Ubuntu and Bio-Linux) then I suspect that is a majority.
I'm just starting work on a native packaging infrastructure for Galaxy, that will enable tool dependencies to use defined versions of natively installed packages. That frees me up to make my packages work nicely on the RHEL family. It looks like the RPMs themselves (including SRPMs obviously) will be hosted by the CentOS project before too long. Once they're there, they can easily be archived forever. Anyone else on that platform is welcome to use the same infrastructure. Then, all we really need is someone to handle the packaging effort for the other major Linux distributions (a small number, I hope), and the problem is essentially solved. Getting the Bio-Linux team interested in multi-version packaging would be a great next step.
If any major Linux distributions could handle multiple versions of tools installed in parallel via their packaging infrastructure it would be great - at least for open source tools. Non-open source tools would still be problematic and need either manual install or scripting of some kind, as now.
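As a rough illustration of what "multiple versions installed in parallel" can mean in practice, here is a minimal Python sketch. It assumes a hypothetical layout of `<root>/<name>/<version>/bin` per tool; the function names and the `/opt/deps` path are made up for this example and are not taken from Galaxy or from any distribution's packaging system.

```python
import os
from pathlib import Path


def versioned_prefix(root: Path, name: str, version: str) -> Path:
    """Return the install prefix for one tool version (hypothetical layout)."""
    return root / name / version


def env_for(root: Path, requirements: dict, base_env: dict) -> dict:
    """Build an environment whose PATH selects exactly the requested versions.

    `requirements` maps tool name -> version string. The versioned bin
    directories are placed ahead of whatever PATH the base environment has,
    so other installed versions of the same tool are never picked up first.
    """
    env = dict(base_env)
    bin_dirs = [
        str(versioned_prefix(root, name, version) / "bin")
        for name, version in sorted(requirements.items())
    ]
    existing = env.get("PATH", "")
    env["PATH"] = os.pathsep.join(bin_dirs + ([existing] if existing else []))
    return env


if __name__ == "__main__":
    job_env = env_for(
        Path("/opt/deps"),
        {"samtools": "0.1.19", "bwa": "0.7.5"},
        {"PATH": "/usr/bin"},
    )
    print(job_env["PATH"])
```

Each job gets its own environment, so two jobs can use different versions of the same tool on the same machine without touching the system-wide packages.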
I'll be posting here when I have progress to report on my native packaging effort.
cheers, Simon
That sounds promising.

Thank you,
Peter
Hi Simon, thank you very much for your comments!
I can see man years of effort being spent on solving this problem within Galaxy. I was going to title this email "Danger, Will Robinson", but I didn't want to be disrespectful. I think the path being embarked upon, tool dependency packaging, tool versioning, reproducibility, and long term archive of source tarballs is going to lead inevitably to creation of a new Linux distribution, which I guess will be called Galaxy Linux.
I'm not sure it is comparable to an entire Linux distribution, it's more like an Appstore, like pypi, bioconductor or gems, and yes that is reinvented somehow. I want to point out that the pool of bioinformatics applications is not so huge compared to an entire Linux distribution, and that many of them exist as pre-compiled binaries, which makes everything easier.
The packaging and archival you are talking about is exactly the service provided by a Linux distribution.
Sorry, maybe I was misleading. I only want a central storage for binaries/tarballs where the source cannot be trusted for the long term. 'Long term' and 'trusted' need to be defined in such a discussion here. I do not think we should copy python packages that are stored in pypi. We should make it as easy as possible to install them in our repository. If you do not trust pypi, we can offer a mirror. Same goes for gems. But what about projects that do not keep different versions of their packages available? We should have a central place to store them. UCSC tools for example. Easy to install, but we need to store them somewhere.
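One way to picture such a central store is as a fallback location that is only used when the upstream copy is gone or has changed. The Python sketch below is an illustration under stated assumptions, not a description of how the Tool Shed works: `fetch_verified`, the URL list and the injected `fetch` callable are all hypothetical names, and a recorded SHA-256 checksum stands in for "trusted".

```python
import hashlib
from typing import Callable, Iterable


def sha256_hex(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of a payload."""
    return hashlib.sha256(data).hexdigest()


def fetch_verified(
    urls: Iterable[str],
    expected_sha256: str,
    fetch: Callable[[str], bytes],
) -> bytes:
    """Try each URL in order; return the first payload whose checksum matches.

    `fetch` is injected so the download mechanism (HTTP, local file, ...)
    stays an assumption of the caller, not of this sketch.
    """
    errors = []
    for url in urls:
        try:
            data = fetch(url)
        except OSError as exc:  # covers network and file errors
            errors.append(f"{url}: {exc}")
            continue
        if sha256_hex(data) == expected_sha256:
            return data
        errors.append(f"{url}: checksum mismatch")
    raise RuntimeError("no trustworthy copy found: " + "; ".join(errors))
```

With the standard library, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read()`, with the upstream URL listed first and the archive copy second. The checksum covers both kinds of trust at once: a vanished upstream falls through to the archive, and a silently modified tarball is rejected.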
There's well established infrastructure to handle this, and years of experience have gone into solving the problems well.
Sure, we can learn from them, or use them.
Surely the number of Linux distributions in the world now exceeds 100, but I don't see that the world will become a better place if we increase that number by one more.
I'm not talking about a new Linux distribution. Galaxy is running everywhere: RHEL, OS X, SUSE, "whatever is used in the Amazon Cloud", and we need to run Galaxy on top of that.
We at AgResearch can't be alone in having to pick a Linux distribution to run from the short list supported by our hardware vendor. I can't see Galaxy Linux being on that list anytime soon. So we have to make Galaxy run on the particular distribution we have here. For us that's CentOS 6.
Sure, agree.
Now, I see scary mention of platform independence as a goal for Galaxy packaging, which I interpret as "will run on any Linux distribution". I think that's essentially infeasible.
I hope not :) We should define a minimal subset of dependencies a Galaxy system needs. Python, libz, gccX.Y, libfortran and so on, that's it. That can be understood as some kind of abstraction layer. If your distribution can offer it, Galaxy is supported; otherwise you have to provide such an abstraction layer for your system yourself.
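Such a minimal set could be checked mechanically before anything is installed. The following Python sketch shows the idea; the concrete command and library names (`python`, `gcc`, `z`, `gfortran`) are assumptions chosen for illustration, not an agreed Galaxy baseline, and `missing_requirements` is a hypothetical helper.

```python
import shutil
from ctypes.util import find_library

# Hypothetical baseline; a real list would have to be agreed on by the community.
REQUIRED_COMMANDS = ["python", "gcc"]
REQUIRED_LIBRARIES = ["z", "gfortran"]


def missing_requirements(
    commands=REQUIRED_COMMANDS,
    libraries=REQUIRED_LIBRARIES,
    which=shutil.which,
    find_lib=find_library,
):
    """Return the baseline items this system does not provide.

    `which` and `find_lib` default to the standard library lookups and can
    be replaced, for example to test the logic without touching the host.
    """
    missing = [f"command:{cmd}" for cmd in commands if which(cmd) is None]
    missing += [f"library:{lib}" for lib in libraries if find_lib(lib) is None]
    return missing


if __name__ == "__main__":
    problems = missing_requirements()
    if problems:
        print("Baseline not met:", ", ".join(problems))
    else:
        print("Baseline met")
```

A check like this only tests for presence, not for specific versions, so a "gccX.Y" style requirement would need an additional version probe.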
All you can do is write install scripts which you hope are portable (by following as many best practices as you know about), and then work patiently with users on strange platforms, to adapt each install script to work on that platform also. I think this is not a good use of anyone's time.
I see your point, but if we support only a minimal subset of requirements, that argument does not hold. Moreover, I do not expect that issue in Galaxyland. We are dealing with professional administrators/bioinformaticians running large clusters, not with desktop users. I hope the set of distributions that are really in use is small.
How many Linux distributions do the Galaxy community actually care about today? The RHEL family is surely important, as is Ubuntu LTS. Anything else?
Maybe a few Solaris systems, and do not forget OS X.
I'd be quite interested to understand this, as it provides a context for the discussion, and ensures we're not just solving a hypothetical problem.
I'm just starting work on a native packaging infrastructure for Galaxy, that will enable tool dependencies to use defined versions of natively installed packages. That frees me up to make my packages work nicely on the RHEL family. It looks like the RPMs themselves (including SRPMs obviously) will be hosted by the CentOS project before too long. Once they're there, they can easily be archived forever. Anyone else on that platform is welcome to use the same infrastructure. Then, all we really need is someone to handle the packaging effort for the other major Linux distributions (a small number, I hope), and the problem is essentially solved.
Sure, but the problem is not solved, is it? It's just transferred from your Linux metaphor to 'packaging formats'. How many different packaging formats do we have ... do we need a new one ...
Getting the Bio-Linux team interested in multi-version packaging would be a great next step.
I really think that is the important part! We need to convince people and cooperate to build a truly multi-version packaging system. We should have a look at Homebrew, sandboxed applications [1] and so on. But I think James, Greg and Co. have done that, and we should now make it possible to finally have a reproducible bioinformatics workbench.
I'll be posting here when I have progress to report on my native packaging effort.
That would be great. I really appreciate your thoughts and I know I might be a little bit too optimistic and idealistic.

Thanks,
Bjoern

[1] http://www.superlectures.com/guadec2013/sandboxed-applications-for-gnome
cheers, Simon
If I might chime in, I am a bit worried about all the automatic installation going on in galaxy, and it seems that the trend is to enhance this. A small R or python script calling into well known libraries that come from well known repositories (bioconductor etc… ) I can check. (Of course I install too much stuff from github, bioconductor etc… without checking).
I'm not sure it is comparable to an entire Linux distribution, it's more like an Appstore, like pypi, bioconductor or gems, and yes that is
The app stores are checked by Apple or Google for malicious code, and the apps are sandboxed. There are many eyes on python, bioconductor packages and gems because many more people interact with them directly compared to galaxy-tools.
Sorry, maybe I was misleading. I only want a central storage for binaries/tarballs where the source cannot be trusted for the long term. 'Long term' and 'trusted' need to be defined in such a discussion here. I do not think we should copy python packages that are stored in pypi. We should make it as easy as possible to install them in our repository. If you do not trust pypi, we can offer a mirror. Same goes for gems.
Trusted for me means I trust the source not to contain dangerous code. I trust pypi more than some mirror, bioconductor base packages more than some freshly published package that few people have used, tools from galaxy core developers more than from tool-shed etc… I know this is not the type of trust you were talking about.

best,
ido
Hi Ido,
If I might chime in, I am a bit worried about all the automatic installation going on in galaxy, and it seems that the trend is to enhance this. A small R or python script calling into well known libraries that come from well known repositories (bioconductor etc… ) I can check. (Of course I install too much stuff from github, bioconductor etc… without checking).
Yes, these are huge security concerns, and every admin is advised to check the code beforehand. In the case of binaries it's hard or not possible at all. That's one reason I want to discuss this issue.
I'm not sure it is comparable to an entire Linux distribution, it's more like an Appstore, like pypi, bioconductor or gems, and yes that is
The app stores are checked by Apple or Google for malicious code, and the apps are sandboxed. There are many eyes on python, bioconductor packages and gems because many more people interact with them directly compared to galaxy-tools.
Sure, the Galaxy Tool Shed is slowly getting there. The IUC (Intergalactic Utilities Commission) was founded at the end of 2012 and is meant to be something like a reviewing body for tools.
Sorry, maybe I was misleading. I only want a central storage for binaries/tarballs where the source cannot be trusted for the long term. 'Long term' and 'trusted' need to be defined in such a discussion here. I do not think we should copy python packages that are stored in pypi. We should make it as easy as possible to install them in our repository. If you do not trust pypi, we can offer a mirror. Same goes for gems.
Trusted for me means I trust the source not to contain dangerous code. I trust pypi more than some mirror, bioconductor base packages more than some freshly published package that few people have used, tools from galaxy core developers more than from tool-shed etc… I know this is not the type of trust you were talking about.
That is, it's twofold. One part is to trust the source not to infiltrate the system or do any harm; the other part is to trust the availability of the data. Both are important imho.

Cheers,
Bjoern
best, ido
participants (5)
- Bjoern Gruening
- Björn Grüning
- Guest, Simon
- Ido Tamir
- Peter Cock