Hi Simon, thank you very much for your comments!
I can see man-years of effort being spent on solving this problem within Galaxy. I was going to title this email "Danger, Will Robinson", but I didn't want to be disrespectful. I think the path being embarked upon (tool dependency packaging, tool versioning, reproducibility, and long-term archiving of source tarballs) is going to lead inevitably to the creation of a new Linux distribution, which I guess will be called Galaxy Linux.
I'm not sure it is comparable to an entire Linux distribution. It's more like an app store, like pypi, bioconductor or gems, and yes, that is reinvented to some extent. I want to point out that the pool of bioinformatics applications is not so huge compared to an entire Linux distribution, and that many of them exist as pre-compiled binaries, which makes everything easier.
The packaging and archival you are talking about is exactly the service provided by a Linux distribution.
Sorry, maybe I was misleading. I only want central storage for binaries/tarballs whose source cannot be trusted in the long term. 'Long term' and 'trusted' need to be defined in a discussion like this one. I do not think we should copy Python packages that are stored in pypi. We should make it as easy as possible to install them in our repository. If you do not trust pypi, we can offer a mirror. The same goes for gems. But what about packages whose sources do not keep different versions available? We should have a central place to store them. The UCSC tools, for example: easy to install, but we need to store them somewhere.
There's well established infrastructure to handle this, and years of experience have gone into solving the problems well.
Sure, we can learn from them, or use them.
Surely the number of Linux distributions in the world now exceeds 100, but I don't see that the world will become a better place if we increase that number by one more.
I'm not talking about a new Linux distribution. Galaxy is running everywhere: RHEL, OS X, SUSE, "whatever is used in the Amazon Cloud", and we need to run Galaxy on top of that.
We at AgResearch can't be alone in having to pick a Linux distribution to run from the short list supported by our hardware vendor. I can't see Galaxy Linux being on that list anytime soon. So we have to make Galaxy run on the particular distribution we have here. For us that's CentOS 6.
Sure, agree.
Now, I see scary mention of platform independence as a goal for Galaxy packaging, which I interpret as "will run on any Linux distribution". I think that's essentially infeasible.
I hope not :) We should define a minimal subset of dependencies that a Galaxy system needs: Python, libz, gccX.Y, libfortran and so on, and that's it. That can be understood as a kind of abstraction layer. If your distribution offers it, Galaxy is supported; otherwise you have to provide such an abstraction layer for your system yourself.
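As a rough illustration of the "minimal subset" idea, a baseline check could look like the sketch below. The lists BASELINE_COMMANDS and BASELINE_LIBRARIES and the function missing_baseline are hypothetical names invented for this example, and the entries are only guesses at what such a baseline might contain (for instance, "gfortran" is assumed to be the Fortran runtime meant above), not an agreed Galaxy requirement list.

```python
import ctypes.util
import shutil

# Example baseline only: an assumption for illustration,
# not an agreed list of Galaxy requirements.
BASELINE_COMMANDS = ["python", "gcc"]
BASELINE_LIBRARIES = ["z", "gfortran"]


def missing_baseline(commands=BASELINE_COMMANDS,
                     libraries=BASELINE_LIBRARIES,
                     which=shutil.which,
                     find_library=ctypes.util.find_library):
    """Return the baseline items that cannot be found on this system.

    `which` and `find_library` are injectable so the check can be
    exercised without depending on the host it runs on.
    """
    # Executables are looked up on PATH.
    missing = [cmd for cmd in commands if which(cmd) is None]
    # Shared libraries are looked up by their short name ("z" for libz).
    missing += ["lib" + lib for lib in libraries if find_library(lib) is None]
    return missing


if __name__ == "__main__":
    gaps = missing_baseline()
    if gaps:
        print("Baseline not satisfied, missing: " + ", ".join(gaps))
    else:
        print("Baseline satisfied.")
```

A distribution that passes such a check would count as supported; anything reported as missing is what the local administrator would have to provide.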
All you can do is write install scripts which you hope are portable (by following as many best practices as you know about), and then work patiently with users on strange platforms, to adapt each install script to work on that platform also. I think this is not a good use of anyone's time.
I see your point, but if we only require a minimal subset of dependencies, that argument does not hold. Moreover, I do not expect that issue in Galaxyland. We are dealing with professional administrators and bioinformaticians running large clusters, not with desktop users. I hope the set of distributions that are really in use is small.
How many Linux distributions does the Galaxy community actually care about today? The RHEL family is surely important, as is Ubuntu LTS. Anything else?
Maybe a few Solaris systems, and do not forget OS X.
I'd be quite interested to understand this, as it provides a context for the discussion, and ensures we're not just solving a hypothetical problem.
I'm just starting work on a native packaging infrastructure for Galaxy that will enable tool dependencies to use defined versions of natively installed packages. That frees me up to make my packages work nicely on the RHEL family. It looks like the RPMs themselves (including SRPMs, obviously) will be hosted by the CentOS project before too long. Once they're there, they can easily be archived forever. Anyone else on that platform is welcome to use the same infrastructure. Then all we really need is someone to handle the packaging effort for the other major Linux distributions (a small number, I hope), and the problem is essentially solved.
Sure, but the problem is not solved, is it? It's just transferred from your Linux metaphor to 'packaging formats'. How many different packaging formats do we have ... do we need a new one ...
Getting the Bio-Linux team interested in multi-version packaging would be a great next step.
I really think that is the important part! We need to convince people and cooperate to build a true multi-version packaging system. We should have a look at Homebrew, sandboxed applications [1] and so on. But I think James, Greg and Co. have done that, and we should now make it possible to finally have a reproducible bioinformatics workbench.
I'll be posting here when I have progress to report on my native packaging effort.
That would be great. I really appreciate your thoughts, and I know I might be a little bit too optimistic and idealistic.
Thanks,
Bjoern

[1] http://www.superlectures.com/guadec2013/sandboxed-applications-for-gnome
cheers, Simon