On Tue, Sep 17, 2013 at 10:05 AM, Bjoern Gruening <bjoern.gruening@gmail.com> wrote:
Hi,
I want to start a discussion about how we store tarballs so that their availability is guaranteed to some degree. Currently, I store most of my tarballs in my GitHub account if I do not trust the official FTP/HTTP server. Does anyone have the same problems/concerns? Is there an official guideline/recommendation?
James raised that topic in one thread ... and I really want to see that happen, to some extent:
http://dev.list.galaxyproject.org/tool-dependencies-xml-format-tp4661410p466...
I know it is an ambiguous task, but if Galaxy is to be a reproducible system we need to think about that issue, discuss it, and make a clear statement about how far we want to go, what is feasible and what is not. Downstream it has many implications, already now for a few IUC members. For example, it is hard to tell tool developers about reproducible tool_dependencies if no clear statement has ever been made.
A few problems that I encountered during tool development:
- 'no stable links': tarballs on a 'lab' website will change their links, or old versions of tarballs will be deleted
On the timescale of years I've seen that happen. Just recently, for example, the NumPy team removed the files for some beta releases and release candidates from their SourceForge download page: http://mail.scipy.org/pipermail/numpy-discussion/2013-September/067690.html
- GitHub: I'm really not sure if the 'raw' API I use for fetching single files or tarballs from my GitHub account is stable and will remain. I also think I cannot put GBs of tarballs in my GitHub account, but currently it's the best option I have
A related issue with pointing at specific GitHub (or BitBucket) commits is that sometimes a project rewrites its repository (although, to be clear, this is bad practice and should be rare).
- Sometimes you need to apply patches, these need to be stored somewhere.
See my reply below about using the current Tool Shed system.
- If I store arbitrary tarballs in my GitHub account and the installation routine in Galaxy, the users of my tools need a huge level of trust in my work. Moreover, the IUC can hardly control that (md5 checksums next to each tarball?)
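To make the checksum idea from the last point concrete, here is a minimal sketch of how an installer could check a downloaded tarball against a checksum published next to it. The function names are hypothetical and this is not an existing Galaxy or Tool Shed API. It uses md5 only because that is what is mentioned above; passing "sha256" as the algorithm would be the stronger choice.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large tarballs need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarball(path, expected_hex, algorithm="md5"):
    """Return True if the file's digest matches the published checksum."""
    # Published checksums are sometimes upper case or carry stray whitespace.
    return file_checksum(path, algorithm) == expected_hex.strip().lower()
```

An installation routine could call verify_tarball() right after the download and refuse to continue on a mismatch, which would at least detect a tarball that was silently replaced upstream.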
In my opinion we need a central storage where we can put our tarballs and so on (mirrored ...).
Some ideas:
- Two separate Tool Shed areas for one account: 1. version controlled 2. non-version controlled for tarballs and redirection files/rules, to redirect old links, maybe even redirect old repositories to new ones (assuming the history is the same and so on?)
- An FTP server with a few limitations, like file size and authentication, to make illegal file sharing harder.
- Ask the GitHub guys whether they are willing to support us?
- Build on top of Open Data initiatives, like the Open Data Portal in Switzerland: http://www.bar.admin.ch/themen/01648/?lang=en
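The redirection rules in the first idea could be as simple as a table mapping old links to new ones. A rough sketch, assuming the rules are available as a plain dictionary (the function name and the example URLs in the comments are made up for illustration):

```python
def resolve_url(url, redirects, max_hops=10):
    """Follow redirect rules until a URL with no further rule is reached.

    `redirects` maps an old URL to its replacement, for example
    {"http://example.org/old/tool-1.0.tar.gz":
     "http://example.org/archive/tool-1.0.tar.gz"} (hypothetical URLs).
    """
    seen = set()
    while url in redirects:
        # Guard against rules that point back at each other or chain too far.
        if url in seen or len(seen) >= max_hops:
            raise ValueError("redirect loop or too many hops at %r" % url)
        seen.add(url)
        url = redirects[url]
    return url
```

A URL without a matching rule is returned unchanged, so existing tool_dependencies definitions would keep working as long as their links are still valid.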
Any comments, ideas? Cheers, Bjoern
I see some parallels with the Galaxy egg cache, and also with other data files which the Galaxy team is hosting. These are all centrally managed by the Galaxy team, which is a bottleneck. Patches and even smaller 3rd party tarballs could easily be included in the Tool Shed repository, were it not for the current restriction that a tool dependency definition may only hold a single tool_dependencies.xml file. Regards, Peter