On Friday, 1 January 2016, Björn Grüning <bjoern.gruening@gmail.com> wrote:
Hi Galaxy developers,
This is an RFC to get the implementation details right for a new action type in `tool_dependencies.xml`.
For years we have been trying to solve a crucial sustainability problem: **non-sustainable links**!

A little bit of history
------------------------
At first we tried to [mirror tarballs](https://github.com/bgruening/download_store) from sources of questionable sustainability, like BioC or random FTP servers. Over time we encountered many more places that we cannot trust: Google Code, SourceForge, etc. We tried to mirror the entire BioC history by walking the SVN history and creating a tarball for every revision. That was a Herculean task, yet still limited in scope, because so many other things need to be archived to make Galaxy and all its tools sustainable.
In the end we settled on the simplest solution: provide a community archive where everyone can drop tarballs that they want to be sustainable. The Galaxy Project is generously funding the storage, but we plan to mirror the archive and distribute the workload to universities and other institutes that want to help.
The biggest problem we needed to solve was the access to the archive. Who can drop tarballs? How do we control access to prevent abuse of this system?
We went ahead and created the Cargo-Port: https://github.com/galaxyproject/cargo-port. Access will be controlled by the community, via pull requests. Add your package and we will check the content (hopefully automatically), and the tarball will be mirrored to a storage server.

RFC
---
So far so good. This RFC is about the usage of Cargo-Port inside of Galaxy. I would like to propose a new action type that uses the Cargo-Port directly. It should replace `<action type="download_by_url" sha256sum="6387238383883...">` and `<action type="download_file">` and offer a more transparent and user-friendly solution. The current approach is quite cumbersome, since we need to generate the checksum manually, provide the correct link, and get the same information into Cargo-Port. I would like to streamline this a little and use it as a good opportunity to work on and fix https://github.com/galaxyproject/galaxy/issues/896.
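For reference, the manual checksum step mentioned above can be scripted. This is only a minimal sketch using Python's standard `hashlib`; the helper name `sha256_of_file` is made up for illustration and is not part of Galaxy or Cargo-Port.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks.

    Hypothetical helper: produces a value of the kind that goes into the
    sha256sum attribute of a download_by_url action.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large tarballs do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha256_of_file("some-package.tar.gz")` on a downloaded tarball (the file name here is a placeholder) returns the hex string to paste into the XML.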
Proposal `<action type="download_by_proxy">`:

* attributes for Id, Version, Platform, Architecture
* no URL, no checksum
* an attribute for the URL to cargo-port/urls.tsv
  * defaults to the current GitHub repo
  * configurable via galaxy.ini
* this action will more or less trigger this curl command: `$ curl https://raw.githubusercontent.com/galaxyproject/cargo-port/master/gsl.py | python - --package_id augustus_3_1`
  * which gives us the freedom to change the API, columns ... in Cargo-Port without updating Galaxy core
  * the only API that needs to stay stable is `gsl`
* `gsl` will try to download from the original URL specified in Cargo-Port. If this does not work, we will download our archived copy.
* Changing the current working dir? Is this what we want, e.g. automatically uncompress and change cwd like `download_by_url`?
* We will need an attribute to not uncompress. A few tools need the tarballs left unextracted.
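To make the intended fallback behaviour concrete, here is a rough sketch of "try the original URL first, then the archived copy, and only accept a payload whose checksum matches". It is not the actual `gsl` code; the function names, the injectable `fetch` parameter, and the error handling are all assumptions made for illustration.

```python
import hashlib
import urllib.request
from typing import Callable, Iterable


def fetch_url(url: str, timeout: float = 30.0) -> bytes:
    """Download a URL and return its body (plain urllib, no retries)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_with_fallback(
    urls: Iterable[str],
    expected_sha256: str,
    fetch: Callable[[str], bytes] = fetch_url,
) -> bytes:
    """Return the first payload whose SHA-256 matches, trying urls in order.

    Hypothetical sketch: `urls` would be the upstream URL followed by the
    Cargo-Port mirror URL; `fetch` is injectable so the logic can be
    exercised without network access.
    """
    errors = []
    for url in urls:
        try:
            data = fetch(url)
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            errors.append(f"{url}: {exc}")
            continue
        digest = hashlib.sha256(data).hexdigest()
        if digest == expected_sha256.lower():
            return data
        # A reachable source with the wrong content is treated like a failure.
        errors.append(f"{url}: checksum mismatch ({digest})")
    raise RuntimeError("all sources failed:\n" + "\n".join(errors))
```

Checking the digest on every source, not just the mirror, means a silently replaced upstream tarball also falls through to the archived copy.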

Single Point of Failure - a small remark
----------------------------------------
Previously, Galaxy packages relied entirely on the kindness of upstream to maintain existing packages indefinitely. This is obviously not a sustainable practice. Every time a tarball was moved, we had to hope one of us had retained a copy so that we could ensure reproducibility. With the advent of the Cargo Port, we now maintain a complete, redundant copy of every upstream tarball used in IUC and devteam repositories, and we add a sha256sum for every file to ensure download integrity. The community is welcome to request that files they use in their packages be added as well. We believe this will help combat the single point of failure by providing at least one level of duplication. We are also considering plans to mirror the Cargo Port at various universities as another layer of redundancy.
Thanks for reading and we appreciate any comments.
Eric, Nitesh & Bjoern

Maybe a question for Nitesh:
Would this replace or coexist with the related but narrower-in-scope Bioarchive project? https://bioarchive.galaxyproject.org/
Peter