Hi Peter,

On 01/02/2016 11:54 PM, Peter Cock wrote:


On Friday, 1 January 2016, Björn Grüning <bjoern.gruening@gmail.com> wrote:
Hi Galaxy developers,

this is an RFC to get the implementation details right for a new action
type in `tool_dependencies.xml`.

For years we have been trying to solve a crucial sustainability problem:
  **Non-sustainable links**!


A little bit of history
------------------------

At first we tried to [mirror
tarballs](https://github.com/bgruening/download_store) from sources of
questionable sustainability, such as BioC or random FTP servers.
But over time we encountered many more places that we cannot trust:
Google Code, SourceForge, etc.
We tried to mirror the entire BioC history by walking the SVN history
and creating a tarball for every revision, a Herculean task that is
still limited in scope, because there are so many other things that
need to be archived to make Galaxy and all tools sustainable.

In the end we settled on the simplest solution: provide a community
archive where everyone can drop tarballs that they want to be
sustainable. The Galaxy Project is generously funding the
storage, but we plan to mirror the archive and distribute the workload to
universities and other institutes that want to help.

The biggest problem we needed to solve was the access to the archive.
Who can drop tarballs? How do we control access to prevent abuse of this
system?

We went ahead and created the Cargo-Port:
    https://github.com/galaxyproject/cargo-port
Access will be controlled by the community via pull requests. Add your
package in a PR; we will check the content (hopefully) automatically, and
the tarball will be mirrored to a storage server.


RFC
---

So far so good. This RFC is about the usage of Cargo-Port inside of
Galaxy. I would like to propose a new action type that uses the
Cargo-Port directly. It should replace `<action type="download_by_url"
sha256sum="6387238383883...">` and `<action type="download_file">` and
offer a more transparent and user-friendly solution.
The current state of the art is quite cumbersome, since we need to
manually generate the checksum, provide the correct link,
and get the same information into Cargo-Port. I would like to streamline
this a little and use it as a good opportunity
to work on https://github.com/galaxyproject/galaxy/issues/896.


Proposal `<action type="download_by_proxy">`:
 * attributes for Id, Version, Platform, Architecture
 * no URL, no checksum
 * attribute for the URL to cargo-port/urls.tsv
   * default to the current github repo
   * configurable via galaxy.ini
 * this action will more or less trigger this curl command:
   `$ curl https://raw.githubusercontent.com/galaxyproject/cargo-port/master/gsl.py | python - --package_id augustus_3_1`
   * which gives us the freedom to change the API, columns ... in Cargo-Port
     without updating Galaxy core
   * the only API that needs to stay stable is `gsl`
 * `gsl` will try to download from the original URL, specified in
Cargo-Port. If this does not work we will download our archived one.
 * Changing the current working dir? Is this what we want, e.g.
   automatically uncompress and change cwd like `download_by_url` does?
   * We will need an attribute to disable uncompressing. A few tools need the
     tarballs as downloaded, without extraction.
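
To make the lookup-and-fallback idea above a bit more concrete, here is a rough Python sketch. It is not the actual `gsl` code: the column layout of `urls.tsv`, the naming scheme for archived files, and the helper names `find_entry` and `download_with_fallback` are all assumptions made purely for illustration.

```python
import csv
import io

# Hypothetical column layout; the real urls.tsv may differ.
COLUMNS = ["id", "version", "platform", "arch", "url", "sha256"]


def find_entry(tsv_text, package_id, version, platform, arch):
    """Return the first row matching all four attributes, or None."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), fieldnames=COLUMNS, delimiter="\t"
    )
    wanted = (package_id, version, platform, arch)
    for row in reader:
        if (row["id"], row["version"], row["platform"], row["arch"]) == wanted:
            return row
    return None


def download_with_fallback(entry, archive_base, fetch):
    """Try the upstream URL first, then the archived copy.

    `fetch` is any callable that takes a URL and returns bytes, raising
    OSError on failure. Returns a (url_used, data) tuple.
    """
    # Hypothetical naming scheme for the archived copy.
    archive_url = "{}/{}_{}".format(
        archive_base.rstrip("/"), entry["id"], entry["version"]
    )
    errors = []
    for url in (entry["url"], archive_url):
        try:
            return url, fetch(url)
        except OSError as exc:
            errors.append((url, exc))
    raise OSError("all sources failed: {}".format(errors))
```

In real use, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read()` (its `URLError` is a subclass of `OSError`). Passing it in as a parameter keeps the fallback logic testable without network access.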


Single Point of Failure - a small remark
----------------------------------------

Previously, Galaxy packages relied entirely on the kindness of upstream
to maintain existing packages indefinitely. Obviously not a sustainable
practice. Every time a tarball was moved, we had to hope one of us
retained a copy so that we could ensure reproducibility. With the advent
of the Cargo Port, we now maintain a complete, redundant copy of every
upstream tarball used in IUC and devteam repositories, additionally
adding sha256sums for every file to ensure download integrity. The
community is welcome to request that files they use in their packages be
added as well. We believe this will help combat the single point of
failure by providing at least one level of duplication. The Cargo Port
is considering plans to provide mirrors of itself at various
universities as another layer of redundancy.
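
As a small illustration of the integrity check mentioned above, a downloaded tarball can be compared against its recorded sha256sum with Python's standard `hashlib`. This is only a sketch; the helper name `sha256_matches` is made up here and is not part of Cargo Port.

```python
import hashlib


def sha256_matches(data, expected_hex):
    """Return True if the SHA-256 digest of `data` equals `expected_hex`."""
    actual = hashlib.sha256(data).hexdigest()
    # Tolerate surrounding whitespace and uppercase hex in the recorded value.
    return actual == expected_hex.strip().lower()
```

A download that fails this check, whether it came from upstream or from a mirror, should be discarded rather than installed.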


Thanks for reading and we appreciate any comments.

Eric, Nitesh & Bjoern

-- https://gist.github.com/bgruening/48297c27cd72cbadea7a


Maybe a question for Nitesh,

Would this replace or coexist with the related but narrower-in-scope
Bioarchive project?

Different scope, coexist.

Bioarchive

The Cargo Port



 https://bioarchive.galaxyproject.org/

Peter



-- 
Eric Rasche
Programmer II

Center for Phage Technology
Rm 312A, BioBio
Texas A&M University
College Station, TX 77843
404-692-2048
esr@tamu.edu