[RFC] Storing of tarballs and patches for tool_dependencies to enable reproducibility
Hi,

I want to start a discussion about the storage of tarballs, to guarantee their availability to some degree. Currently, I store most of my tarballs in my github account if I do not trust the official ftp/http server. Does anyone have the same problems/concerns? Is there an official guideline/recommendation?

James raised that topic in one thread ... and I really want to see that happen, to some extent: http://dev.list.galaxyproject.org/tool-dependencies-xml-format-tp4661410p466...

I know it is an ambiguous task, but if Galaxy is to be a reproducible system we need to think about that issue, discuss it and make a clear statement about how far we want to go, what is feasible and what is not. Downstream it has many implications, already now for a few IUC members. For example, it is hard to tell tool developers about reproducible tool_dependencies if no clear statement is ever made.

A few problems that I encountered during tool development:
- 'no stable links': tarballs on a 'lab'-website will change their links, or old versions of tarballs get deleted
- github: I'm really not sure if the 'raw' API I use for fetching single files or tarballs from my github account is stable and will remain. I also think I cannot put GBs of tarballs in my github account, but currently it's the best option I have
- Sometimes you need to apply patches, and these need to be stored somewhere.
- If I store arbitrary tarballs in my github account and the installation routine in Galaxy, the users of my tools need a huge level of trust in my work. Moreover, the IUC can hardly control that (md5 checksums, next to each tarball?)

In my opinion we need a central storage, where we can put our tarballs and so on (mirrored ...).

Some ideas:
- Two separated tool shed areas for one account: 1. version controlled 2. non-version controlled for tarballs and redirection files/rules, to redirect old links, maybe even redirect old repositories to new ones (assuming the history is the same and so on?)
- FTP Server with a few limitations, like file-size and authentication, to make illegal file sharing harder.
- Ask the github guys if they are willing to support us?
- Build on top of Open Data initiatives, like the Open Data Portal in Switzerland: http://www.bar.admin.ch/themen/01648/?lang=en

Any comments, ideas?

Cheers, Bjoern
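The "md5 checksums, next to each tarball" idea mentioned in the problem list could look roughly like the following. This is only a minimal sketch, not anything Galaxy or the Tool Shed provides: the function names `file_md5` and `verify_tarball` are hypothetical, and it assumes the expected checksum is published as a hex string alongside the tarball.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarball(path, expected_md5):
    """Hypothetical helper: True if the file matches the published checksum."""
    return file_md5(path) == expected_md5.strip().lower()
```

md5 is used here only because it is what the thread mentions; `hashlib.sha256` has the same interface and would be a drop-in replacement if a stronger hash is wanted.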
On Tue, Sep 17, 2013 at 10:05 AM, Bjoern Gruening <bjoern.gruening@gmail.com> wrote:
Hi,
I want to start a discussion about the storage of tarballs, to guarantee their availability to some degree. Currently, I store most of my tarballs in my github account if I do not trust the official ftp/http server. Does anyone have the same problems/concerns? Is there an official guideline/recommendation?
James raised that topic in one thread ... and I really want to see that happen, to some extent:
http://dev.list.galaxyproject.org/tool-dependencies-xml-format-tp4661410p466...
I know it is an ambiguous task, but if Galaxy is to be a reproducible system we need to think about that issue, discuss it and make a clear statement about how far we want to go, what is feasible and what is not. Downstream it has many implications, already now for a few IUC members. For example, it is hard to tell tool developers about reproducible tool_dependencies if no clear statement is ever made.
A few problems that I encountered during tool development:
- 'no stable links': tarballs on a 'lab'-website will change their links, or old versions of tarballs get deleted
On the timescale of years I've seen that happen. Just recently, for example, the NumPy team removed the files for some beta and release candidates from their SourceForge download page: http://mail.scipy.org/pipermail/numpy-discussion/2013-September/067690.html
- github: I'm really not sure if the 'raw' API I use for fetching single files or tarballs from my github account is stable and will remain. I also think I cannot put GBs of tarballs in my github account, but currently it's the best option I have
A related issue with pointing at specific GitHub (or BitBucket) commits is that a project sometimes rewrites its repository history (although, to be clear, this is bad practice and should be rare).
- Sometimes you need to apply patches, these need to be stored somewhere.
See my reply below about using the current Tool Shed system.
- If I store arbitrary tarballs in my github account and the installation routine in Galaxy, the users of my tools need a huge level of trust in my work. Moreover, the IUC can hardly control that (md5 checksums, next to each tarball?)
In my opinion we need a central storage, where we can put our tarballs and so on (mirrored ...).
Some ideas:
Two separated tool shed areas for one account: 1. version controlled 2. non-version controlled for tarballs and redirection files/rules, to redirect old links, maybe even redirect old repositories to new ones (assuming the history is the same and so on?)
FTP Server with a few limitations, like file-size and authentication, to make illegal file sharing harder.
Ask the github guys if they are willing to support us?
Build on top of Open Data initiatives, like the Open Data Portal in Switzerland: http://www.bar.admin.ch/themen/01648/?lang=en
Any comments, ideas? Cheers, Bjoern
I see some parallels with the Galaxy egg cache, and also with other data files which the Galaxy team is hosting. These are all centrally managed by the Galaxy team, which is a bottleneck.

Patches and even smaller 3rd party tarballs could easily be included in the Tool Shed repository, except for the current restriction that a Tool dependency definition may only hold a single tool_dependencies.xml file.

Regards, Peter
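The "redirection files/rules" idea from Bjoern's list, where old links are redirected to new locations, could be sketched as a simple lookup table that is followed until a stable URL is reached. This is an illustration under assumptions only: the function `resolve_url`, the table format and the example.org URLs are all hypothetical, and nothing like this exists in the Tool Shed as far as this thread states.

```python
def resolve_url(url, redirects, max_hops=10):
    """Follow entries in a redirect table until a URL with no rule is reached.

    `redirects` maps an old URL to its replacement (hypothetical format).
    Raises ValueError on a redirect loop or when more than `max_hops`
    redirects would be needed.
    """
    seen = {url}
    for _ in range(max_hops + 1):
        target = redirects.get(url)
        if target is None:
            return url  # no rule for this URL, so it is the final location
        if target in seen:
            raise ValueError("redirect loop detected at " + target)
        seen.add(target)
        url = target
    raise ValueError("too many redirects")


# Hypothetical rules: a lab website moved its tarball twice.
rules = {
    "http://lab.example.org/tool-1.0.tar.gz": "http://lab.example.org/old/tool-1.0.tar.gz",
    "http://lab.example.org/old/tool-1.0.tar.gz": "http://mirror.example.org/tool-1.0.tar.gz",
}
final_url = resolve_url("http://lab.example.org/tool-1.0.tar.gz", rules)
```

Keeping the rules as plain data would let them live in the non-version-controlled area Bjoern describes, separate from the tool definitions that reference the original links.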
I've had occasion to use some of the samtools misc utilities in galaxy tools. Should those also be copied to the $INSTALL_DIR/bin when package_samtools is installed? Thanks, JJ -- James E. Johnson, Minnesota Supercomputing Institute, University of Minnesota
Hi JJ,
I've had occasion to use some of the samtools misc utilities in galaxy tools. Should those also be copied to the $INSTALL_DIR/bin when package_samtools is installed?
Sorry, I do not get the question :( Are you asking to enhance the existing samtools definition?
Thanks,
JJ
-- James E. Johnson, Minnesota Supercomputing Institute, University of Minnesota _______________________________________________ galaxy-iuc mailing list galaxy-iuc@lists.bx.psu.edu http://lists.bx.psu.edu/listinfo/galaxy-iuc
On 9/17/13 7:08 AM, Bjoern Gruening wrote:
Hi JJ,
I've had occasion to use some of the samtools misc utilities in galaxy tools. Should those also be copied to the $INSTALL_DIR/bin when package_samtools is installed?
Sorry, I do not get the question :( Are you asking to enhance the existing samtools definition?
Thanks,
JJ
Yes. I would enhance the existing samtools definition. I added in: http://testtoolshed.g2.bx.psu.edu/view/jjohnson/package_samtools_0_1_19

    <action type="shell_command">chmod ugo+rx misc/*.p?</action>
    <action type="shell_command">mkdir misc/bin</action>
    <action type="shell_command">cp -p `find misc -type f -perm -555` misc/bin/</action>
    <action type="move_file">
        <source>samtools</source>
        <destination>$INSTALL_DIR/bin</destination>
    </action>
    <action type="move_file">
        <source>bcftools/bcftools</source>
        <destination>$INSTALL_DIR/bin</destination>
    </action>
    <action type="move_file">
        <source>bcftools/vcfutils.pl</source>
        <destination>$INSTALL_DIR/bin</destination>
    </action>
    <action type="move_directory_files">
        <source_directory>misc/bin</source_directory>
        <destination_directory>$INSTALL_DIR/bin</destination_directory>
    </action>

The other option would be to create additional packages that download the same samtools source, but install bcftools and misc.

-- James E. Johnson, Minnesota Supercomputing Institute, University of Minnesota
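For readers less familiar with `find`, the expression `find misc -type f -perm -555` in the actions above selects regular files that have at least read and execute permission for user, group and others. A rough Python equivalent, as a sketch only (the function name `files_with_mode` is made up for illustration and is not part of Galaxy), would be:

```python
import os
import stat


def files_with_mode(root, required=0o555):
    """List regular files under `root` whose mode has all `required` bits set.

    Mirrors `find ROOT -type f -perm -555`: symlinks are not followed, and
    extra permission bits (for example the write bit in 0o755) are allowed.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode
            if stat.S_ISREG(mode) and (stat.S_IMODE(mode) & required) == required:
                matches.append(path)
    return sorted(matches)
```

This also shows why the `chmod ugo+rx misc/*.p?` step comes first: scripts that are not yet executable for everyone would otherwise be skipped by the permission test.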
Hi,

oh I see. Great! I would vote to include them in the main samtools package and extend the help text a little bit, so that the search will find it.

If you like you can add it to: https://github.com/bgruening/galaxytools/tree/master/orphan_tool_dependencie...

I have granted jjohnson write permission to the IUC samtools account.

Thanks, Bjoern
On Tue, Sep 17, 2013 at 7:43 AM, Bjoern Gruening <bjoern.gruening@gmail.com> wrote:
Hi,
oh I see. Great! I would vote to include them in the main samtools package and extend the help text a little bit, so that the search will find it.
Seconded, this would be helpful for some stuff I have been working on as well. -John
If you like you can add it to: https://github.com/bgruening/galaxytools/tree/master/orphan_tool_dependencie...
I have granted jjohnson write permission to the IUC samtools account.
Thanks, Bjoern
participants (4)
- Bjoern Gruening
- Jim Johnson
- John Chilton
- Peter Cock