One of my goals for the GCC was to sell the idea that tool shed repositories need to be installable without a database present. I talked with James Taylor and Enis Afgan about this idea briefly and they seemed to believe it was a good idea - I kept meaning to discuss it with Greg but never got a good opportunity. Though in the past Greg has made this sound potentially doable and has never overtly objected to the goal.
I have two specific use cases in mind (CloudBioLinux and LWR), but perhaps the higher-level justification is that a lot of effort from Greg and others (Dave, Bjorn, Peter, Nate) has gone into building a modular dependency system that could very easily be leveraged by applications other than Galaxy, so the extra steps needed to make this possible should be taken, to make the codebase as broadly useful as possible and to encourage adoption. The Galaxy community could benefit from other applications potentially utilizing and populating the tool shed, and Galaxy tool developers would be further incentivized to write good, modular dependencies and publish them to the tool shed.
A high-level task decomposition would be something like this:
1. Rework installing tool shed repositories to not require a database. A kind of messy way to do this might be adding a use_database flag throughout. A cleaner way might be to allow the core functionality to work with callbacks or plugins that perform the database interactions.
2. Separate the core functionality out of the Galaxy code base entirely into a reusable, stand-alone library.
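A minimal sketch of what the plugin approach in item 1 might look like (all names here are hypothetical, not existing Galaxy code):

```python
# Hypothetical sketch: isolate all persistence behind a small interface so
# the install logic never touches the Galaxy database directly.

class InstallStateStore(object):
    """Interface the install code would program against."""

    def record_install(self, repository, changeset):
        raise NotImplementedError()

    def installed_changeset(self, repository):
        raise NotImplementedError()


class InMemoryStateStore(InstallStateStore):
    """Database-free implementation, e.g. for CloudBioLinux or the LWR."""

    def __init__(self):
        self._state = {}

    def record_install(self, repository, changeset):
        self._state[repository] = changeset

    def installed_changeset(self, repository):
        return self._state.get(repository)


def install_repository(store, repository, changeset):
    # Real logic would clone the repository, run the tool_dependencies.xml
    # recipes, etc.; this stub only shows where the store plugs in.
    store.record_install(repository, changeset)
```

Galaxy proper would then supply a store implementation backed by its existing tables, while other applications supply their own.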
I would love buy-in from the Galaxy team on item 2 above, but it is not strictly needed for my goals - I imagine I could write a script to pull it out of Galaxy and build the library automatically, or even just have the Galaxy codebase present when using Galaxy-less tool shed dependencies.
Buy-in on item 1 from the Galaxy team (specifically Greg and Dave B.), however, is needed. Are there any objections to this idea? Do you have any broad advice on how to approach this to ensure the changes make sense, work with your long-term vision, and end up in Galaxy?
Of all the things on my TODO list for the next year, this is probably the one most likely to be broadly interesting to this week's BOSC codefest attendees, so I was going to attempt to sell this as something to work on. The sales pitch would include building a little tool shed version of the module command (http://linux.die.net/man/1/module) to demonstrate this work and have something immediately useful produced.
The idea would be to create a command-line tool for utilizing tool shed dependencies.
# Unlike standard module, an install procedure is available. Probably could
# default to main tool shed and latest installable revision.
% tsmodule repo:install galaxyp/tint
% tsmodule repo:install toolshed.g2.bx.psu.edu/galaxyp/tint/ab43b5ba7a4e
# module lets you list packages, I guess the tool shed version would need
# repository and package listings:
% tsmodule repo:list toolshed.g2.bx.psu.edu/galaxyp/tint/ab43b5ba7a4e
% tsmodule package:list tint_proteomics_scripts/1.19.19/galaxyp/tint/ab43b5ba7a4e
# Finally, a use command would source the env.sh script and make the dependency
# available in the command-line (might require starting a new shell?):
% tsmodule package:use tint_proteomics_scripts
% tsmodule package:use tint_proteomics_scripts/1.19.19
% tsmodule package:use tint_proteomics_scripts/1.19.19/galaxyp/tint/ab43b5ba7a4e
# Use apps that would be available to tools with valid requirements tags.
% iQuantCLI
This would be different from using the API scripts because there would be no API, Galaxy instance, or Galaxy database involved - just the Galaxy code. If this was able to split into its own Python library, one could imagine even allowing something like tsmodule to be installable right from pip and recursively fetch a toolshed_client library or something like that.
I especially like this idea. I wasn't at GCC, but my boss was, so you may have seen some of the stuff I'm working on with Web Services.
Right now, the transition to the tool shed is awesome, except when it comes time to actually run the tools for adding web services. You see, the tool my colleagues and I have developed dynamically creates tool config files based on WSDLs and WADLs. The problem is that, at the moment, our tool actually edits the tool_conf.xml file directly in order to add the web service operation tools to Galaxy. Your idea, if I understand it correctly, would mean that it might be possible for my tool to utilize this other method of installation in order to add/remove the generated tools to Galaxy.
I'd definitely like to hear more about this.
Hi John,
It's really too bad that we didn't find time to discuss this in person at the GCC. Until now, I've not heard from anyone that installation from the tool shed without requiring a Galaxy database is important, so I'm lacking some context on this (I assume your statement "without a database present" refers to the Galaxy database).
Some concerns that immediately pop into my mind are the following - this is not a complete list.
1) How do you ensure dependency relationships between installed repositories?
2) How do you manage locating installed repository contents?
3) How do you manage the current state of an installed repository and determine if it can be updated?
4) How do you manage tool version lineage for installed repositories that contain tools (this plays an important role in ensuring reproducibility)?
5) How do you maintain the state of an installed repository, enabling it to be repaired if it or any of its dependencies are in an error state?
Since I've never considered re-engineering the tool shed installation process so that it would function in an environment without a Galaxy database, I'm not sure how much effort would need to go into doing so, or where to start. I'll have to think about this for a while.
Greg Von Kuster
On Jul 15, 2013, at 7:27 PM, John Chilton chilton@msi.umn.edu wrote:
One of my goals for the GCC was to sell the idea that tool shed repositories need to be installable without a database present.
Oops. Sorry if this seems like it is coming out of the blue, I feel like I have been bringing this up in different ways for over a year. Obviously, these are the kinds of questions I was hoping you would bring up and your advice will be most welcome. Here are my initial thoughts, in a different order.
2) How do you manage locating installed repository contents?
Take an example installed repository:
toolshed.g2.bx.psu.edu/repos/galaxyp/tint/ab43b5ba7a4e/
It seems you create a very regular file structure. Can't I just traverse the shed_tools directory and infer repository metadata (tool shed, user, module, changeset) from the path? Likewise, can I scan the tool_dependencies directory for package information?
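As a sketch of what that inference might look like (a hypothetical function, assuming the path layout above holds in general):

```python
def parse_installed_repository_path(path):
    """Infer repository metadata from an installed repository path like:

    toolshed.g2.bx.psu.edu/repos/galaxyp/tint/ab43b5ba7a4e
    """
    parts = path.strip("/").split("/")
    # Last five components: <tool shed host>/repos/<owner>/<name>/<changeset>
    tool_shed, repos_literal, owner, name, changeset = parts[-5:]
    if repos_literal != "repos":
        raise ValueError("unexpected path layout: %s" % path)
    return dict(tool_shed=tool_shed, owner=owner,
                name=name, changeset=changeset)
```

A similar walk over tool_dependencies could recover package names and versions from its directory structure.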
5) How do you maintain the state of an installed repository, enabling it to be repaired if it or any of its dependencies are in an error state?
3) How do you manage the current state of an installed repository and determine if it can be updated?
Create a per-repository JSON/YAML/XML file with relevant metadata: toolshed.g2.bx.psu.edu/repos/galaxyp/tint/ab43b5ba7a4e.metadata.json
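For illustration, a hypothetical example of what such a metadata file might contain (all field names invented here, not Galaxy's actual schema):

```python
import json

# Hypothetical per-repository metadata, written alongside the installed files.
metadata = {
    "tool_shed": "toolshed.g2.bx.psu.edu",
    "owner": "galaxyp",
    "name": "tint",
    "changeset_revision": "ab43b5ba7a4e",
    "status": "installed",          # or "error", "uninstalled", ...
    "repository_dependencies": [],  # other (owner, name, changeset) triples
}

serialized = json.dumps(metadata, indent=2, sort_keys=True)
```

An installer could rewrite this file as a repository's state changes, and anything (Galaxy included) could read it back to determine what is installed and whether repair or update is needed.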
Alternatively, an sqlite database with the same model could be used as Dr. Taylor suggested in his response. I will need to think about that.
1) How do you ensure dependency relationships between installed repositories?
Is any of this information not determinable from the files on the file system? I suspect not, since you clone down all of the repository files, right? But if there is some metadata that cannot be inferred, it can be added to the per-repository metadata file.
4) How do you manage tool version lineage for installed repositories that contain tools (this plays an important role in ensuring reproducibility)?
As with all of these, that is an important question and I will need to think about it, but tracking tool version lineage would seem to be Galaxy's job. When Galaxy uses a tool, it will need to track this; it doesn't seem that it should be a module system's job to track its own use. Certainly for my two use cases (CloudBioLinux and LWR) I just want to be able to install the tools without the database; all of the metadata will eventually need to end up in Galaxy.
More specifically, in the case of CBL I was hoping to install all of the repositories at image-creation time without a database present, and then have Galaxy slurp up these metadata files on startup, once you have a running cloud instance, and populate whatever it needs into the relevant tables you have already defined. From that point it would look just like a normally installed tool.
In the case of the LWR, the tool will already need to be installed in Galaxy. This mechanism would just allow its dependencies to be reinstalled on a remote server without a shared filesystem. The precise tracking you are doing in the database is what would enable Galaxy to tell the LWR exactly what to install. From that point on, though, the LWR doesn't need to track this information; it is just running jobs for Galaxy, and it should be Galaxy's responsibility to track their use.
-John
On Mon, Jul 15, 2013 at 7:01 PM, Greg Von Kuster greg@bx.psu.edu wrote:
It's really too bad that we didn't find time to discuss this in person at the GCC.
On Mon, Jul 15, 2013 at 7:27 PM, John Chilton chilton@msi.umn.edu wrote:
One of my goals for the GCC was to sell the idea that tool shed repositories need to be installable without a database present. I
John, as I've mentioned in the past I'm very strongly in favor of this. There is absolutely no reason why the ToolShed needs to be tied to Galaxy. For dependency installation we've now got a nice system that does version isolation well, and I've had several groups express interest in using the wrappers from the ToolShed in environments other than Galaxy.
from Greg and others (Dave, Bjorn, Peter, Nate) has gone into building a modular dependency system that could very easily be leveraged by applications other than Galaxy, so the extra steps needed to make this possible should be taken, to make the codebase as broadly useful as possible and to encourage adoption.
Exactly!
- Rework installing tool shed repositories to not require a database. A kind
of messy way to do this might be adding a use_database flag throughout. A cleaner way might be to allow the core functionality to work with callbacks or plugins that perform the database interactions.
Personally I would definitely prefer a plugin model. Basically, a layer isolating the underlying datastore used during tool installation, dependency resolution, et cetera. However, it occurs to me that even if the tool installation component is isolated from Galaxy, it could still use the same model and database. All package management systems need some way to maintain state, a sqlite database is not an unreasonable choice.
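As one hypothetical illustration of that suggestion (the table and column names are invented here, not Galaxy's actual model):

```python
import sqlite3

# Sketch: keep installed-repository state in a standalone sqlite file
# instead of the Galaxy database (":memory:" here just for demonstration).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE installed_repository (
        tool_shed TEXT,
        owner TEXT,
        name TEXT,
        changeset_revision TEXT,
        status TEXT
    )
""")
conn.execute(
    "INSERT INTO installed_repository VALUES (?, ?, ?, ?, ?)",
    ("toolshed.g2.bx.psu.edu", "galaxyp", "tint", "ab43b5ba7a4e", "installed"),
)
row = conn.execute(
    "SELECT changeset_revision, status FROM installed_repository"
    " WHERE name = ?",
    ("tint",),
).fetchone()
```

Pointed at a real file instead of ":memory:", the same queries would let a database-free installer answer the update/repair questions Greg raises above.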
- Separate the core functionality out of the Galaxy code base entirely into
a reusable, stand-alone library.
This is a long-term goal for us. I think at first we'd like to at least see it all happen in the same repository so that we can still support a checkout-and-run scenario. But decoupling the various web applications and shared components that make up the Galaxy ecosystem is very important.
include building a little tool shed version of the module command - http://linux.die.net/man/1/module to demonstrate this work and have something immediately useful produced.
YES! Absolutely. One of the reasons for the way tool dependency injection was implemented was to support a command line module system. Coupling this with the toolshed to also enable dependency installation and management completes the puzzle.
(and yes to all the various commands, though I figured it would be gx_module ;)
This would be different from using the API scripts because there would be no API, Galaxy instance, or Galaxy database involved - just the Galaxy code. If this was able to split into its own Python library, one could imagine even allowing something like tsmodule to be installable right from pip and recursively fetch a toolshed_client library or something like that.
Absolutely.
On Tue, Jul 16, 2013 at 1:16 PM, James Taylor james@jamestaylor.org wrote:
All package management systems need some way to maintain state, a sqlite database is not an unreasonable choice.
An sqlite database wasn't what I was imagining, but it seems thoroughly reasonable and the cleanest thing to implement.
But decoupling the various web applications and shared components that make up the Galaxy ecosystem is very important.
Agreed.
galaxy-dev@lists.galaxyproject.org