Using Mesos to Enable distributed computing under Galaxy?
I think one of the aspects where Galaxy is a bit soft is the ability to do distributed tasks. The current system of split/replicate/merge tasks based on file type is a bit limited and hard for tool developers to expand upon. Distributed computing is a non-trival thing to implement and I think it would be a better use of our time to use an already existing framework. And it would also mean one less API for tool writers to have to develop for. I was wondering if anybody has looked at Mesos ( http://mesos.apache.org/). You can see an overview of the Mesos architecture at https://github.com/apache/mesos/blob/master/docs/Mesos-Architecture.md The important thing about Mesos is that it provides an API for C/C++, Java/Scala and Python to write distributed frameworks. There are already implementations of frameworks for common parallel programming systems such as: - Hadoop (https://github.com/mesos/hadoop) - MPI ( https://github.com/apache/mesos/blob/master/docs/Running-torque-or-mpi-on-me... ) - Spark (http://spark-project.org) And you can find example Python framework at https://github.com/apache/mesos/tree/master/src/examples/python Integration with Galaxy would have three parts: 1) Add a system config variable to Galaxy called 'MESOS_URL' that is then passed to tool wrappers and allows them to contact the local mesos infrastructure (assuming the system has been configured) or pass a null if the system isn't available. 2) Write a tool runner that works as a mesos framework to executes single cpu jobs on the distributed system. 3) For instances where mesos is not available at a system wide level (say they only have access to an SGE based cluster), but the user wants to run distributed jobs, write a wrapper that can create a mesos cluster using the existing queueing system. For example, right now I run a Mesos system under the SGE queue system. I'm curious to see what other people think. Kyle
I have created a Trello card for this (https://trello.com/c/m2x1CXmi) and I have attached a more flushed out IRC conversation for additional public comment/posterity. I think this is exciting stuff - though I need to get my head around mesos and how it would interact with Galaxy more fully. I think some important (perhaps obvious) concerns are: - Integration at framework level must be optional - this shouldn't be a required dependency. - Existing runners and talk parallelism stuff must continue to function with or without mesos. Other thoughts? -John 16:55 < jmchilton> kellrott: Liked the mesos e-mail. Trying to come up with something intelligent to say before responding :). Do you any have specific use cases/tools in mind? 16:56 < kellrott> Starting to move our tools to spark 16:56 < kellrott> That's the big draw for me 16:57 < kellrott> And honestly, it seems better to keep hacking away with things like https://bitbucket.org/galaxy/galaxy-central/pull-request/175/parameter-based... 16:57 < mrscribe> Title: galaxy / galaxy-central / Pull request #175: Parameter Based Bam file parallelization Bitbucket (at bitbucket.org) 17:04 < jmchilton> kellrott: Did you mean "it seems better to keep hacking away with things like" or did you mean it seems better not to... 17:04 < kellrott> missed a 'than' 17:05 < kellrott> but yes, pull #175 is a bit of a hack and wouldn't be needed if there was a really parallelization system 17:06 < kellrott> One of the most relevant usages is any tool that uses a random sampling for statistics, ie consensus clustering 17:06 < jmchilton> Ahh, thought so. Yeah, I think the current task splitting framework stuff is suppose to be pretty transparent to the tool wrapper/application. For applications that depend on something like spark, you idea sounds pretty good to me. 17:07 < kellrott> no real way to support parallel computation of a background model in current system 17:08 < kellrott> When talking to different people, the lack of robust parallelization tends to be the biggest complaint 17:10 < kellrott> I think Mesos is a pretty robust solution, I just wanted to see if anybody had technical complaints 17:11 < jmchilton> hmm.... interesting. I must admit I had never really thought of that as being a problem. I will have to chew on what you said though. 17:11 < jmchilton> All of that said, I have no technical complaints and I am eager to help in any way I reasonably can with the mesos stuff. It seems pretty exciting. 17:11 < kellrott> Sorry, if my explanations are a bit choppy 17:12 < kellrott> I think the biggest complaint would be 'yet another dependency' 17:13 < jmchilton> We could do it in a way that it is optional though right? 17:13 < kellrott> Exactly 17:13 < jmchilton> ... though maybe specific tools would require it. 17:14 < jmchilton> I will create a Trello card, if anyone has objections they can note them there. 17:14 < kellrott> Its not hard to support single CPU calculations (Spark has a built in single CPU mode) 17:15 < kellrott> But tools that require it would probably not be very fast running on a single machine ;-) 17:15 < kellrott> I'll probably start working on the implementation, I just wanted to make sure people are open to merging the work into central 17:17 < jmchilton> Reserve the right to reject specific implementations :), but overall I (at least personally) like the idea. 17:18 < pvh_sa> kellrott, agreed 100% 17:18 < kellrott> Obviously I'm open to peoples critiques 17:19 < pvh_sa> the Galaxy execution engine is going to need replacing, its just a question of with what, and how can it be done in the least disruptive way 17:19 < kellrott> This has become the thing that it the 'rate limiter' for adoption by tool writers around here 17:20 < kellrott> are calculations demand robust parallel programming platforms 17:22 < jmchilton> huh... I see the need to support things other than MPI, but would you say there are problems with our current MPI support (essentially delegated external resource manager). 17:23 < kellrott> well, MPI is supported under mesos ;-) 17:23 < dannon> THe main problem here is that we wanted something that supports parallelization for tools that don't natively do it. 17:24 < dannon> new runners for different execution models, totally fine. 17:25 < pvh_sa> dannon, that's something that needs to cooperation of the execution engine and the type system, right? that's kind of how it works at the moment - tools state that certain inputs are splittable (i.e. are actually collections) 17:26 < dannon> Right. I could imagine the parallelization description langugae being fleshed out more to say "if mesos, do this, else do naieve file chunking, etc." 17:26 < pvh_sa> of course that needs extending so that collections of parameters can be provided as inputs, so that tools can explore a parameter space 17:27 < dannon> But at the core, it needs to be abstract 17:27 < kellrott> I think strait file chunking will only cover a fraction of parallalization needs 17:27 < dannon> kellrott: Absolutely, file chunking is designed for embarassingly parallel problems without shared memory/etc 17:27 < dannon> It just won't work for some things. 17:28 < pvh_sa> kellrott, is mesos also scheduling work parcels? or does it not do scheduling? 17:29 < jmchilton> My understanding is its a framework that you can plugin scheduling into... 17:30 < kellrott> you write a small 'executor' that the system starts offering resources to 17:31 < kellrott> then you can accept them, and you scripts are run on the remote machines 17:31 < pvh_sa> kellrott, ok... so the scheduling is external 17:31 < kellrott> like a queue system, but a two-way dialog rather then a single sided converation 17:32 < pvh_sa> your "executor" must handle what to do with resource offers, right? so it matches the resource offers to work that needs doing 17:32 < kellrott> I believe so 17:33 < pvh_sa> not sure what other people's sense is... 17:34 < kellrott> Sorry, flipped the lingo, you write a scheduler that then has the system launch executors based on resource offers 17:34 < kellrott> https://github.com/apache/mesos/blob/master/docs/App-Framework-Development-G... 17:34 < mrscribe> Title: mesos/docs/App-Framework-Development-Guide.md at master · apache/mesos · GitHub (at github.com) 17:34 < pvh_sa> ok dokey. so conceivably you could use celery with a mesos backend or something 17:35 < kellrott> Exactly, there is already a Torque framework 17:35 < pvh_sa> one of my bugbears with galaxy's execution engine at the moment is that it schedules tasks, not workflows... workflows simply become a set of tasks at execution time 17:37 < jmchilton> pvh_sa: What should Galaxy be doing instead? 17:38 < jmchilton> i.e. how does one "schedule a workflow" if not by individual tasks? 17:38 < pvh_sa> jmchilton, workflows should exist as objects in the execution environment so that you can e.g. pause and restart them 17:38 < pvh_sa> once you've got that kind of model, you can also re-execute failed tasks... at the moment if one task fails, i have to restart the workflow from scratch.... 17:39 < dannon> pvh_sa: That's the plan 17:39 < jmchilton> You can restart workflows now, dannon has added that. 17:39 < kellrott> I kind of like the 'separate tasks' workflow model, because I'm looking into managing 'workflows' using external clients through the API 17:39 < dannon> Actually, right now, you can restart workflows. 17:39 < dannon> yep. 17:39 < kellrott> that way I can read a workflow file, and then use the query engine to find previously run jobs that could fullfill the input needs of that workflow 17:40 < jmchilton> Pausing workflows seems like a reasonable request though... maybe more advanced UI that lets you visualize and interact with the workflow as a unit. 17:40 < dannon> And I think it's actually great that that sort of thing *does* run externally Kyle 17:40 < pvh_sa> jmchilton, yep, something UI related would be good for that. 17:40 < dannon> Very near up on my list is picking up some previous work with galaxy/messaging to integrate celery tasks instead of having all these status loops 17:41 < pvh_sa> dannon, yup for celery 17:41 < kellrott> it lets me restart workflows, even if is a completley different run of the workflow, and I 'forgot' about the original 17:41 < dannon> Yep, using celery, for sure. 17:41 < pvh_sa> i'm also thinking that we want to reason across provenance graphs 17:42 < pvh_sa> so e.g. a certain transform need not be repeated if it already exists in the set of data products... e.g. we've already created that BLAST db, just re-use it 17:42 < kellrott> thats exactly what I want (and parallel tools ;-) 17:42 < dannon> Right, and I think that's the sort of thing Kyle's query work will really enable. 17:43 < pvh_sa> kellrott, have you read the paper on using... datalog i think its called... for provenance recording? 17:43 < kellrott> I'm pretty sure we can do all of this via the api 17:43 < dannon> That's the idea, at least. 17:43 < kellrott> I don't think I've seen that one 17:43 < pvh_sa> yeah the API is king... a powerful API also lets us cleanly split UI from backend 17:44 < dannon> Exactly 17:44 < kellrott> I'm not even thinking UI. I'm thinking about script controlled workflows 17:44 < pvh_sa> "Datalog as a Lingua Franca for provenance querying and reasoning" 17:45 < pvh_sa> seems lots of talking the same language going on. yay. i'm busy hiring developers to work on exactly this kind of stuff 17:45 < pvh_sa> but yeah my 3-part agenda is basically 1) type system 2) execution engine 3) provenance representation 17:46 < kellrott> We are building toward running workflows on TCGA data, so I need a system where I can call a mutation calling workflow ~4000 times 17:47 < jmchilton> pvn_sa: Enhancing Galaxy or replacing it? :) 17:47 < pvh_sa> jmchilton, ripping out bits, keeping the project together though 17:47 < kellrott> Hopefully enhancing Galaxy 17:47 < pvh_sa> jmchilton, you've got 130 000 lines of code 17:47 < pvh_sa> and a vibrant community 17:48 < pvh_sa> any way forward needs to build on that. there's lots to do, but galaxy is like a central beacon we can all gravitate towards 17:49 < pvh_sa> anyway, the current galaxy is already v2.0 compared to the original galaxy described in the literature On Sat, Oct 26, 2013 at 2:43 PM, Kyle Ellrott <kellrott@soe.ucsc.edu> wrote:
I think one of the aspects where Galaxy is a bit soft is the ability to do distributed tasks. The current system of split/replicate/merge tasks based on file type is a bit limited and hard for tool developers to expand upon. Distributed computing is a non-trival thing to implement and I think it would be a better use of our time to use an already existing framework. And it would also mean one less API for tool writers to have to develop for. I was wondering if anybody has looked at Mesos ( http://mesos.apache.org/ ). You can see an overview of the Mesos architecture at https://github.com/apache/mesos/blob/master/docs/Mesos-Architecture.md The important thing about Mesos is that it provides an API for C/C++, Java/Scala and Python to write distributed frameworks. There are already implementations of frameworks for common parallel programming systems such as: - Hadoop (https://github.com/mesos/hadoop) - MPI (https://github.com/apache/mesos/blob/master/docs/Running-torque-or-mpi-on-me...) - Spark (http://spark-project.org) And you can find example Python framework at https://github.com/apache/mesos/tree/master/src/examples/python
Integration with Galaxy would have three parts: 1) Add a system config variable to Galaxy called 'MESOS_URL' that is then passed to tool wrappers and allows them to contact the local mesos infrastructure (assuming the system has been configured) or pass a null if the system isn't available. 2) Write a tool runner that works as a mesos framework to executes single cpu jobs on the distributed system. 3) For instances where mesos is not available at a system wide level (say they only have access to an SGE based cluster), but the user wants to run distributed jobs, write a wrapper that can create a mesos cluster using the existing queueing system. For example, right now I run a Mesos system under the SGE queue system.
I'm curious to see what other people think.
Kyle
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
Kyle This is something I am very interested in. The three parts below make sense to me. I would be very happy to discuss further and provide any help to move this forward. Regards On Oct 26, 2013, at 2:43 PM, Kyle Ellrott <kellrott@soe.ucsc.edu> wrote:
I think one of the aspects where Galaxy is a bit soft is the ability to do distributed tasks. The current system of split/replicate/merge tasks based on file type is a bit limited and hard for tool developers to expand upon. Distributed computing is a non-trival thing to implement and I think it would be a better use of our time to use an already existing framework. And it would also mean one less API for tool writers to have to develop for. I was wondering if anybody has looked at Mesos ( http://mesos.apache.org/ ). You can see an overview of the Mesos architecture at https://github.com/apache/mesos/blob/master/docs/Mesos-Architecture.md The important thing about Mesos is that it provides an API for C/C++, Java/Scala and Python to write distributed frameworks. There are already implementations of frameworks for common parallel programming systems such as: - Hadoop (https://github.com/mesos/hadoop) - MPI (https://github.com/apache/mesos/blob/master/docs/Running-torque-or-mpi-on-me...) - Spark (http://spark-project.org) And you can find example Python framework at https://github.com/apache/mesos/tree/master/src/examples/python
Integration with Galaxy would have three parts: 1) Add a system config variable to Galaxy called 'MESOS_URL' that is then passed to tool wrappers and allows them to contact the local mesos infrastructure (assuming the system has been configured) or pass a null if the system isn't available. 2) Write a tool runner that works as a mesos framework to executes single cpu jobs on the distributed system. 3) For instances where mesos is not available at a system wide level (say they only have access to an SGE based cluster), but the user wants to run distributed jobs, write a wrapper that can create a mesos cluster using the existing queueing system. For example, right now I run a Mesos system under the SGE queue system.
I'm curious to see what other people think.
Kyle ___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
-- Ravi K Madduri MCS, Argonne National Laboratory Computation Institute, University of Chicago
I don't think implementation will be very difficult. The bigger question is this a technology people are open to? The nearest competitor is YARN ( http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html). Mesos seems a bit more geared toward general purpose usage (with several existing frameworks), while YARN seems more specific to Hadoop. But I'd be glad to hear some other thoughts. Kyle On Mon, Oct 28, 2013 at 12:55 PM, Ravi K Madduri <madduri@mcs.anl.gov>wrote:
Kyle This is something I am very interested in. The three parts below make sense to me. I would be very happy to discuss further and provide any help to move this forward.
Regards On Oct 26, 2013, at 2:43 PM, Kyle Ellrott <kellrott@soe.ucsc.edu> wrote:
I think one of the aspects where Galaxy is a bit soft is the ability to do distributed tasks. The current system of split/replicate/merge tasks based on file type is a bit limited and hard for tool developers to expand upon. Distributed computing is a non-trival thing to implement and I think it would be a better use of our time to use an already existing framework. And it would also mean one less API for tool writers to have to develop for. I was wondering if anybody has looked at Mesos ( http://mesos.apache.org/). You can see an overview of the Mesos architecture at https://github.com/apache/mesos/blob/master/docs/Mesos-Architecture.md The important thing about Mesos is that it provides an API for C/C++, Java/Scala and Python to write distributed frameworks. There are already implementations of frameworks for common parallel programming systems such as: - Hadoop (https://github.com/mesos/hadoop) - MPI ( https://github.com/apache/mesos/blob/master/docs/Running-torque-or-mpi-on-me... ) - Spark (http://spark-project.org) And you can find example Python framework at https://github.com/apache/mesos/tree/master/src/examples/python
Integration with Galaxy would have three parts: 1) Add a system config variable to Galaxy called 'MESOS_URL' that is then passed to tool wrappers and allows them to contact the local mesos infrastructure (assuming the system has been configured) or pass a null if the system isn't available. 2) Write a tool runner that works as a mesos framework to executes single cpu jobs on the distributed system. 3) For instances where mesos is not available at a system wide level (say they only have access to an SGE based cluster), but the user wants to run distributed jobs, write a wrapper that can create a mesos cluster using the existing queueing system. For example, right now I run a Mesos system under the SGE queue system.
I'm curious to see what other people think.
Kyle ___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
-- Ravi K Madduri MCS, Argonne National Laboratory Computation Institute, University of Chicago
Hi Kyle, We have a similar ongoing development wherein we are working on integrating our Swift framework ( swift-lang.org ) with Galaxy. The goal is to enable Galaxy based applications to run on a variety of distributed resources via various integration schemes as suitable to application and underlying execution environment. Here is an abstract of a paper (co-authored with Ravi, who responded on this thread) we will be presenting in a workshop at the upcoming SC 13 conference: "The Galaxy platform is a web-based science portal for scientific computing supporting Life Sciences users community. While user-friendly and intuitive for doing small to medium scale computations, it currently has a limited support for large-scale, parallel and distributed computing. The Swift parallel scripting framework is capable of composing ordinary applications into parallel scripts that can be run on multi-scale distributed and performance computing platforms. In complex distributed environments, often the user end of application lifecycle slows down because of the technical complexities brought in by the scale, access methods and resource management nuances. Galaxy offers a simple way of designing, composing, executing, reusing, and reproducing application runs. An integration between Swift and Galaxy systems can accelerate science as well as bring the respective user communities together in an interactive, user-friendly, parallel and distributed data analysis environment enabled on a broad range of computational infrastructures." Kindly let us know if you need a hands on for the various tools we have already developed. Best, Ketan On Mon, Oct 28, 2013 at 3:07 PM, Kyle Ellrott <kellrott@soe.ucsc.edu> wrote:
I don't think implementation will be very difficult. The bigger question is this a technology people are open to? The nearest competitor is YARN ( http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html). Mesos seems a bit more geared toward general purpose usage (with several existing frameworks), while YARN seems more specific to Hadoop. But I'd be glad to hear some other thoughts.
Kyle
On Mon, Oct 28, 2013 at 12:55 PM, Ravi K Madduri <madduri@mcs.anl.gov>wrote:
Kyle This is something I am very interested in. The three parts below make sense to me. I would be very happy to discuss further and provide any help to move this forward.
Regards On Oct 26, 2013, at 2:43 PM, Kyle Ellrott <kellrott@soe.ucsc.edu> wrote:
I think one of the aspects where Galaxy is a bit soft is the ability to do distributed tasks. The current system of split/replicate/merge tasks based on file type is a bit limited and hard for tool developers to expand upon. Distributed computing is a non-trival thing to implement and I think it would be a better use of our time to use an already existing framework. And it would also mean one less API for tool writers to have to develop for. I was wondering if anybody has looked at Mesos ( http://mesos.apache.org/). You can see an overview of the Mesos architecture at https://github.com/apache/mesos/blob/master/docs/Mesos-Architecture.md The important thing about Mesos is that it provides an API for C/C++, Java/Scala and Python to write distributed frameworks. There are already implementations of frameworks for common parallel programming systems such as: - Hadoop (https://github.com/mesos/hadoop) - MPI ( https://github.com/apache/mesos/blob/master/docs/Running-torque-or-mpi-on-me... ) - Spark (http://spark-project.org) And you can find example Python framework at https://github.com/apache/mesos/tree/master/src/examples/python
Integration with Galaxy would have three parts: 1) Add a system config variable to Galaxy called 'MESOS_URL' that is then passed to tool wrappers and allows them to contact the local mesos infrastructure (assuming the system has been configured) or pass a null if the system isn't available. 2) Write a tool runner that works as a mesos framework to executes single cpu jobs on the distributed system. 3) For instances where mesos is not available at a system wide level (say they only have access to an SGE based cluster), but the user wants to run distributed jobs, write a wrapper that can create a mesos cluster using the existing queueing system. For example, right now I run a Mesos system under the SGE queue system.
I'm curious to see what other people think.
Kyle ___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
-- Ravi K Madduri MCS, Argonne National Laboratory Computation Institute, University of Chicago
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
-- Ketan
You probably are a good person to get an opinion from. My plan isn't to write new frameworks, but rather use existing libraries that can communicate with Mesos to setup their parallel environments. But for Swift, you would probably want to write a new framework. Just looking at Swift, I imagine one of the harder parts is just getting the system setup on a cluster (ie distributing out files to remote nodes, making sure that you have a way to start processes on those nodes and have them know where to find the master), it seems like Swift could benefit from having a Mesos based framework. Do you think it would enable you to have a 'zero-config' startup of a distributed Swift application? Kyle On Mon, Oct 28, 2013 at 1:51 PM, Ketan Maheshwari < ketancmaheshwari@gmail.com> wrote:
Hi Kyle,
We have a similar ongoing development wherein we are working on integrating our Swift framework ( swift-lang.org ) with Galaxy. The goal is to enable Galaxy based applications to run on a variety of distributed resources via various integration schemes as suitable to application and underlying execution environment.
Here is an abstract of a paper (co-authored with Ravi, who responded on this thread) we will be presenting in a workshop at the upcoming SC 13 conference:
"The Galaxy platform is a web-based science portal for scientific computing supporting Life Sciences users community. While user-friendly and intuitive for doing small to medium scale computations, it currently has a limited support for large-scale, parallel and distributed computing. The Swift parallel scripting framework is capable of composing ordinary applications into parallel scripts that can be run on multi-scale distributed and performance computing platforms. In complex distributed environments, often the user end of application lifecycle slows down because of the technical complexities brought in by the scale, access methods and resource management nuances. Galaxy offers a simple way of designing, composing, executing, reusing, and reproducing application runs. An integration between Swift and Galaxy systems can accelerate science as well as bring the respective user communities together in an interactive, user-friendly, parallel and distributed data analysis environment enabled on a broad range of computational infrastructures."
Kindly let us know if you need a hands on for the various tools we have already developed.
Best, Ketan
On Mon, Oct 28, 2013 at 3:07 PM, Kyle Ellrott <kellrott@soe.ucsc.edu>wrote:
I don't think implementation will be very difficult. The bigger question is this a technology people are open to? The nearest competitor is YARN ( http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html). Mesos seems a bit more geared toward general purpose usage (with several existing frameworks), while YARN seems more specific to Hadoop. But I'd be glad to hear some other thoughts.
Kyle
On Mon, Oct 28, 2013 at 12:55 PM, Ravi K Madduri <madduri@mcs.anl.gov>wrote:
Kyle This is something I am very interested in. The three parts below make sense to me. I would be very happy to discuss further and provide any help to move this forward.
Regards On Oct 26, 2013, at 2:43 PM, Kyle Ellrott <kellrott@soe.ucsc.edu> wrote:
I think one of the aspects where Galaxy is a bit soft is the ability to do distributed tasks. The current system of split/replicate/merge tasks based on file type is a bit limited and hard for tool developers to expand upon. Distributed computing is a non-trival thing to implement and I think it would be a better use of our time to use an already existing framework. And it would also mean one less API for tool writers to have to develop for. I was wondering if anybody has looked at Mesos ( http://mesos.apache.org/ ). You can see an overview of the Mesos architecture at https://github.com/apache/mesos/blob/master/docs/Mesos-Architecture.md The important thing about Mesos is that it provides an API for C/C++, Java/Scala and Python to write distributed frameworks. There are already implementations of frameworks for common parallel programming systems such as: - Hadoop (https://github.com/mesos/hadoop) - MPI ( https://github.com/apache/mesos/blob/master/docs/Running-torque-or-mpi-on-me... ) - Spark (http://spark-project.org) And you can find example Python framework at https://github.com/apache/mesos/tree/master/src/examples/python
Integration with Galaxy would have three parts: 1) Add a system config variable to Galaxy called 'MESOS_URL' that is then passed to tool wrappers and allows them to contact the local mesos infrastructure (assuming the system has been configured) or pass a null if the system isn't available. 2) Write a tool runner that works as a mesos framework to executes single cpu jobs on the distributed system. 3) For instances where mesos is not available at a system wide level (say they only have access to an SGE based cluster), but the user wants to run distributed jobs, write a wrapper that can create a mesos cluster using the existing queueing system. For example, right now I run a Mesos system under the SGE queue system.
I'm curious to see what other people think.
Kyle ___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
-- Ravi K Madduri MCS, Argonne National Laboratory Computation Institute, University of Chicago
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
-- Ketan
Hi Kyle, Swift indeed is a complete framework for distributed computing. Distributing files out to cluster nodes, starting processes, bringing back result files to submit host is done out of the box (stagein-exec-stageout cycle). We can discuss offline if you are interested in giving it a shot. Best, Ketan On Mon, Oct 28, 2013 at 4:14 PM, Kyle Ellrott <kellrott@soe.ucsc.edu> wrote:
You probably are a good person to get an opinion from. My plan isn't to write new frameworks, but rather use existing libraries that can communicate with Mesos to setup their parallel environments. But for Swift, you would probably want to write a new framework. Just looking at Swift, I imagine one of the harder parts is just getting the system setup on a cluster (ie distributing out files to remote nodes, making sure that you have a way to start processes on those nodes and have them know where to find the master), it seems like Swift could benefit from having a Mesos based framework. Do you think it would enable you to have a 'zero-config' startup of a distributed Swift application?
Kyle
On Mon, Oct 28, 2013 at 1:51 PM, Ketan Maheshwari < ketancmaheshwari@gmail.com> wrote:
Hi Kyle,
We have a similar ongoing development wherein we are working on integrating our Swift framework ( swift-lang.org ) with Galaxy. The goal is to enable Galaxy based applications to run on a variety of distributed resources via various integration schemes as suitable to application and underlying execution environment.
Here is an abstract of a paper (co-authored with Ravi, who responded on this thread) we will be presenting in a workshop at the upcoming SC 13 conference:
"The Galaxy platform is a web-based science portal for scientific computing supporting Life Sciences users community. While user-friendly and intuitive for doing small to medium scale computations, it currently has a limited support for large-scale, parallel and distributed computing. The Swift parallel scripting framework is capable of composing ordinary applications into parallel scripts that can be run on multi-scale distributed and performance computing platforms. In complex distributed environments, often the user end of application lifecycle slows down because of the technical complexities brought in by the scale, access methods and resource management nuances. Galaxy offers a simple way of designing, composing, executing, reusing, and reproducing application runs. An integration between Swift and Galaxy systems can accelerate science as well as bring the respective user communities together in an interactive, user-friendly, parallel and distributed data analysis environment enabled on a broad range of computational infrastructures."
Kindly let us know if you need a hands on for the various tools we have already developed.
Best, Ketan
On Mon, Oct 28, 2013 at 3:07 PM, Kyle Ellrott <kellrott@soe.ucsc.edu>wrote:
I don't think implementation will be very difficult. The bigger question is this a technology people are open to? The nearest competitor is YARN ( http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html). Mesos seems a bit more geared toward general purpose usage (with several existing frameworks), while YARN seems more specific to Hadoop. But I'd be glad to hear some other thoughts.
Kyle
On Mon, Oct 28, 2013 at 12:55 PM, Ravi K Madduri <madduri@mcs.anl.gov>wrote:
Kyle This is something I am very interested in. The three parts below make sense to me. I would be very happy to discuss further and provide any help to move this forward.
Regards On Oct 26, 2013, at 2:43 PM, Kyle Ellrott <kellrott@soe.ucsc.edu> wrote:
I think one of the aspects where Galaxy is a bit soft is the ability to do distributed tasks. The current system of split/replicate/merge tasks based on file type is a bit limited and hard for tool developers to expand upon. Distributed computing is a non-trival thing to implement and I think it would be a better use of our time to use an already existing framework. And it would also mean one less API for tool writers to have to develop for. I was wondering if anybody has looked at Mesos ( http://mesos.apache.org/ ). You can see an overview of the Mesos architecture at https://github.com/apache/mesos/blob/master/docs/Mesos-Architecture.md The important thing about Mesos is that it provides an API for C/C++, Java/Scala and Python to write distributed frameworks. There are already implementations of frameworks for common parallel programming systems such as: - Hadoop (https://github.com/mesos/hadoop) - MPI ( https://github.com/apache/mesos/blob/master/docs/Running-torque-or-mpi-on-me... ) - Spark (http://spark-project.org) And you can find example Python framework at https://github.com/apache/mesos/tree/master/src/examples/python
Integration with Galaxy would have three parts: 1) Add a system config variable to Galaxy called 'MESOS_URL' that is then passed to tool wrappers and allows them to contact the local mesos infrastructure (assuming the system has been configured) or pass a null if the system isn't available. 2) Write a tool runner that works as a mesos framework to executes single cpu jobs on the distributed system. 3) For instances where mesos is not available at a system wide level (say they only have access to an SGE based cluster), but the user wants to run distributed jobs, write a wrapper that can create a mesos cluster using the existing queueing system. For example, right now I run a Mesos system under the SGE queue system.
I'm curious to see what other people think.
Kyle ___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
-- Ravi K Madduri MCS, Argonne National Laboratory Computation Institute, University of Chicago
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
-- Ketan
-- Ketan
Hey Kyle, all, If anyone wants to play with running Galaxy jobs within an Apache Mesos environment I have added a prototype of this feature to the LWR. https://bitbucket.org/jmchilton/lwr/commits/555438d2fe266899338474b25c540fef... https://bitbucket.org/jmchilton/lwr/commits/9748b3035dbe3802d4136a6a1028df83... This work distributes jobs across a Mesos cluster and injects a MESOS_URL environment variable into the job runtime environment in case the jobs themselves want to take advantage of Mesos. The advantage of the LWR versus a traditional Galaxy runner is that the job can be staged to remote resources without shared disk. Prior to this I was imaging the LWR to be useful in cases where Galaxy and remote cluster don't share common disk but where there is in fact a shared scratch directory or something across the remote cluster as well a resource manager. The LWR Mesos framework however has the actual compute servers themselves stage the job up and down - so you could imagine distributing Galaxy across large clusters without any shared disk whatsoever - that could be very cool and help scale say cloud applications. Downsides of an LWR-based approach versus a Galaxy approach is that it is less mature and there is more stuff to configure - need to configure a Galaxy job_conf plugin and destination, need to configure the LWR itself, need to configure a message queue (for this variant of LWR operation anyway - it should be possible to drive this via the LWR in web server mode but I haven't added it yet). I would be more than happy to continue to see progress toward Mesos support in Galaxy proper. It is strictly a prototype so far - a sort of playground if anyone wants to play with these ideas and build something cool. It really is a "framework" right - not so much a job scheduler so I am not sure it is very immediately useful - but I imagine one could build cool stuff on top of it. Next, I think I would like to add Apache Aurora (http://aurora.incubator.apache.org/) support - because it seems like a much more traditional resource manager but built on top of Mesos so it would be more practical for traditional Galaxy-style jobs. Doesn't buy you anything in terms of parallelization but it would "fit better" with Galaxy. -John On Sat, Oct 26, 2013 at 2:43 PM, Kyle Ellrott <kellrott@soe.ucsc.edu> wrote:
I think one of the aspects where Galaxy is a bit soft is the ability to do distributed tasks. The current system of split/replicate/merge tasks based on file type is a bit limited and hard for tool developers to expand upon. Distributed computing is a non-trival thing to implement and I think it would be a better use of our time to use an already existing framework. And it would also mean one less API for tool writers to have to develop for. I was wondering if anybody has looked at Mesos ( http://mesos.apache.org/ ). You can see an overview of the Mesos architecture at https://github.com/apache/mesos/blob/master/docs/Mesos-Architecture.md The important thing about Mesos is that it provides an API for C/C++, Java/Scala and Python to write distributed frameworks. There are already implementations of frameworks for common parallel programming systems such as: - Hadoop (https://github.com/mesos/hadoop) - MPI (https://github.com/apache/mesos/blob/master/docs/Running-torque-or-mpi-on-me...) - Spark (http://spark-project.org) And you can find example Python framework at https://github.com/apache/mesos/tree/master/src/examples/python
Integration with Galaxy would have three parts: 1) Add a system config variable to Galaxy called 'MESOS_URL' that is then passed to tool wrappers and allows them to contact the local mesos infrastructure (assuming the system has been configured) or pass a null if the system isn't available. 2) Write a tool runner that works as a mesos framework to executes single cpu jobs on the distributed system. 3) For instances where mesos is not available at a system wide level (say they only have access to an SGE based cluster), but the user wants to run distributed jobs, write a wrapper that can create a mesos cluster using the existing queueing system. For example, right now I run a Mesos system under the SGE queue system.
I'm curious to see what other people think.
Kyle
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
Glad to see someone else is playing around with Mesos. I have a mesos branch that is getting a little long in the tooth. I'd like to get a straight job runner (non-LWR, with a shared file system) running under mesos for Galaxy before I submit that work for a pull request. The hackathon is only 12 days away! Hopefully we'll be able to make some progress on these sorts of projects. Kyle On Sun, Jun 15, 2014 at 4:06 PM, John Chilton <jmchilton@gmail.com> wrote:
Hey Kyle, all,
If anyone wants to play with running Galaxy jobs within an Apache Mesos environment I have added a prototype of this feature to the LWR.
https://bitbucket.org/jmchilton/lwr/commits/555438d2fe266899338474b25c540fef...
https://bitbucket.org/jmchilton/lwr/commits/9748b3035dbe3802d4136a6a1028df83...
This work distributes jobs across a Mesos cluster and injects a MESOS_URL environment variable into the job runtime environment in case the jobs themselves want to take advantage of Mesos.
The advantage of the LWR versus a traditional Galaxy runner is that the job can be staged to remote resources without shared disk. Prior to this I was imaging the LWR to be useful in cases where Galaxy and remote cluster don't share common disk but where there is in fact a shared scratch directory or something across the remote cluster as well a resource manager. The LWR Mesos framework however has the actual compute servers themselves stage the job up and down - so you could imagine distributing Galaxy across large clusters without any shared disk whatsoever - that could be very cool and help scale say cloud applications.
Downsides of an LWR-based approach versus a Galaxy approach is that it is less mature and there is more stuff to configure - need to configure a Galaxy job_conf plugin and destination, need to configure the LWR itself, need to configure a message queue (for this variant of LWR operation anyway - it should be possible to drive this via the LWR in web server mode but I haven't added it yet). I would be more than happy to continue to see progress toward Mesos support in Galaxy proper.
It is strictly a prototype so far - a sort of playground if anyone wants to play with these ideas and build something cool. It really is a "framework" right - not so much a job scheduler so I am not sure it is very immediately useful - but I imagine one could build cool stuff on top of it.
Next, I think I would like to add Apache Aurora (http://aurora.incubator.apache.org/) support - because it seems like a much more traditional resource manager but built on top of Mesos so it would be more practical for traditional Galaxy-style jobs. Doesn't buy you anything in terms of parallelization but it would "fit better" with Galaxy.
-John
I think one of the aspects where Galaxy is a bit soft is the ability to do distributed tasks. The current system of split/replicate/merge tasks
on file type is a bit limited and hard for tool developers to expand upon. Distributed computing is a non-trival thing to implement and I think it would be a better use of our time to use an already existing framework. And it would also mean one less API for tool writers to have to develop for. I was wondering if anybody has looked at Mesos ( http://mesos.apache.org/ ). You can see an overview of the Mesos architecture at https://github.com/apache/mesos/blob/master/docs/Mesos-Architecture.md The important thing about Mesos is that it provides an API for C/C++, Java/Scala and Python to write distributed frameworks. There are already implementations of frameworks for common parallel programming systems such as: - Hadoop (https://github.com/mesos/hadoop) - MPI ( https://github.com/apache/mesos/blob/master/docs/Running-torque-or-mpi-on-me... ) - Spark (http://spark-project.org) And you can find example Python framework at https://github.com/apache/mesos/tree/master/src/examples/python
Integration with Galaxy would have three parts: 1) Add a system config variable to Galaxy called 'MESOS_URL' that is then passed to tool wrappers and allows them to contact the local mesos infrastructure (assuming the system has been configured) or pass a null if the system isn't available. 2) Write a tool runner that works as a mesos framework to executes single cpu jobs on the distributed system. 3) For instances where mesos is not available at a system wide level (say they only have access to an SGE based cluster), but the user wants to run distributed jobs, write a wrapper that can create a mesos cluster using
On Sat, Oct 26, 2013 at 2:43 PM, Kyle Ellrott <kellrott@soe.ucsc.edu> wrote: based the
existing queueing system. For example, right now I run a Mesos system under the SGE queue system.
I'm curious to see what other people think.
Kyle
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
participants (5)
-
John Chilton
-
John Chilton
-
Ketan Maheshwari
-
Kyle Ellrott
-
Ravi K Madduri