I have created a Trello card for this (https://trello.com/c/m2x1CXmi) and I have attached a more flushed out IRC conversation for additional public comment/posterity. I think this is exciting stuff - though I need to get my head around mesos and how it would interact with Galaxy more fully. I think some important (perhaps obvious) concerns are: - Integration at framework level must be optional - this shouldn't be a required dependency. - Existing runners and talk parallelism stuff must continue to function with or without mesos. Other thoughts? -John 16:55 < jmchilton> kellrott: Liked the mesos e-mail. Trying to come up with something intelligent to say before responding :). Do you any have specific use cases/tools in mind? 16:56 < kellrott> Starting to move our tools to spark 16:56 < kellrott> That's the big draw for me 16:57 < kellrott> And honestly, it seems better to keep hacking away with things like https://bitbucket.org/galaxy/galaxy-central/pull-request/175/parameter-based... 16:57 < mrscribe> Title: galaxy / galaxy-central / Pull request #175: Parameter Based Bam file parallelization Bitbucket (at bitbucket.org) 17:04 < jmchilton> kellrott: Did you mean "it seems better to keep hacking away with things like" or did you mean it seems better not to... 17:04 < kellrott> missed a 'than' 17:05 < kellrott> but yes, pull #175 is a bit of a hack and wouldn't be needed if there was a really parallelization system 17:06 < kellrott> One of the most relevant usages is any tool that uses a random sampling for statistics, ie consensus clustering 17:06 < jmchilton> Ahh, thought so. Yeah, I think the current task splitting framework stuff is suppose to be pretty transparent to the tool wrapper/application. For applications that depend on something like spark, you idea sounds pretty good to me. 17:07 < kellrott> no real way to support parallel computation of a background model in current system 17:08 < kellrott> When talking to different people, the lack of robust parallelization tends to be the biggest complaint 17:10 < kellrott> I think Mesos is a pretty robust solution, I just wanted to see if anybody had technical complaints 17:11 < jmchilton> hmm.... interesting. I must admit I had never really thought of that as being a problem. I will have to chew on what you said though. 17:11 < jmchilton> All of that said, I have no technical complaints and I am eager to help in any way I reasonably can with the mesos stuff. It seems pretty exciting. 17:11 < kellrott> Sorry, if my explanations are a bit choppy 17:12 < kellrott> I think the biggest complaint would be 'yet another dependency' 17:13 < jmchilton> We could do it in a way that it is optional though right? 17:13 < kellrott> Exactly 17:13 < jmchilton> ... though maybe specific tools would require it. 17:14 < jmchilton> I will create a Trello card, if anyone has objections they can note them there. 17:14 < kellrott> Its not hard to support single CPU calculations (Spark has a built in single CPU mode) 17:15 < kellrott> But tools that require it would probably not be very fast running on a single machine ;-) 17:15 < kellrott> I'll probably start working on the implementation, I just wanted to make sure people are open to merging the work into central 17:17 < jmchilton> Reserve the right to reject specific implementations :), but overall I (at least personally) like the idea. 17:18 < pvh_sa> kellrott, agreed 100% 17:18 < kellrott> Obviously I'm open to peoples critiques 17:19 < pvh_sa> the Galaxy execution engine is going to need replacing, its just a question of with what, and how can it be done in the least disruptive way 17:19 < kellrott> This has become the thing that it the 'rate limiter' for adoption by tool writers around here 17:20 < kellrott> are calculations demand robust parallel programming platforms 17:22 < jmchilton> huh... I see the need to support things other than MPI, but would you say there are problems with our current MPI support (essentially delegated external resource manager). 17:23 < kellrott> well, MPI is supported under mesos ;-) 17:23 < dannon> THe main problem here is that we wanted something that supports parallelization for tools that don't natively do it. 17:24 < dannon> new runners for different execution models, totally fine. 17:25 < pvh_sa> dannon, that's something that needs to cooperation of the execution engine and the type system, right? that's kind of how it works at the moment - tools state that certain inputs are splittable (i.e. are actually collections) 17:26 < dannon> Right. I could imagine the parallelization description langugae being fleshed out more to say "if mesos, do this, else do naieve file chunking, etc." 17:26 < pvh_sa> of course that needs extending so that collections of parameters can be provided as inputs, so that tools can explore a parameter space 17:27 < dannon> But at the core, it needs to be abstract 17:27 < kellrott> I think strait file chunking will only cover a fraction of parallalization needs 17:27 < dannon> kellrott: Absolutely, file chunking is designed for embarassingly parallel problems without shared memory/etc 17:27 < dannon> It just won't work for some things. 17:28 < pvh_sa> kellrott, is mesos also scheduling work parcels? or does it not do scheduling? 17:29 < jmchilton> My understanding is its a framework that you can plugin scheduling into... 17:30 < kellrott> you write a small 'executor' that the system starts offering resources to 17:31 < kellrott> then you can accept them, and you scripts are run on the remote machines 17:31 < pvh_sa> kellrott, ok... so the scheduling is external 17:31 < kellrott> like a queue system, but a two-way dialog rather then a single sided converation 17:32 < pvh_sa> your "executor" must handle what to do with resource offers, right? so it matches the resource offers to work that needs doing 17:32 < kellrott> I believe so 17:33 < pvh_sa> not sure what other people's sense is... 17:34 < kellrott> Sorry, flipped the lingo, you write a scheduler that then has the system launch executors based on resource offers 17:34 < kellrott> https://github.com/apache/mesos/blob/master/docs/App-Framework-Development-G... 17:34 < mrscribe> Title: mesos/docs/App-Framework-Development-Guide.md at master · apache/mesos · GitHub (at github.com) 17:34 < pvh_sa> ok dokey. so conceivably you could use celery with a mesos backend or something 17:35 < kellrott> Exactly, there is already a Torque framework 17:35 < pvh_sa> one of my bugbears with galaxy's execution engine at the moment is that it schedules tasks, not workflows... workflows simply become a set of tasks at execution time 17:37 < jmchilton> pvh_sa: What should Galaxy be doing instead? 17:38 < jmchilton> i.e. how does one "schedule a workflow" if not by individual tasks? 17:38 < pvh_sa> jmchilton, workflows should exist as objects in the execution environment so that you can e.g. pause and restart them 17:38 < pvh_sa> once you've got that kind of model, you can also re-execute failed tasks... at the moment if one task fails, i have to restart the workflow from scratch.... 17:39 < dannon> pvh_sa: That's the plan 17:39 < jmchilton> You can restart workflows now, dannon has added that. 17:39 < kellrott> I kind of like the 'separate tasks' workflow model, because I'm looking into managing 'workflows' using external clients through the API 17:39 < dannon> Actually, right now, you can restart workflows. 17:39 < dannon> yep. 17:39 < kellrott> that way I can read a workflow file, and then use the query engine to find previously run jobs that could fullfill the input needs of that workflow 17:40 < jmchilton> Pausing workflows seems like a reasonable request though... maybe more advanced UI that lets you visualize and interact with the workflow as a unit. 17:40 < dannon> And I think it's actually great that that sort of thing *does* run externally Kyle 17:40 < pvh_sa> jmchilton, yep, something UI related would be good for that. 17:40 < dannon> Very near up on my list is picking up some previous work with galaxy/messaging to integrate celery tasks instead of having all these status loops 17:41 < pvh_sa> dannon, yup for celery 17:41 < kellrott> it lets me restart workflows, even if is a completley different run of the workflow, and I 'forgot' about the original 17:41 < dannon> Yep, using celery, for sure. 17:41 < pvh_sa> i'm also thinking that we want to reason across provenance graphs 17:42 < pvh_sa> so e.g. a certain transform need not be repeated if it already exists in the set of data products... e.g. we've already created that BLAST db, just re-use it 17:42 < kellrott> thats exactly what I want (and parallel tools ;-) 17:42 < dannon> Right, and I think that's the sort of thing Kyle's query work will really enable. 17:43 < pvh_sa> kellrott, have you read the paper on using... datalog i think its called... for provenance recording? 17:43 < kellrott> I'm pretty sure we can do all of this via the api 17:43 < dannon> That's the idea, at least. 17:43 < kellrott> I don't think I've seen that one 17:43 < pvh_sa> yeah the API is king... a powerful API also lets us cleanly split UI from backend 17:44 < dannon> Exactly 17:44 < kellrott> I'm not even thinking UI. I'm thinking about script controlled workflows 17:44 < pvh_sa> "Datalog as a Lingua Franca for provenance querying and reasoning" 17:45 < pvh_sa> seems lots of talking the same language going on. yay. i'm busy hiring developers to work on exactly this kind of stuff 17:45 < pvh_sa> but yeah my 3-part agenda is basically 1) type system 2) execution engine 3) provenance representation 17:46 < kellrott> We are building toward running workflows on TCGA data, so I need a system where I can call a mutation calling workflow ~4000 times 17:47 < jmchilton> pvn_sa: Enhancing Galaxy or replacing it? :) 17:47 < pvh_sa> jmchilton, ripping out bits, keeping the project together though 17:47 < kellrott> Hopefully enhancing Galaxy 17:47 < pvh_sa> jmchilton, you've got 130 000 lines of code 17:47 < pvh_sa> and a vibrant community 17:48 < pvh_sa> any way forward needs to build on that. there's lots to do, but galaxy is like a central beacon we can all gravitate towards 17:49 < pvh_sa> anyway, the current galaxy is already v2.0 compared to the original galaxy described in the literature On Sat, Oct 26, 2013 at 2:43 PM, Kyle Ellrott <kellrott@soe.ucsc.edu> wrote:
I think one of the aspects where Galaxy is a bit soft is the ability to do distributed tasks. The current system of split/replicate/merge tasks based on file type is a bit limited and hard for tool developers to expand upon. Distributed computing is a non-trival thing to implement and I think it would be a better use of our time to use an already existing framework. And it would also mean one less API for tool writers to have to develop for. I was wondering if anybody has looked at Mesos ( http://mesos.apache.org/ ). You can see an overview of the Mesos architecture at https://github.com/apache/mesos/blob/master/docs/Mesos-Architecture.md The important thing about Mesos is that it provides an API for C/C++, Java/Scala and Python to write distributed frameworks. There are already implementations of frameworks for common parallel programming systems such as: - Hadoop (https://github.com/mesos/hadoop) - MPI (https://github.com/apache/mesos/blob/master/docs/Running-torque-or-mpi-on-me...) - Spark (http://spark-project.org) And you can find example Python framework at https://github.com/apache/mesos/tree/master/src/examples/python
Integration with Galaxy would have three parts: 1) Add a system config variable to Galaxy called 'MESOS_URL' that is then passed to tool wrappers and allows them to contact the local mesos infrastructure (assuming the system has been configured) or pass a null if the system isn't available. 2) Write a tool runner that works as a mesos framework to executes single cpu jobs on the distributed system. 3) For instances where mesos is not available at a system wide level (say they only have access to an SGE based cluster), but the user wants to run distributed jobs, write a wrapper that can create a mesos cluster using the existing queueing system. For example, right now I run a Mesos system under the SGE queue system.
I'm curious to see what other people think.
Kyle
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/