Glad to see someone else is playing around with Mesos.
I have a Mesos branch that is getting a little long in the tooth. I'd like to get a straight job runner (non-LWR, with a shared file system) running under Mesos for Galaxy before I submit that work as a pull request.

The hackathon is only 12 days away! Hopefully we'll be able to make some progress on these sorts of projects.

Kyle



On Sun, Jun 15, 2014 at 4:06 PM, John Chilton <jmchilton@gmail.com> wrote:
Hey Kyle, all,

  If anyone wants to play with running Galaxy jobs within an Apache
Mesos environment I have added a prototype of this feature to the LWR.

https://bitbucket.org/jmchilton/lwr/commits/555438d2fe266899338474b25c540fef42bcece7
https://bitbucket.org/jmchilton/lwr/commits/9748b3035dbe3802d4136a6a1028df8395a9aeb3

This work distributes jobs across a Mesos cluster and injects a
MESOS_URL environment variable into the job runtime environment in
case the jobs themselves want to take advantage of Mesos.
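
As a rough illustration of the MESOS_URL idea above (not the actual LWR
code), here is a minimal Python sketch of both sides: a runner building a
job environment with MESOS_URL set, and a job checking whether it is
there. The helper names build_job_env and mesos_url_from_env are
hypothetical and not part of the LWR or Galaxy.

```python
import os


def build_job_env(mesos_url, base_env=None):
    """Return a copy of the environment with MESOS_URL set.

    Hypothetical helper: if mesos_url is empty or None, MESOS_URL is
    removed so the job can tell that Mesos is not available.
    """
    env = dict(os.environ if base_env is None else base_env)
    if mesos_url:
        env["MESOS_URL"] = mesos_url
    else:
        env.pop("MESOS_URL", None)
    return env


def mesos_url_from_env(env=None):
    """Return the Mesos URL visible to the job, or None if unset or empty."""
    env = os.environ if env is None else env
    return env.get("MESOS_URL") or None
```

A tool wrapper could call mesos_url_from_env() at startup and fall back
to plain single-node execution when it returns None.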

The advantage of the LWR versus a traditional Galaxy runner is that
the job can be staged to remote resources without shared disk. Prior
to this I was imagining the LWR to be useful in cases where Galaxy and
the remote cluster don't share a common disk but where there is in fact
a shared scratch directory or something across the remote cluster, as
well as a resource manager. The LWR Mesos framework however has the
actual compute servers themselves stage the job up and down - so you
could imagine distributing Galaxy across large clusters without any
shared disk whatsoever - that could be very cool and help scale, say,
cloud applications.

Downsides of an LWR-based approach versus a Galaxy approach are that it
is less mature and there is more stuff to configure - need to
configure a Galaxy job_conf plugin and destination, need to configure
the LWR itself, need to configure a message queue (for this variant of
LWR operation anyway - it should be possible to drive this via the LWR
in web server mode but I haven't added it yet). I would be more than
happy to continue to see progress toward Mesos support in Galaxy
proper.

It is strictly a prototype so far - a sort of playground if anyone
wants to play with these ideas and build something cool. It really is
a "framework" right - not so much a job scheduler so I am not sure it
is very immediately useful - but I imagine one could build cool stuff
on top of it.

Next, I think I would like to add Apache Aurora
(http://aurora.incubator.apache.org/) support - because it seems like
a much more traditional resource manager but built on top of Mesos so
it would be more practical for traditional Galaxy-style jobs. Doesn't
buy you anything in terms of parallelization but it would "fit better"
with Galaxy.

-John


On Sat, Oct 26, 2013 at 2:43 PM, Kyle Ellrott <kellrott@soe.ucsc.edu> wrote:
> I think one of the aspects where Galaxy is a bit soft is the ability to do
> distributed tasks. The current system of split/replicate/merge tasks based
> on file type is a bit limited and hard for tool developers to expand upon.
> Distributed computing is a non-trivial thing to implement and I think it
> would be a better use of our time to use an already existing framework. And
> it would also mean one less API for tool writers to have to develop for.
> I was wondering if anybody has looked at Mesos ( http://mesos.apache.org/ ).
> You can see an overview of the Mesos architecture at
> https://github.com/apache/mesos/blob/master/docs/Mesos-Architecture.md
> The important thing about Mesos is that it provides an API for C/C++,
> Java/Scala and Python to write distributed frameworks. There are already
> implementations of frameworks for common parallel programming systems such
> as:
>  - Hadoop (https://github.com/mesos/hadoop)
>  - MPI
> (https://github.com/apache/mesos/blob/master/docs/Running-torque-or-mpi-on-mesos.md)
>  - Spark (http://spark-project.org)
> And you can find an example Python framework at
> https://github.com/apache/mesos/tree/master/src/examples/python
>
> Integration with Galaxy would have three parts:
> 1) Add a system config variable to Galaxy called 'MESOS_URL' that is then
> passed to tool wrappers and allows them to contact the local mesos
> infrastructure (assuming the system has been configured) or pass a null if
> the system isn't available.
> 2) Write a tool runner that works as a Mesos framework to execute
> single-CPU jobs on the distributed system.
> 3) For instances where mesos is not available at a system wide level (say
> they only have access to an SGE based cluster), but the user wants to run
> distributed jobs, write a wrapper that can create a mesos cluster using the
> existing queueing system. For example, right now I run a Mesos system under
> the SGE queue system.
>
> I'm curious to see what other people think.
>
> Kyle