On Wed, Feb 3, 2016 at 10:09 AM, Nikolay Aleksandrov Vazov < n.a.vazov@usit.uio.no> wrote:
Hi, Nate,
Yes, we are using slurmdbd here. So by controllers, if I understand correctly, you mean the controller machines of each cluster, which would all connect to (share) the same slurmdbd.
Correct.
And one last question:
Your GitHub instructions mention using Slurm >= 14.11. We are running Slurm 14.03 and shall have to recompile it. By recompilation, do you mean that we have to recompile both the server-side Slurm and the client Slurm on the submit hosts?
This is a bit tricky. If you don't need to specify multiple clusters when submitting (e.g. with `--clusters=cluster1,cluster2`) then it works like this: you will need to recompile Slurm with the patch shown on GitHub. You don't actually have to install this recompiled version if you don't want to; you only need to make sure that slurm-drmaa uses this recompiled version's libslurmdb.so at runtime. There are a variety of ways to do this, a couple of which I list in the instructions. Or, if you don't mind using the modified version in place of your existing version, you can just install it so that the "default" libslurmdb.so is compatible. In that case, no runtime linker tricks should be necessary. You do not have to compile slurm-drmaa against the modified version.

If you *do* need multicluster submission support, you have to compile slurm-drmaa against a copy of the (unmodified) Slurm source code for access to the private headers contained within. Once done, it works the same as above - you still need to compile a modified libslurmdb.so. The version of libslurmdb.so used by slurm-drmaa at runtime is the key.

In both cases, this only needs to be done on the submission host. No (controller) server modifications are necessary.
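For example, one way to do the runtime linker trick might look like this (the paths here are illustrative, not from the thread):

```shell
# Illustrative only - adjust paths to your site. Build the patched Slurm
# into its own prefix so the system installation is untouched, e.g.:
#   ./configure --prefix=/opt/slurm-patched && make && make install
#
# Then have slurm-drmaa resolve the patched libslurmdb.so at runtime by
# putting its lib directory first on the library search path for the
# submitting process (e.g. the Galaxy job handler's environment):
export LD_LIBRARY_PATH=/opt/slurm-patched/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
```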
Best regards
Nikolay
===============
Nikolay Vazov, PhD
Department for Research Computing, University of Oslo
------------------------------
From: Nate Coraor <nate@bx.psu.edu>
Sent: 03 February 2016 15:56
To: Nikolay Aleksandrov Vazov
Cc: Ganote, Carrie L; John Chilton; dannon.baker@gmail.com; galaxy-dev@lists.galaxyproject.org
Subject: Re: Galaxy sending jobs to multiple clusters
On Tue, Feb 2, 2016 at 9:18 AM, Nikolay Aleksandrov Vazov < n.a.vazov@usit.uio.no> wrote:
Many thanks to all of you!!
Definitely Nate's approach is the better choice. We are running Slurm 14.03, but Nate's instructions are thorough enough for us to recompile even our existing version. (I don't know how we can do this on a running cluster, though :) I will most probably go for this solution.
There is a sentence in Nate's answer I don't really understand:
"... using `--clusters` means you have to have your controllers integrated using slurmdbd, ..."
what do you mean by this, Nate?
You have to run slurmdbd (it's normally optional), and your Slurm controllers must connect to a single slurmdbd instance. This is Slurm's accounting server. Here's the documentation:
http://slurm.schedmd.com/accounting.html
The setup is relatively simple, you just need to have a MySQL (or derivative) server for it to store records in.
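As a rough sketch of that setup (hostnames and credentials below are placeholders):

```
# slurmdbd.conf on the accounting host (illustrative values)
DbdHost=slurmdbd.example.org
StorageType=accounting_storage/mysql
StorageHost=mysql.example.org
StorageUser=slurm
StoragePass=changeme

# slurm.conf on each cluster's controller, pointing at the shared slurmdbd
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdbd.example.org
```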
--nate
Carrie, I don't actually get how you implemented the hack: did you duplicate the `DRMAAJobRunner` class under a different name in drmaa.py? And where do you define each additional cluster (controller machine)?
Can you give me some more details?
Thank you
Nikolay
===============
Nikolay Vazov, PhD
Department for Research Computing, University of Oslo
------------------------------
From: Nate Coraor <nate@bx.psu.edu>
Sent: 01 February 2016 17:28
To: Ganote, Carrie L
Cc: John Chilton; Nikolay Aleksandrov Vazov; dannon.baker@gmail.com; galaxy-dev@lists.galaxyproject.org
Subject: Re: Galaxy sending jobs to multiple clusters
Hi Nikolay,
It's worth noting that using `--clusters` means you have to have your controllers integrated using slurmdbd, and they must share munge keys. You can set up separate destinations as in Carrie's example without having to "integrate" your controllers at the slurm level. The downside of this approach is that you can't have slurm automatically "balance" across clusters, although Slurm's algorithm for doing this with `--clusters` is fairly primitive. If you don't use `--clusters` you can attempt to do the balancing with a dynamic job destination.
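As an illustration of the dynamic-destination idea, a naive balancing function might look like this (the destination ids and the in-memory queue-depth bookkeeping are invented for the example; a real function would query each cluster, e.g. via squeue, instead):

```python
# Illustrative sketch of a Galaxy "dynamic" job destination rule that
# balances between two clusters. Everything named here is an assumption,
# not Galaxy's or Slurm's actual API.

QUEUE_DEPTH = {"slurm_cluster1": 0, "slurm_cluster2": 0}

def balance_clusters(job):
    """Return the id of the currently less-loaded destination."""
    dest = min(QUEUE_DEPTH, key=QUEUE_DEPTH.get)
    QUEUE_DEPTH[dest] += 1  # naive accounting for jobs we just routed
    return dest
```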
If you're not using slurmdbd, you may still need to share the same munge key across clusters to allow the slurm client lib on the Galaxy server to talk to both clusters. There could be ways around this if it's a problem, though.
--nate
On Mon, Feb 1, 2016 at 11:10 AM, Ganote, Carrie L <cganote@iu.edu> wrote:
Hi Nikolay,
The slurm branch that John mentioned sounds great! That might be your best bet. I didn't get drmaa to run with multiple clusters with flags, but I did 'assign' different job handlers to different destinations in the drmaa.py runner in Galaxy - but that is a bit of a hacky way to do it.
-Carrie
From: John Chilton <jmchilton@gmail.com>
Date: Monday, February 1, 2016 at 11:02 AM
To: Nikolay Aleksandrov Vazov <n.a.vazov@usit.uio.no>
Cc: "dannon.baker@gmail.com" <dannon.baker@gmail.com>, "galaxy-dev@lists.galaxyproject.org" <galaxy-dev@lists.galaxyproject.org>, Carrie Ganote <cganote@iu.edu>, Nate Coraor <nate@bx.psu.edu>
Subject: Re: Galaxy sending jobs to multiple clusters
Nate has a branch of slurm-drmaa that allows specifying a `--clusters` argument in the native specification; this can be used to target multiple clusters.
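In Galaxy's job_conf.xml, a destination using that native specification might look roughly like this (the destination id and cluster names are placeholders, not from the thread):

```xml
<destination id="multi_cluster_slurm" runner="slurm">
    <!-- Hypothetical example: cluster names must match the clusters
         registered with your shared slurmdbd -->
    <param id="nativeSpecification">--clusters=cluster1,cluster2</param>
</destination>
```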
More information can be found here:
https://github.com/natefoo/slurm-drmaa
Here is how Nate uses it to configure usegalaxy.org:
https://github.com/galaxyproject/usegalaxy-playbook/blob/master/templates/ga...
I guess that instead of installing slurm-drmaa from a package manager or from the default source, you will just need to install Nate's version.
-John
On Wed, Jan 20, 2016 at 1:18 PM, Nikolay Aleksandrov Vazov <n.a.vazov@usit.uio.no> wrote:
Hi, John, Dan, Carrie and all others,
I am considering the task of setting up a Galaxy instance which shall send jobs to more than one cluster at a time. In my case I am using drmaa-python, and I was wondering if it is possible to configure multiple drmaa runners, each "pointing" at a different (Slurm) control host, e.g.
local
drmaa1
drmaa2
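i.e. something like this hypothetical job_conf.xml fragment (all ids are invented; how each plugin would be told which controller to talk to is exactly my question):

```xml
<plugins>
    <!-- Two instances of the same runner class, one per cluster.
         How each instance is pointed at its own Slurm controller
         is the open question. -->
    <plugin id="drmaa1" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner"/>
    <plugin id="drmaa2" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner"/>
</plugins>
<destinations default="local">
    <destination id="local" runner="local"/>
    <destination id="cluster1" runner="drmaa1"/>
    <destination id="cluster2" runner="drmaa2"/>
</destinations>
```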
Thanks a lot for your advice
Nikolay
===============
Nikolay Vazov, PhD
Department for Research Computing, University of Oslo