Job scheduling: FIFO, or fairer to multiple users?
Hello all,

I'm curious whether there is any way to adjust Galaxy's job queuing to be 'fairer' to multiple simultaneous users. My impression is that Galaxy itself uses a simple FIFO queue, with cluster jobs offloaded to the cluster queue immediately.

In our case I'm looking at large BLAST jobs (e.g. 20k queries against NR), which by their nature are easily subdivided between nodes (by dividing up the query file). We run these as one sub-job per node (giving each sub-job multiple cores for threading). That works nicely - the question I am currently pondering is how to tune the split strategy, and how to handle multiple users. Specifically, we get queue blocking if any one large BLAST job is divided into as many or more sub-jobs than we have cluster nodes in the BLAST queue: one user's big BLAST job can stop several other users' small BLAST jobs from even starting.

I appreciate that whether this is a problem will depend on the typical jobs run on each Galaxy instance, and on the number and size of nodes in the local cluster - which makes a one-size-fits-all strategy hard.

I know that in order to be back-end agnostic, Galaxy takes limited advantage of the different cluster backends - but perhaps the new 'run jobs as user' functionality might help here, by letting the cluster itself balance jobs between users? Is anyone doing that already?

Another idea would be for Galaxy to manage its job queue on a per-user basis. Currently Galaxy submits all its jobs to the cluster straight away, which can build up a backlog of pending jobs whose scheduling is then out of Galaxy's control (probably simple FIFO, depending on the cluster). Rather than handing queued jobs to the cluster immediately, Galaxy could cache them and submit them gradually (monitoring the cluster queue to see when it needs topping up). That would let Galaxy interleave jobs from different users - or apply any other queuing strategy. Too complicated?

I think this is only a problem when the number of cluster nodes (in any given queue) is similar to or smaller than the number of parts a job might be broken up into. My guess is that the public Galaxy doesn't do much job splitting (this code is quite new and not many of the wrappers exploit it), and it has a large cluster.

Is anyone else running into this kind of issue? Perhaps where Galaxy users are in competition with other cluster users?

Thanks,

Peter
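
P.S. For context, splitting the query file is the easy part. A minimal Biopython sketch along these lines would do it (illustrative only - the chunk size and file naming are made up, and this is not the actual splitting code our wrapper uses):

from Bio import SeqIO

def split_fasta(filename, queries_per_chunk=1000):
    """Write the query records into numbered chunk files, return their names.

    Purely a sketch - in practice the chunk size would be derived from the
    number of queries and the number of cluster nodes available.
    """
    chunk_names = []
    batch = []

    def flush():
        # Write the current batch of records out as the next chunk file.
        name = "%s.chunk%03i" % (filename, len(chunk_names))
        SeqIO.write(batch, name, "fasta")
        chunk_names.append(name)

    for record in SeqIO.parse(filename, "fasta"):
        batch.append(record)
        if len(batch) == queries_per_chunk:
            flush()
            batch = []
    if batch:
        flush()
    return chunk_names

# Each chunk then becomes one cluster sub-job, e.g.
#   blastp -query queries.fasta.chunk000 -db nr -num_threads 8 ...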
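
P.P.S. To make the per-user idea a bit more concrete, here is a rough Python sketch of the kind of thing I mean - this is not Galaxy code, and submit_to_cluster() / count_pending_cluster_jobs() are hypothetical stand-ins for whatever the job runner and DRMAA layer really provide. Galaxy would hold one internal FIFO per user and only top up the cluster when its own pending count drops, taking jobs round-robin across users:

import collections

MAX_PENDING = 8  # cap on how many Galaxy jobs are left pending on the cluster

# Hypothetical stand-ins for the job runner / cluster interface:
def submit_to_cluster(job):
    print("Submitting %s for user %s" % (job.name, job.user))

def count_pending_cluster_jobs():
    return 0  # would really query the cluster queue (qstat, DRMAA, etc.)

class JobRequest(object):
    def __init__(self, user, name):
        self.user = user
        self.name = name

# One FIFO queue per user, held inside Galaxy instead of on the cluster:
user_queues = collections.defaultdict(collections.deque)

def enqueue(job):
    user_queues[job.user].append(job)

def top_up():
    """Round-robin over users with waiting jobs until the cap is reached."""
    slots = MAX_PENDING - count_pending_cluster_jobs()
    while slots > 0:
        waiting = [queue for queue in user_queues.values() if queue]
        if not waiting:
            break
        for queue in waiting:
            if slots <= 0:
                break
            submit_to_cluster(queue.popleft())
            slots -= 1

# Example: one user's six sub-jobs and another user's two jobs get
# interleaved, instead of the whole big job going to the cluster first:
for i in range(6):
    enqueue(JobRequest("alice", "big_blast_part_%i" % i))
enqueue(JobRequest("bob", "small_blast"))
enqueue(JobRequest("bob", "fastqc"))
top_up()  # in Galaxy this would be called periodically by the job manager

Even something this crude would stop one user's 200 BLAST sub-jobs from sitting in the cluster queue ahead of everyone else's work; a proper fair-share policy could replace the round-robin loop later.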