Job scheduling: FIFO, or fairer to multiple users?
Hello all,

I'm curious whether there is any way to adjust Galaxy's job queuing to be 'fairer' to multiple simultaneous users. My impression is that Galaxy itself uses a simple FIFO queue, with cluster jobs offloaded to the cluster queue immediately.

In our case I'm looking at large BLAST jobs (e.g. 20k queries against NR), which by their nature are easily subdivided between nodes (by dividing up the query file). We run these as one sub-job per node (giving each sub-job multiple cores for threading). That works nicely - the question I am currently pondering is how to tune the split strategy, and how to handle multiple users. Specifically, we get queue blocking if any one large BLAST job is divided into as many or more sub-jobs than we have cluster nodes in the BLAST queue: one user's big BLAST job can stop several other users' small BLAST jobs from even starting.

I appreciate that whether this is a problem will depend on the typical jobs run on each Galaxy instance, and on the number and size of nodes in the local cluster - which makes a one-size-fits-all strategy hard.

I know that in order to be back-end agnostic, Galaxy takes limited advantage of the different cluster backends - but perhaps the new 'run jobs as user' functionality might help here, by letting the cluster itself balance jobs between users? Is anyone doing that already?

Another idea would be for Galaxy to manage its job queue on a per-user basis. Currently Galaxy submits all its jobs to the cluster straight away, which can build up a backlog of pending jobs whose scheduling is then out of Galaxy's control (probably simple FIFO, depending on the cluster). Rather than handing queued jobs to the cluster immediately, Galaxy could cache them and submit them gradually (monitoring the cluster queue to see when it needs topping up). That would let Galaxy interleave jobs from different users - or apply any other queuing strategy. Too complicated?

I think this is only a problem when the number of cluster nodes (in any given queue) is similar to or smaller than the number of parts a job might be broken up into. My guess is that the public Galaxy doesn't do much job splitting (this code is quite new and not many of the wrappers exploit it), and it has a large cluster.

Is anyone else running into this kind of issue? Perhaps where Galaxy users are in competition with other cluster users?

Thanks,

Peter
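
P.S. For context, splitting the query file is the easy part. A minimal Biopython sketch along these lines would do it (illustrative only - the chunk size and file naming are made up, and this is not the actual splitting code our wrapper uses):

from Bio import SeqIO

def split_fasta(filename, queries_per_chunk=1000):
    """Write the query records into numbered chunk files, return their names.

    Purely a sketch - in practice the chunk size would be derived from the
    number of queries and the number of cluster nodes available.
    """
    chunk_names = []
    batch = []

    def flush():
        # Write the current batch of records out as the next chunk file.
        name = "%s.chunk%03i" % (filename, len(chunk_names))
        SeqIO.write(batch, name, "fasta")
        chunk_names.append(name)

    for record in SeqIO.parse(filename, "fasta"):
        batch.append(record)
        if len(batch) == queries_per_chunk:
            flush()
            batch = []
    if batch:
        flush()
    return chunk_names

# Each chunk then becomes one cluster sub-job, e.g.
#   blastp -query queries.fasta.chunk000 -db nr -num_threads 8 ...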
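
P.P.S. To make the per-user idea a bit more concrete, here is a rough Python sketch of the kind of thing I mean - this is not Galaxy code, and submit_to_cluster() / count_pending_cluster_jobs() are hypothetical stand-ins for whatever the job runner and DRMAA layer really provide. Galaxy would hold one internal FIFO per user and only top up the cluster when its own pending count drops, taking jobs round-robin across users:

import collections

MAX_PENDING = 8  # cap on how many Galaxy jobs are left pending on the cluster

# Hypothetical stand-ins for the job runner / cluster interface:
def submit_to_cluster(job):
    print("Submitting %s for user %s" % (job.name, job.user))

def count_pending_cluster_jobs():
    return 0  # would really query the cluster queue (qstat, DRMAA, etc.)

class JobRequest(object):
    def __init__(self, user, name):
        self.user = user
        self.name = name

# One FIFO queue per user, held inside Galaxy instead of on the cluster:
user_queues = collections.defaultdict(collections.deque)

def enqueue(job):
    user_queues[job.user].append(job)

def top_up():
    """Round-robin over users with waiting jobs until the cap is reached."""
    slots = MAX_PENDING - count_pending_cluster_jobs()
    while slots > 0:
        waiting = [queue for queue in user_queues.values() if queue]
        if not waiting:
            break
        for queue in waiting:
            if slots <= 0:
                break
            submit_to_cluster(queue.popleft())
            slots -= 1

# Example: one user's six sub-jobs and another user's two jobs get
# interleaved, instead of the whole big job going to the cluster first:
for i in range(6):
    enqueue(JobRequest("alice", "big_blast_part_%i" % i))
enqueue(JobRequest("bob", "small_blast"))
enqueue(JobRequest("bob", "fastqc"))
top_up()  # in Galaxy this would be called periodically by the job manager

Even something this crude would stop one user's 200 BLAST sub-jobs from sitting in the cluster queue ahead of everyone else's work; a proper fair-share policy could replace the round-robin loop later.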