Hello James,
Thanks for your reply.
James Taylor wrote, On 01/21/2010 07:24 PM:
>> I have 7 running jobs (of user-A), 35 queued jobs from user-A and 2
>> queued jobs from user-B, and galaxy consistently chooses to run queued
>> jobs from user-A instead of user-B (presumably because they were queued
>> before user-B submitted the jobs).
> You've nailed it. The "queuing policy" determines the order in which the
> jobs are handed to the runner. If a user queues a bunch of jobs while no
> one else is waiting, and then another user comes along, they have to wait.
I'd just like to make a small distinction:
From the user's point of view, there are only two states:
"new" (i.e. not running) and "running".
As a user, I'd expect fair queuing, where all the "new" jobs are executed
fairly among users.
Internally, Galaxy has two "not running" states: "new" and "queued".
The "new" jobs are scheduled fairly, the "queued" ones are not (they
are FIFOed).
While I'm in no position to say how things should work, I think a single,
fairly scheduled queue is preferable.
> This is definitely not ideal, however unless you are running *only*
> local job runners it is tricky to deal with because really it is the
> underlying queuing system that decides what order jobs actually get run in.
I'm not sure I understand this, because my scheduling background is severely
lacking. Why not (naively) assume that the underlying queuing system is FIFO,
and make sure you submit the jobs to it in the order you want them executed
(that is - fairly)?
For a local runner, we agree this is the correct solution.
But for your PBS cluster, does it use something other than FIFO (based on the
order in which galaxy runs "qsub")?
Do you have a way to tell your PBS which session a job came from, so it will
fairly schedule them AFTER submission?
I could be missing something huge here, but aren't "queued" jobs being sent
to PBS/SGE in FIFO order? Meaning: jobs coming from a parallel workflow will
still get ahead in the queue.
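Just to illustrate what I mean by "submit fairly to a FIFO backend", here is a
rough sketch (not Galaxy code; the 'job.user' attribute is made up): take the
waiting jobs and interleave them round-robin by user before handing them to
qsub or the local runner:

    from collections import defaultdict
    from itertools import zip_longest  # izip_longest on python 2.x

    def fair_order(waiting_jobs):
        # Interleave waiting jobs round-robin by user, keeping each
        # user's own submission order.
        per_user = defaultdict(list)
        for job in waiting_jobs:          # assumed oldest-first
            per_user[job.user].append(job)
        ordered = []
        for batch in zip_longest(*per_user.values()):
            ordered.extend(j for j in batch if j is not None)
        return ordered

With 35 waiting jobs from user-A and 2 from user-B, user-B's jobs would be
submitted 2nd and 4th instead of 36th and 37th.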
> However, I thought you were already mapping Galaxy users onto cluster
> users, which should result in the cluster scheduler dealing with some of
> these issues.
Not exactly.
First, we still use the local runner for performance reasons, and plan to have
only "heavy" jobs run on the cluster.
Second, the patch I sent some time ago adds the user name as the job's name,
allowing us to keep track of which user started the job.
But all the jobs would still run under my SGE account.
(BTW, this is a huge selling point for our Galaxy - users can take advantage of the
cluster easily, without actually having to ask for a cluster unix account).
>> Is there a way to work around this issue ? a different configuration
>> maybe ?
>> Or, an old request: is it possible to limit the number of
>> jobs-per-user at any single time (i.e. a single user can run 3 jobs at
>> any given time, even if no other users are running jobs and there are
>> 7 workers ready) ?
> I'd like to hear other suggestions, but I think some substantial
> rewriting is the only way to deal with this.
> Fortunately, we're in the midst of a big rewrite of the job manager and
> scheduling, so this will definitely be something we work on.
If it's not too late to make some suggestions:
0. Fair-queuing of everything.
The fairness criteria can be debated, but in any case avoid the current FIFO
situation.
1. Prioritize "interactive" jobs over "workflow" jobs.
Reason:
If a user submits a job 'interactively' (by running a single tool and clicking
'execute'), they usually expect to get an answer fast (unless it's a long
mapping job).
If the job is part of a workflow (especially an NGS-related workflow), it is
reasonable to assume the user doesn't expect it to complete immediately.
(Sounds similar to the classical batch vs interactive scheduling on unix).
This is a critical problem for my Galaxy server:
some users use it interactively to investigate and 'play around' with data,
running a job, looking at the output, changing a parameter, running it again, etc., etc.
Other users upload five FASTQ files, start five 25-step workflows, and walk away - knowing
that they'll have the results in a couple of hours.
The "interactive" users are *very* frustrated (and rightfully so).
2. Create 'classes' of tools: light/medium/heavy (or similar).
'light' tools are prioritized over 'heavy' tools (heavy usually means CPU
bound).
Reason:
Users who run light jobs (e.g. text tools, filter, join, visualization, simple interval
operations) can expect them to complete faster than heavy jobs (like mapping or huge
interval intersections). This is especially true when the jobs are interactive, but is
also reasonable for jobs which are part of workflows.
Of course, there are always exceptions: a multi-column sort on a huge text file,
or a mapping job on a tiny FASTA file.
But the general rule holds (IMHO).
A more complicated criterion:
a heuristic based on the "weight" of the tool and the size of the input
dataset.
The "weight" settings could be kept in tool_conf.xml or something similar.
3. Limit on number of jobs-per-user.
Maybe not so critical for a big dedicated cluster, but crucial for the local
job runner, and also for a cluster on which I have to "compete" for slots with
other (non-galaxy) cluster users.
If given the choice, I'd rather give up some perceived performance (a user on
an idle galaxy cannot fill it with 20 parallel jobs) and gain better
responsiveness (new users can always start jobs relatively fast). Sort of like
throughput vs. latency.
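A sketch of what I have in mind, assuming the dispatcher can count a user's
running jobs (the names below are invented, and the limit would come from the
config file):

    MAX_JOBS_PER_USER = 3   # e.g. a new setting in universe_wsgi.ini

    def next_dispatchable(waiting_jobs, running_count_for):
        # Return the first waiting job whose owner is below the cap.
        # running_count_for(user) is an invented helper that would query
        # the job table for that user's currently running jobs.
        for job in waiting_jobs:          # already fairly ordered
            if running_count_for(job.user) < MAX_JOBS_PER_USER:
                return job
        return None                       # everyone is at their limit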
4. Optional limit on jobs-per-tool.
This might be too specialized, but it has come up a couple of times for me.
Say there is a limited resource, and I've added a Galaxy tool that uses that
resource.
Examples: a tool that uploads files to a remote FTP server (limited bandwidth),
a tool that uses special hardware devices, a tool that connects to a remote
computer and runs something over SSH (and I don't want to overload that
computer), a tool that requires 23GB of RAM, which I have to run locally on my
32GB RAM server instead of the 8GB cluster nodes (and can only run one at a
time).
I'm aware that all of these problems can be solved by properly installing an
SGE/PBS cluster manager, configuring a specialized queue, and tuning that
queue's scheduling - but to be honest: I don't want to be an SGE administrator.
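Within a single local-runner process this could be as simple as a per-tool
semaphore (a sketch only - the tool ids and limits are made up, and it would
need something else in a multi-process setup):

    import threading

    # hypothetical per-tool caps; unlisted tools are unlimited
    TOOL_LIMITS = {'ftp_upload': 1, 'big_ram_tool': 1, 'ssh_remote_run': 2}

    _semaphores = dict((tool, threading.BoundedSemaphore(n))
                       for tool, n in TOOL_LIMITS.items())

    def run_limited(tool_id, run_job):
        # run_job is a callable doing the real work; never run more than
        # TOOL_LIMITS[tool_id] of them concurrently.
        sem = _semaphores.get(tool_id)
        if sem is None:
            return run_job()
        with sem:
            return run_job()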
And some suggestions not related to queuing:
5. The ability to shut down galaxy without losing jobs.
I guess that with a cluster + job-recovery you never lose jobs when you stop
galaxy, but how do I do that with a local runner?
Maybe something like the "job-lock", which prevents new jobs from being started.
I suggested a patch long ago, but now I realize it will not work with a multi-process
environment, so something better is needed.
Perhaps unlike your public galaxy (which is very stable and rarely changed), my
local galaxy server changes constantly: we sometimes add several tools a week,
we frequently add new genomes, and we remove tools that don't work and replace
them with newer versions, etc.
I understand it's complicated to reload everything without stopping galaxy, so I need
a way to stop and start it without causing too much damage to my users (or sending funny
lab-wide emails asking: "please don't start any jobs this evening after
7pm").
6. Track the time a job spends in each state (new/queued/running/metadata).
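Even something as simple as stamping each transition would do (a sketch; the
'state_history' attribute is invented):

    import time

    def change_state(job, new_state):
        # Stamp every transition, e.g. job.state_history ends up as
        # [('new', t0), ('queued', t1), ('running', t2), ...]
        job.state_history.append((new_state, time.time()))
        job.state = new_state

    def time_per_state(history):
        # Seconds spent in each state, from adjacent stamps.
        durations = {}
        for (state, start), (_, end) in zip(history, history[1:]):
            durations[state] = durations.get(state, 0.0) + (end - start)
        return durations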
7. Better tracking of the "external/internal metadata" state. Call me crazy
("anal retentive" is also a term that comes to mind...), but I'm having a hard
time keeping track of jobs in the reports/unfinished_jobs page (and I do that
all the time). Jobs have completed and their process no longer appears in
"ps", but their state is still "running" (or something else?). When I see the
set-metadata scripts in "ps" I want to know where they come from: which job,
which user, everything.
If it's a job like any other job, I want to know all about it.
Especially if a set_metadata job takes minutes (and it does for large files) -
is that being counted somewhere?
If an external metadata job is run by other means (cloning histories?), I also
want to know about it.
In short, I need a report page that I can trust to show me exactly what is being executed
right now.
8. Silly, but I think it's true:
When the local runner executes new processes, they inherit all of its open file
descriptors, or at least the listening socket.
So when my galaxy process is stopped while jobs are still running, the port
stays taken, and restarting galaxy fails with a "socket bind" error.
I need to kill the jobs and only then restart galaxy (and I would much rather
let them finish and set the dataset state manually to "OK").
(Of course, galaxy should never crash or be stopped while jobs are running, but
I can't help it.)
Maybe it's possible to close the file descriptors before forking a new local
process?
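If the local runner spawns jobs with the subprocess module, I think passing
close_fds=True already does this; alternatively, the listening socket can be
marked close-on-exec (a sketch, not a patch):

    import fcntl
    import subprocess

    def spawn_job(command_line):
        # close_fds=True stops the job's process from inheriting the
        # listening socket (and every descriptor except stdin/out/err).
        return subprocess.Popen(command_line, shell=True, close_fds=True)

    def set_cloexec(sock):
        # Alternative: mark just the listening socket close-on-exec, so
        # exec'ed jobs never hold the port open even if other descriptors
        # are inherited.
        flags = fcntl.fcntl(sock.fileno(), fcntl.F_GETFD)
        fcntl.fcntl(sock.fileno(), fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)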
Thank you for reading so far,
Galaxy is still great, so please don't take my rants the wrong way,
-gordon