Dynamic job runner configuration followup
Following up on some recent threads that have referenced my dynamic job runner configuration work: Nate and I have discussed these issues, I have created a new pull request based on those discussions, and I am confident these changes will be accepted soon.

Things are basically as I outlined them in my previous description:

http://www.mail-archive.com/galaxy-dev@lists.bx.psu.edu/msg03010.html

except that the place to put the rules has changed. Instead of placing them in lib/galaxy/jobs/rules.py, you will now need to create a file (or multiple files) for them in the lib/galaxy/jobs/rules directory.

My previous e-mail was a technical description of how it worked; I think maybe that is why it didn't generate the excitement I had hoped :). I think describing some concrete use cases instead might be better. So here are six cool things you can do with dynamic job runners.

1) Change maximum walltime based on job parameters or file sizes.
2) Implement wildcard-like configuration of job runners instead of configuring one tool at a time.
3) Create queues with different priorities, and then give higher priorities to people giving demos (or directors or testers etc...).
4) Utilize environment variables to determine job runner configurations.
5) Limit a particular tool's use to only white-listed users.
6) Tie into Galaxy's job history tables to throttle those problem users clogging up your Galaxy instance.

To do any of these you will need to pull in the changes from bitbucket, add dynamic to the start_job_runners configuration option in universe_wsgi.ini, and create a file such as lib/galaxy/jobs/rules/200_runners.py for your rules.

Below I describe how to do these, though I haven't actually tested the code snippets, so they should be considered just an outline of the idea; your mileage may vary.

1) Change maximum walltime based on job parameters or file sizes.

Let's say you want to change the max walltime of BlastN based on the size of the input query.
First you would add the line ncbi_blastn_wrapper=dynamic:///python to universe_wsgi.ini. Next, in 200_runners.py, you would add a function such as the following:

    import os

    def ncbi_blastn_wrapper(job):
        # Map each input name to its dataset (history and library inputs).
        inp_data = dict( [ ( da.name, da.dataset ) for da in job.input_datasets ] )
        inp_data.update( [ ( da.name, da.dataset ) for da in job.input_library_datasets ] )
        query_file = inp_data[ "query" ].file_name
        query_size = os.path.getsize( query_file )
        # Queries larger than 1 MB get the longer walltime.
        if query_size > 1024 * 1024:
            return 'pbs:////-l walltime=24:00:00/'
        else:
            return 'pbs:////-l walltime=12:00:00/'

2) Implement wildcard-like configuration of job runners instead of configuring one tool at a time.

Let's say you have a coworker called J. Johnson, ummm wait, no, Jim J., and he maintains a tool suite for a fictitious metagenomics application called fathur. Assume also that this fathur suite has dozens of tools clogging up your configuration file because they all need to use pbs:////-l procs=8/ instead of the default pbs://///. To configure all the fathur tools at once, in the [app:main] section of universe_wsgi.ini you would change default_cluster_job_runner from pbs:///// to dynamic:///python/default_runner and then add the following function to 200_runners.py:

    def default_runner(tool_id):
        if tool_id.startswith('fathur_'):
            return 'pbs:////-l procs=8/'
        else:
            return 'pbs://///'

3) Create queues with different priorities, and then give higher priorities to people giving demos.

Let's say the users defined by the admin_users configuration property in universe_wsgi.ini are the ones who give demos and do testing, so you want to increase their priority for all jobs, and let's say that to do this you have created queues gx_normal and gx_important in your queue manager with differing priorities.
You could then take the default_runner concept from the previous example and do something like this:

    def default_runner(app, user_email):
        admin_users = app.config.get( "admin_users", "" ).split( "," )
        if user_email in admin_users:
            return 'pbs:///gx_important//'
        else:
            return 'pbs:///gx_normal//'

You could define the list of users right in this file instead of pulling it in from admin_users, and then apply this concept to give higher priority to directors, testers, or paying users. Alternatively, you could give lower priority to external users, people you just don't like, etc....

4) Utilize environment variables to determine job runner configurations.

Let's say you want cufflinks to always use as many cores as are available, but in your testing environment you only have 4 cores available whereas in production you have 16. Let's also say you have the environment variable MAX_CORES set, and that it will be different on each machine. You would then update universe_wsgi.ini to have cufflinks use the dynamic job config (cufflinks=dynamic:///python) and then add the following to 200_runners.py:

    import os

    def cufflinks():
        return 'pbs:////-l procs=%s/' % os.environ['MAX_CORES']

(Warning: you would also need to update the cufflinks wrapper to respect this environment variable.)

5) Limit a particular tool's use to only white-listed users.

Say you are developing top_secret_tool and you only want chilton@msi.umn.edu to be able to use it. You could add the line top_secret_tool=dynamic:///python/only_john to universe_wsgi.ini and then the following function to 200_runners.py:

    def only_john(user_email):
        if user_email == "chilton@msi.umn.edu":
            return 'pbs://///'
        else:
            return 'noway://///'

This is admittedly quite the hack, but it works.

6) Tie into Galaxy's job history tables to throttle those problem users clogging up your Galaxy instance.

Okay, I haven't actually worked through how to do this one, but I am sure it is possible.
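One possible shape for use case 6, as an untested sketch only: the rule asks how many jobs a user has run recently and, past some threshold, routes them to a lower-priority queue. The lookup against Galaxy's job tables is exactly the part that has not been worked out, so it is represented here by a hypothetical callable, count_recent_jobs, that is passed in. The queue name gx_throttled and the MAX_RECENT_JOBS threshold are likewise assumptions; gx_normal is the queue from use case 3.

```python
# Untested sketch: throttle heavy users by routing them to a slower queue.
# count_recent_jobs is a hypothetical stand-in for a query against Galaxy's
# job tables; gx_throttled and MAX_RECENT_JOBS are assumptions as well.

MAX_RECENT_JOBS = 50


def throttled_runner(user_email, count_recent_jobs):
    """Return a runner URL, demoting users with too many recent jobs."""
    if count_recent_jobs(user_email) > MAX_RECENT_JOBS:
        return 'pbs:///gx_throttled//'
    return 'pbs:///gx_normal//'


def make_rule(count_recent_jobs):
    """Bind the lookup so the rule takes only user_email, like the rules above."""
    def default_runner(user_email):
        return throttled_runner(user_email, count_recent_jobs)
    return default_runner
```

Keeping the lookup separate from the decision means the threshold logic can be tried out with a plain dictionary of counts before any real database query exists.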
-John

------------------------------------------------
John Chilton
Senior Software Developer
University of Minnesota Supercomputing Institute
Office: 612-625-0917
Cell: 612-226-9223
On Sun, Jun 10, 2012 at 7:33 AM, John Chilton <chilton@msi.umn.edu> wrote:
My previous e-mail was a technical description of how it worked; I think maybe that is why it didn't generate the excitement I had hoped :). I think describing some concrete use cases instead might be better. So here are six cool things you can do with dynamic job runners.
...
1) Change maximum walltime based on job parameters or file sizes.
Let's say you want to change the max walltime of BlastN based on the size of the input query. First you would add the line ncbi_blastn_wrapper=dynamic:///python to universe_wsgi.ini. Next, in 200_runners.py, you would add a function such as the following:

    import os

    def ncbi_blastn_wrapper(job):
        inp_data = dict( [ ( da.name, da.dataset ) for da in job.input_datasets ] )
        inp_data.update( [ ( da.name, da.dataset ) for da in job.input_library_datasets ] )
        query_file = inp_data[ "query" ].file_name
        query_size = os.path.getsize( query_file )
        if query_size > 1024 * 1024:
            return 'pbs:////-l walltime=24:00:00/'
        else:
            return 'pbs:////-l walltime=12:00:00/'
So these wall time estimates are in a separate file from the tool wrapper - that seems a good idea, as they will depend on the local cluster node power. And they can be elaborated on as needed (e.g. for BLAST, consider both the number of query sequences and the number of subject sequences - i.e. the database size).

Presumably the exact same approach could handle this:

(7) Change job priority or queue depending on job details.

For potentially memory-intensive tasks like assembly, jobs could be allocated to a big-memory queue if the input read count is large, or to the normal (lower) memory queue for smaller jobs like bacteria or viruses.

Or, in a slight variation on your wall time snippet, the code could (also) specify that big jobs go in the low-priority queue while small jobs go in the high-priority queue - either using named queues or priority settings, depending on the cluster setup. That was something I was hoping to do:

http://lists.bx.psu.edu/pipermail/galaxy-dev/2012-June/009962.html

With a heterogeneous cluster setup this sort of thing would be very helpful. If Nate is positive about including your work soon, that is very good news :)

Peter
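The queue selection described in (7) could be sketched along the following lines, mirroring the BlastN walltime rule. This is an untested outline under stated assumptions: the queue name gx_bigmem, the 1 GB cut-off, and the tool id my_assembler are all hypothetical, and total input file size is used as a rough stand-in for read count.

```python
import os

# Hypothetical queue names and threshold; adjust for the local cluster.
BIG_MEMORY_QUEUE = 'gx_bigmem'
NORMAL_QUEUE = 'gx_normal'
BIG_INPUT_BYTES = 1024 ** 3  # 1 GB, an arbitrary cut-off


def queue_for_size(total_bytes):
    """Pick a runner URL from the combined size of a job's inputs."""
    if total_bytes > BIG_INPUT_BYTES:
        queue = BIG_MEMORY_QUEUE
    else:
        queue = NORMAL_QUEUE
    return 'pbs:///%s//' % queue


def my_assembler(job):
    """Rule for a hypothetical tool id my_assembler, in the style of the BlastN rule."""
    # File size is only a proxy for read count (an assumption of this sketch).
    sizes = [os.path.getsize(da.dataset.file_name) for da in job.input_datasets]
    return queue_for_size(sum(sizes))
```

Swapping the two queue constants would give the other variation mentioned above, where big jobs land in a low-priority queue and small jobs in a high-priority one.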
Participants (2): John Chilton, Peter Cock