Dynamic job runner status? Could I use it in combination with job splitting?
Hi,

I've been researching the possibility of using the dynamic job runner in combination with job splitting for BLAST jobs. My main interest is to create a rule where both the size of the query and the size of the database are taken into consideration in order to select the DRMAA and splitting options.

My first question is: what is the status of the dynamic job runner? I found these two threads, but it is not clear to me whether this feature is part of galaxy-dist already:

http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-October/007160.html
http://lists.bx.psu.edu/pipermail/galaxy-dev/2012-June/010080.html

My second question is whether there is any documentation, other than the threads above, on configuring something like what I describe. In any case, there is very good information from John in those emails, and I think that should get me started at least.

Cheers, Carlos
On Nov 1, 2012, at 4:22 PM, Carlos Borroto <carlos.borroto@gmail.com> wrote:
Hi,
I've been researching the possibility of using the dynamic job runner in combination with job splitting for BLAST jobs. My main interest is to create a rule where both the size of the query and the size of the database are taken into consideration in order to select the DRMAA and splitting options.
My first question is: what is the status of the dynamic job runner? I found these two threads, but it is not clear to me whether this feature is part of galaxy-dist already: http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-October/007160.html http://lists.bx.psu.edu/pipermail/galaxy-dev/2012-June/010080.html
The dynamic runner is alive and well thanks to the continuous work of its author, John Chilton, and others. For instance, yesterday John committed code to galaxy-central that propagates tracebacks from the dynamic job runner's job rules script to the logs. It also allows us to raise explicit exceptions from the job rules script, which makes it possible to fail a job with an arbitrary error message in the job history, and makes the job rules much easier to debug.
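As a minimal sketch of that pattern (the rule name, the size threshold, and the use of a plain Exception here are illustrative assumptions, not the exact Galaxy API):

import os

def my_rule(job):
    # Hypothetical rule: refuse oversized queries with a readable error
    # instead of submitting a job that is doomed to fail on the cluster.
    inp_data = dict([(da.name, da.dataset) for da in job.input_datasets])
    query_size = os.path.getsize(inp_data["query"].file_name)
    if query_size > 2 * 1024 ** 3:  # 2 GB, an arbitrary cutoff
        # The exception's message is what the user sees on the failed job.
        raise Exception("Query is too large for this Galaxy instance.")
    return 'drmaa:///'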
My second question is whether there is any documentation, other than the threads above, on configuring something like what I describe. In any case, there is very good information from John in those emails, and I think that should get me started at least.
John's information is spot on. I can also share my rules script with you. I've completely switched our production instance to the dynamic runner. It allows me to set the resource requests based on datasets (the BLAST query at the moment, but I'm getting to the point of using the database as well) and on the resources allocated to the user's group or a secondary group, as well as check internal ACLs for tool usage restrictions and send jobs to specific queues or compute nodes as necessary.

I think your goals will inevitably bring you to the dynamic runner, and that's not a bad thing. The only large obstacle in the job handling area, as I perceive it, is the lack of a generic way to plug some <input> statements into any tool's interface without changing all the tool definition files, but that's beyond the dynamic runner and is something for the Galaxy Core Team to consider.

Cheers, Alex
Hi Alex, On Fri, Nov 2, 2012 at 9:53 AM, Oleksandr Moskalenko <om@hpc.ufl.edu> wrote:
The dynamic runner is alive and well thanks to the continuous work of its author, John Chilton, and others. For instance, yesterday John committed code to galaxy-central that propagates tracebacks from the dynamic job runner's job rules script to the logs. It also allows us to raise explicit exceptions from the job rules script, which makes it possible to fail a job with an arbitrary error message in the job history, and makes the job rules much easier to debug.
Great!
John's information is spot on. I can also share my rules script with you. I've completely switched our production instance to the dynamic runner. It allows me to set the resource requests based on datasets (the BLAST query at the moment, but I'm getting to the point of using the database as well) and on the resources allocated to the user's group or a secondary group, as well as check internal ACLs for tool usage restrictions and send jobs to specific queues or compute nodes as necessary.
I think your goals will inevitably bring you to the dynamic runner, and that's not a bad thing.
Please, could you share your rules related to BLAST? I would love to take a look at them. Thanks, Carlos
On Fri, Nov 2, 2012 at 3:58 PM, Carlos Borroto <carlos.borroto@gmail.com> wrote:
On Fri, Nov 2, 2012 at 9:53 AM, Oleksandr Moskalenko <om@hpc.ufl.edu> wrote:
The dynamic runner is alive and well thanks to the continuous work of its author, John Chilton, and others. For instance, yesterday John committed code to galaxy-central that propagates tracebacks from the dynamic job runner's job rules script to the logs. It also allows us to raise explicit exceptions from the job rules script, which makes it possible to fail a job with an arbitrary error message in the job history, and makes the job rules much easier to debug.
Great!
First bump:

galaxy.jobs.handler DEBUG 2012-11-02 16:37:00,728 Loaded job runner: galaxy.jobs.runners.drmaa:DRMAAJobRunner
galaxy.jobs.handler ERROR 2012-11-02 16:29:00,415 Job runner is not loadable: galaxy.jobs.runners.dynamic
Traceback (most recent call last):
  File "/local/opt/galaxy/galaxy-dist/lib/galaxy/jobs/handler.py", line 375, in _load_plugin
    module = __import__( module_name )
ImportError: No module named dynamic

In my universe_wsgi.ini:

start_job_runners = drmaa,dynamic

I get the same error with -dist and -central. Alex, are you sure the dynamic job runner is already included? Should I pull something from somewhere else?

Thanks, Carlos
On Nov 2, 2012, at 4:42 PM, Carlos Borroto <carlos.borroto@gmail.com> wrote:
On Fri, Nov 2, 2012 at 3:58 PM, Carlos Borroto <carlos.borroto@gmail.com> wrote:
On Fri, Nov 2, 2012 at 9:53 AM, Oleksandr Moskalenko <om@hpc.ufl.edu> wrote:
The dynamic runner is alive and well thanks to the continuous work of its author, John Chilton, and others. For instance, yesterday John committed code to galaxy-central that propagates tracebacks from the dynamic job runner's job rules script to the logs. It also allows us to raise explicit exceptions from the job rules script, which makes it possible to fail a job with an arbitrary error message in the job history, and makes the job rules much easier to debug.
Great!
First bump:
galaxy.jobs.handler DEBUG 2012-11-02 16:37:00,728 Loaded job runner: galaxy.jobs.runners.drmaa:DRMAAJobRunner
galaxy.jobs.handler ERROR 2012-11-02 16:29:00,415 Job runner is not loadable: galaxy.jobs.runners.dynamic
Traceback (most recent call last):
  File "/local/opt/galaxy/galaxy-dist/lib/galaxy/jobs/handler.py", line 375, in _load_plugin
    module = __import__( module_name )
ImportError: No module named dynamic
In my universe_wsgi.ini:
start_job_runners = drmaa,dynamic
I get the same error with -dist and -central. Alex, are you sure the dynamic job runner is already included? Should I pull something from somewhere else?
Hi Carlos,

The dynamic job runner is a layer on top of either the pbs or the drmaa runner that actually submits the job:

start_job_runners = drmaa
default_cluster_job_runner = dynamic:///python/default_runner

where "default_runner" is a procedure you define in the job rules script. Then you can set the correspondence per tool, as in:

upload1 = dynamic:///python/upload1
ncbi_blastn_wrapper = dynamic:///python/ncbi_blastn
ncbi_blastp_wrapper = dynamic:///python/ncbi_blastp

and so on, or just figure out the tool_id in the default_runner procedure.

Regards, Alex
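P.S. A minimal sketch of such a default_runner procedure (the native resource string is a placeholder for whatever your cluster expects, not a recommendation):

def default_runner(tool_id, job):
    # Route one known tool to a larger allocation; everything else goes
    # to the plain drmaa runner with no extra options.
    if tool_id.startswith('ncbi_blastn_wrapper'):
        return 'drmaa://-l nodes=1:ppn=4/'
    return 'drmaa:///'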
On Nov 2, 2012, at 3:58 PM, Carlos Borroto <carlos.borroto@gmail.com> wrote:
Please, could you share your rules related to BLAST? I would love to take a look at them.
Thanks, Carlos
Here is the blastn rule procedure code and the relevant snippet of the default runner procedure. I just added the database-based multiplier, so this part is very simple at the moment. As an example, I set a bogus multiplier of "4" for the "nt_*" databases.

def ncbi_blastn(job):
    nodes = 1
    ppn = 4
    walltime = '167:00:00'
    # Collect the job's input datasets by name so we can locate the query.
    inp_data = dict([(da.name, da.dataset) for da in job.input_datasets])
    inp_data.update([(da.name, da.dataset) for da in job.input_library_datasets])
    query_file = inp_data["query"].file_name
    query_size = os.path.getsize(query_file)
    # Collect the tool parameters to find out which database was selected.
    inp_params = dict([(da.name, da.value) for da in job.parameters])
    db_dict = eval(inp_params['db_opts'])
    db = db_dict['database']
    db_multiplier = 1
    if db.startswith('nt'):
        db_multiplier = 4
    # Scale the per-process memory request with the query size (in bytes).
    if query_size <= 20 * 1024 * 1024:
        pmem = 500
        pmem_unit = 'mb'
    elif query_size > 20 * 1024 * 1024 and query_size <= 50 * 1024 * 1024:
        pmem = 750
        pmem_unit = 'mb'
    elif query_size > 50 * 1024 * 1024 and query_size <= 100 * 1024 * 1024:
        pmem = 1500
        pmem_unit = 'mb'
    elif query_size > 100 * 1024 * 1024 and query_size <= 500 * 1024 * 1024:
        pmem = 2
        pmem_unit = 'gb'
    elif query_size > 500 * 1024 * 1024 and query_size <= 1000 * 1024 * 1024:
        pmem = 4
        pmem_unit = 'gb'
    elif query_size > 1000 * 1024 * 1024 and query_size <= 2000 * 1024 * 1024:
        pmem = 10
        pmem_unit = 'gb'
    elif query_size > 2000 * 1024 * 1024:
        pmem = 20
        pmem_unit = 'gb'
        log.debug('OM: blastn query size is in the bigmem category: %s bytes\n' % (query_size))
    else:
        # Fallback default; the ranges above already cover all sizes.
        pmem = 5
        pmem_unit = 'gb'
    if db_multiplier > 1:
        pmem = int(pmem * db_multiplier)
    pmem_str = "%d%s" % (pmem, pmem_unit)
    log.debug('OM: blastn query: %s bytes, db: %s, pmem: %s\n' % (query_size, db, pmem_str))
    return {'nodes': nodes, 'ppn': ppn, 'pmem': pmem_str, 'walltime': walltime}

def default_runner(tool_id, job):
    ...
    elif tool_id_src.startswith('ncbi_blastn_wrapper'):
        request = ncbi_blastn(job)
    ...
    drmaa = 'drmaa://%s%s%s/' % (queue_str, group_str, request_str)
    return drmaa

Regards, Alex
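P.S. For completeness, a sketch (not my actual code) of how the request dictionary returned above could be flattened into the request_str used when building the drmaa URL, assuming Torque-style native options:

def request_to_str(request):
    # {'nodes': 1, 'ppn': 4, 'pmem': '500mb', 'walltime': '167:00:00'}
    # -> "-l nodes=1:ppn=4,pmem=500mb,walltime=167:00:00"
    return ('-l nodes=%(nodes)d:ppn=%(ppn)d,'
            'pmem=%(pmem)s,walltime=%(walltime)s' % request)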
On Fri, Nov 2, 2012 at 6:03 PM, Oleksandr Moskalenko <om@hpc.ufl.edu> wrote:
Here is the blastn rule procedure code and the relevant snippet of the default runner procedure. I just added the database-based multiplier, so this part is very simple at the moment. As an example, I set a bogus multiplier of "4" for the "nt_*" databases.
[... blastn rule procedure and default_runner snippet quoted above snipped ...]
Hi Alex,

This is great and definitely helped me get going! I found a few issues related to my local configuration. For instance, I'm using the ncbi_blastn_wrapper that was migrated to the Tool Shed, so I had to use:

elif 'ncbi_blastn_wrapper' in tool_id_src:

instead of:

elif tool_id_src.startswith('ncbi_blastn_wrapper'):

The id for the tool installed from the Tool Shed is:

toolshed.g2.bx.psu.edu/repos/devteam/ncbi_blast_plus/ncbi_blastn_wrapper/0.0.13

Hopefully this won't break later on.

I also need to go back and do a better configuration of our local grid engine (SGE), as I did only a very bare-bones installation and I'm running into this error:

DeniedByDrmException: code 17: unknown resource "nodes"

which I realize is a configuration issue in my SGE.

Last, and this was my mistake: I didn't initially realize that the example you shared here assumes all tools will call default_runner(), which in turn calls a specific function to figure out the DRMAA options to set. I was trying to use the lines from your previous email:

ncbi_blastn_wrapper = dynamic:///python/ncbi_blastn
ncbi_blastp_wrapper = dynamic:///python/ncbi_blastp

But note to myself and anybody else following the thread: the function being called from universe_wsgi.ini needs to return a proper drmaa/pbs URL.

Thanks again!, Carlos
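P.S. A minimal sketch of what I mean (a hypothetical per-tool rule wired directly from universe_wsgi.ini; the resource string is a placeholder):

def ncbi_blastn(tool_id, job):
    # Referenced as: ncbi_blastn_wrapper = dynamic:///python/ncbi_blastn
    # Unlike the request-dictionary helper above, a rule called directly
    # from universe_wsgi.ini must return a complete runner URL.
    return 'drmaa://-l nodes=1:ppn=4,pmem=2gb,walltime=167:00:00/'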
On Nov 5, 2012, at 2:15 PM, Carlos Borroto <carlos.borroto@gmail.com> wrote:
On Fri, Nov 2, 2012 at 6:03 PM, Oleksandr Moskalenko <om@hpc.ufl.edu> wrote:
Here is the blastn rule procedure code and the relevant snippet of the default runner procedure. I just added the database-based multiplier, so this part is very simple at the moment. As an example, I set a bogus multiplier of "4" for the "nt_*" databases.
Hi Alex,
This is great and definitely helped me get going! I found a few issues related to my local configuration. For instance, I'm using the ncbi_blastn_wrapper that was migrated to the Tool Shed, so I had to use: elif 'ncbi_blastn_wrapper' in tool_id_src:
Instead of: elif tool_id_src.startswith('ncbi_blastn_wrapper'):
The id for the tool installed from the Tool Shed is: toolshed.g2.bx.psu.edu/repos/devteam/ncbi_blast_plus/ncbi_blastn_wrapper/0.0.13
Hopefully this won't break later on.
I hear you. I thought of using regexes to avoid that sort of breakage, but instead started using the output of tool_id_src = tool_id.split("/")[-2] for my startswith matches. It works well enough.
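In other words, something like this hypothetical helper (the fallback for plain, non-Tool-Shed ids is my assumption):

def short_tool_id(tool_id):
    # "toolshed.g2.bx.psu.edu/repos/devteam/ncbi_blast_plus/ncbi_blastn_wrapper/0.0.13"
    # -> "ncbi_blastn_wrapper"; ids without "/" pass through unchanged.
    return tool_id.split("/")[-2] if "/" in tool_id else tool_id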
I also need to go back and do a better configuration of our local grid engine (SGE), as I did only a very bare-bones installation and I'm running into this error: DeniedByDrmException: code 17: unknown resource "nodes"
which I realize is a configuration issue in my SGE.
Right, my code is Torque/MOAB specific. You would need to rewrite the resource requests to use "pe", "slots", "h_vmem", and "h_rt" instead, depending on your GE setup. It's an easy change.
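For example, a minimal sketch of an SGE-flavoured variant (the "smp" parallel environment name and the exact native-option syntax are assumptions about your local setup, not tested code):

def ncbi_blastn_sge(tool_id, job):
    # Ask for 4 slots in a shared-memory PE, a per-slot memory limit,
    # and a hard wall-clock limit, expressed as SGE native options.
    native = '-pe smp 4 -l h_vmem=2G,h_rt=167:00:00'
    return 'drmaa://%s/' % native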
Last, and this was my mistake: I didn't initially realize that the example you shared here assumes all tools will call default_runner(), which in turn calls a specific function to figure out the DRMAA options to set. I was trying to use the lines from your previous email:

ncbi_blastn_wrapper = dynamic:///python/ncbi_blastn
ncbi_blastp_wrapper = dynamic:///python/ncbi_blastp
Yes, I switched from using separate runner lines to routing the jobs from the default runner as a more flexible approach.
But note to myself and anybody else following the thread: the function being called from universe_wsgi.ini needs to return a proper drmaa/pbs URL.
Thanks again!, Carlos
Also, the newest error-handling code depends on the changeset John put up in a merge request into galaxy-central. Regards, Alex