As use of our Galaxy installation picks up, we're getting a lot of requests for greater fairness and transparency in the Galaxy job runner area.

As I understand things, the primary tool Galaxy gives us to affect processing order and wait times with our Torque-based setup is the ability to map specific tools to different queues or to keep them on the local runner.

On one end of the spectrum, I could see a simple division: small/fast/light jobs on the local runner and big/heavy/slow jobs on a single cluster queue. At the other extreme, one could set up a queue per tool and use sophisticated queue management on the Torque side to balance capacity across tools, users, expected processing time, etc.

How are other sites handling this?

--
Ry4an Brase  612-626-6575
Software Developer, Application Development
University of Minnesota Supercomputing Institute
http://www.msi.umn.edu
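For illustration, the simple split described above can be expressed per tool in universe_wsgi.ini. The tool IDs below appear in the stock sample config, but the queue name and the exact form of the pbs:// runner URL are assumptions to verify against the cluster documentation for your Galaxy release:

    # Jobs go to the cluster unless a tool is overridden below.
    default_cluster_job_runner = pbs:///

    [galaxy:tool_runners]
    # Small/fast/light tools stay on the local runner.
    upload1 = local:///
    ucsc_table_direct1 = local:///
    # Heavy tools go to a dedicated Torque queue (hypothetical name).
    bowtie_wrapper = pbs:///galaxy_heavy/

On the Torque side, such a queue would be created and tuned with qmgr, e.g.:

    qmgr -c "create queue galaxy_heavy queue_type=execution"
    qmgr -c "set queue galaxy_heavy resources_default.walltime=24:00:00"
    qmgr -c "set queue galaxy_heavy enabled=true"
    qmgr -c "set queue galaxy_heavy started=true"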
Ry4an Brase wrote:
> As use of our Galaxy installation picks up, we're getting a lot of requests for greater fairness and transparency in the Galaxy job runner area.
> As I understand things, the primary tool Galaxy gives us to affect processing order and wait times with our Torque-based setup is the ability to map specific tools to different queues or to keep them on the local runner.
> On one end of the spectrum, I could see a simple division: small/fast/light jobs on the local runner and big/heavy/slow jobs on a single cluster queue. At the other extreme, one could set up a queue per tool and use sophisticated queue management on the Torque side to balance capacity across tools, users, expected processing time, etc.
> How are other sites handling this?
Hi Ry4an,

I'd prefer to keep most of the scheduling in the DRM (Torque, SGE, etc.) since that's what it's designed to do. That said, we want to make it as easy as possible to do this, and Galaxy currently only sort of has the ability to do it. By currently, I mean that you can set DRM parameters per-tool in the config file.

There are a couple of pieces that need to exist. For environments like our public site, where Galaxy users can't map one-to-one with system users, Galaxy itself needs to be able to limit the number of jobs a user can run on a particular cluster. Work on this component is under way.

In environments where Galaxy users *are* system users, Galaxy needs to do things that interact with the system, such as reading files from disk for upload, exporting files for download, and submitting cluster jobs as the real user. Writing this is near-ish to the top of my list.

There's a final piece which we've discussed here quite a few times but are not very close to implementing: a config language that would allow Galaxy to make decisions about which DRM parameters to set based on variables like input size or sequence count, the parameters selected, and so forth. A good example of where this is needed is in the mappers, which currently have a hardcoded multiprocessor setting of 4 that is almost certainly not appropriate for all environments. Ideally, Galaxy would be able to decide where to run the job and, based on that information, know how many threads/processes to start given the resources the job is allotted. I'd love to see this also be able to make assumptions about runtime so that DRM backfill could be properly employed, but this may not be possible, since most job runtimes are probably not a calculable function of the size of the input data and the selected parameters.

--nate
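To make that last idea concrete, a rule in such a config language might boil down to logic like the hypothetical Python sketch below. Nothing here is an existing Galaxy hook; the function name, runner URLs, and size thresholds are all invented for illustration:

    import os

    # Hypothetical sketch only -- no such hook exists in Galaxy today.
    # Pick a runner URL and a thread count for a mapping job based on
    # the total size of its input datasets.

    SMALL_JOB_BYTES = 64 * 1024 * 1024  # inputs under 64 MB run locally

    def choose_runner(input_paths):
        """Return a (runner_url, num_threads) pair for one job."""
        total = sum(os.path.getsize(p) for p in input_paths)
        if total < SMALL_JOB_BYTES:
            # Small inputs: local runner, single-threaded.
            return 'local:///', 1
        # Bigger inputs: a cluster queue, with the thread count matched
        # to the cores requested from the DRM instead of a hardcoded 4.
        threads = 8 if total > 1024 ** 3 else 4
        return 'pbs:///galaxy_heavy/', threads

The same decision point is also where a runtime estimate for backfill would be produced, if one could actually be computed.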
On Mar 15, 2011, at 9:27 AM, Nate Coraor wrote:
> Ry4an Brase wrote:
>> [...]
> Hi Ry4an,
>
> I'd prefer to keep most of the scheduling in the DRM (Torque, SGE, etc.) since that's what it's designed to do. That said, we want to make it as easy as possible to do this, and Galaxy currently only sort of has the ability to do it. By currently, I mean that you can set DRM parameters per-tool in the config file.
>
> There are a couple of pieces that need to exist. For environments like our public site, where Galaxy users can't map one-to-one with system users, Galaxy itself needs to be able to limit the number of jobs a user can run on a particular cluster. Work on this component is under way.
>
> In environments where Galaxy users *are* system users, Galaxy needs to do things that interact with the system, such as reading files from disk for upload, exporting files for download, and submitting cluster jobs as the real user. Writing this is near-ish to the top of my list.
>
> There's a final piece which we've discussed here quite a few times but are not very close to implementing: a config language that would allow Galaxy to make decisions about which DRM parameters to set based on variables like input size or sequence count, the parameters selected, and so forth. A good example of where this is needed is in the mappers, which currently have a hardcoded multiprocessor setting of 4 that is almost certainly not appropriate for all environments. Ideally, Galaxy would be able to decide where to run the job and, based on that information, know how many threads/processes to start given the resources the job is allotted. I'd love to see this also be able to make assumptions about runtime so that DRM backfill could be properly employed, but this may not be possible, since most job runtimes are probably not a calculable function of the size of the input data and the selected parameters.
>
> --nate
Are there issues open for these Galaxy changes? I would like to follow the development.

--
Glen L. Beane
Senior Software Engineer
The Jackson Laboratory
(207) 288-6153
Glen Beane wrote:
>> [...]
> Are there issues open for these Galaxy changes? I would like to follow the development.
Yes: https://bitbucket.org/galaxy/galaxy-central/issue/106/run-cluster-jobs-as-th...

--nate
participants (3):
- Glen Beane
- Nate Coraor
- Ry4an Brase