Galaxy fronting multiple clusters
Hello - I am working on standing up our own Galaxy installation. We would like to have Galaxy front multiple clusters, and I have some questions I was hoping someone could help with.

1) From reading other forum posts on this subject, it seems I need to minimally do the following ... is this correct? A) Have the Galaxy server w/ SGE register as a job-submitting host to the head node of each cluster. B) Configure a tool runner for each tool, per remote cluster?

2) When Galaxy submits a job, how would a backend remote cluster be selected? When running workflows, would the same cluster be used to run the entire workflow, or could the workflow span remote clusters?

3) I am trying to understand some of the source code; where is the logic that dispatches a job and selects a job runner to use?

4) Any other advice or steps needed in order to get Galaxy to front multiple remote clusters?

Thanks so much,
Ann
Hi Ann,

I have never set up a Galaxy instance fronting multiple clusters, but it's something I would like to explore. I have a dedicated cluster to run Galaxy jobs, and I have another, shared cluster to which I hope Galaxy can assign jobs when the dedicated cluster is too busy.

From my understanding of Galaxy, the tool runner for each tool is hardcoded in the universe.ini file, and if you do not configure a tool runner for a tool, Galaxy will use the default tool runner, which is determined by the default_cluster_job_runner parameter. I believe you can configure multiple job runners for a specific tool under [galaxy:tool_runners] in universe.ini - for instance, a different tool runner per cluster for the same tool - however, Galaxy will probably just use one of them, most likely the last one. So cluster selection for a tool is determined by the job runner, which is hardcoded in the universe.ini file. As a result, where a workflow runs is determined by the tools in the workflow: if every tool in the workflow is configured to use the same cluster, the workflow runs on that cluster; otherwise, it will span multiple clusters.

I think if you can configure the machine that runs the Galaxy instance to be a submit host of multiple clusters, then it's possible to have Galaxy front multiple clusters. For me, the biggest hurdle is how to give two clusters a shared storage space and how to configure a machine in one cluster to be a submit host of the other cluster.

Thanks,
Luobin
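For reference, the mapping Luobin describes looks roughly like this in the sample config of the era (the file is typically named universe_wsgi.ini; the tool ids and runner URLs below are illustrative, not prescriptive):

    # Runner plugins to start; 'drmaa' covers SGE via the DRMAA library.
    start_job_runners = drmaa

    # Fallback runner URL for any tool without an explicit entry below.
    default_cluster_job_runner = drmaa:///

    [galaxy:tool_runners]
    # One runner URL per tool id.  Repeating a tool id does not add a
    # second cluster; the later value simply overrides the earlier one.
    upload1 = local:///
    bowtie_wrapper = drmaa://-q long.q/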
Hi Ann,

This is all split per tool; there is no way to have a tool run on more than one cluster. We're hoping to expand our cluster loading support within the next year, however.

The method for setting the cluster options for a tool can be found at the bottom of the cluster wiki page:

http://wiki.g2.bx.psu.edu/Admin/Config/Performance/Cluster

With SGE this could be a bit tricky, as the SGE cell to use is pulled from the environment. It might be possible to make copies of the drmaa runner (lib/galaxy/jobs/runners/drmaa.py) and set SGE_ROOT as each runner starts up, but changing it as each runner starts may break runners which have already started, so this would need some testing.

--nate
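A minimal sketch of that copy-the-runner idea, assuming the module layout Nate names; the file name, subclass name, and SGE paths here are hypothetical, and the import-time behavior is exactly the fragility he warns about:

    # file: lib/galaxy/jobs/runners/drmaa_cluster_a.py (hypothetical copy)
    import os

    # The underlying DRMAA library reads SGE_ROOT/SGE_CELL when it is
    # first imported, so they must be set before anything imports drmaa.
    # This is also why changing them as a second runner starts can break
    # runners that have already started: the first import wins.
    os.environ['SGE_ROOT'] = '/opt/sge/cluster_a'   # assumed path
    os.environ['SGE_CELL'] = 'default'              # assumed cell name

    from galaxy.jobs.runners.drmaa import DRMAAJobRunner

    class ClusterADRMAAJobRunner(DRMAAJobRunner):
        """Identical to the stock DRMAA runner, but pinned to cluster A."""
        pass

    # The dispatcher discovers plugins by module filename plus __all__.
    __all__ = ['ClusterADRMAAJobRunner']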
Hello,

I'm trying to get an instance of Galaxy working where the application server - the web front end, as I understand it - is on a completely separate host to the SGE cluster the back end runs on. Is there any way of setting up Galaxy so that it uses ssh instead of having to have the application server and SGE share an NFS filesystem?

Is having a shared filesystem the only way I can get Galaxy to work in this kind of scenario? We have a general-purpose cluster which doesn't export filesystems for security reasons, and doesn't run shared web applications.

Thanks for any help you could offer.

Mike.

--
Mike Wallis +44(0)113 343 1880 ARC/HPC Systems Support ISS, University of Leeds, Leeds, LS2 9JT, UK
Hi Mike,

Unless you can add file-staging support to the DRMAA job runner, you need a shared filesystem. I had been under the impression that staging was not supported in DRMAA 1.0, although it might be possible with the native spec field.

That said, there are some tools which write files to the working directory, and those outputs are collected by the job runner; if the working directory is not on a shared filesystem, these tools won't work properly. Unfortunately, I don't have a list of such tools, but there are quite a few.

--nate
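For completeness: the DRMAA 1.0 IDL does define an optional transferFiles job-template attribute, so one can probe whether a given DRM supports it. A purely illustrative check using the drmaa-python binding (SGE's implementation is generally expected to reject it, which is what forces the shared filesystem):

    # Probe whether the local DRMAA implementation accepts the optional
    # file-staging attribute.  Illustrative only; requires drmaa-python
    # and a configured DRMAA library.
    import drmaa

    s = drmaa.Session()
    s.initialize()
    jt = s.createJobTemplate()
    jt.remoteCommand = '/bin/true'
    try:
        # 'i', 'o', 'e' request staging of stdin/stdout/stderr paths;
        # implementations without staging raise an exception here or
        # at submit time.
        jt.transferFiles = 'ioe'
        print('file staging accepted by this DRMAA implementation')
    except Exception as e:
        print('no file staging: %s' % e)
    s.deleteJobTemplate(jt)
    s.exit()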
Hi Nate,

Thanks for this! It has saved me wasting time barking up an inappropriate tree, and we'll rethink the plans.

Regards,

Mike.

--
Mike Wallis +44(0)113 343 1880 ARC/HPC Systems Support ISS, University of Leeds, Leeds, LS2 9JT, UK
Thanks Nate!

What types of plans do you have for multiple clusters, and do you have a committed timeline?

Since this forum post, I have spoken with a few other users of Galaxy and have been doing some poking around. I did already read the link you provided - thanks much. Unfortunately, from the doc it was not clear whether multiple tool runners could be defined and, if so, how the tool runner was selected. Your response below confirms what I was figuring out, thanks! (I.e., you can only have one runner per tool and pin the tool to a cluster, which is even harder for SGE since the connection info is environment-variable based.)

We are exploring writing our own job runner that might front multiple clusters and dispatch based on some simple rules. BTW, I have found the following website useful as I am new to SGE: http://arc.liv.ac.uk/SGE/howto/. The Galaxy architecture web page states that the job runners are extensible, and I am trying to understand the code pathways. I see that we would need to provide a subclass of BaseJobRunner and drop our job runner Python into lib/galaxy/jobs/runners. Where does the logic sit that calls into the appropriate job runner based on the configuration? Is there some documentation around with guidance on how to implement our own runners?

Finally, one other idea to discuss is having multiple Galaxy installations (one per cluster) with a shared database/file storage. I am wondering if this would be supported or has been done? Would there be potential for data corruption if there are multiple Galaxy instances, dispatching jobs to their local cluster, and updating the database (for example, risks of getting duplicate Galaxy ids, dataset ids, etc.)?

Thanks again,
Ann
Ann Black wrote:
What types of plans do you have for multiple clusters and do you have a committed timeline?
Hi Ann,

Unfortunately, no committed timeline yet. The plan is to make it possible to define many job targets, which could be different clusters or the same cluster with different job arguments. On top of that, there should be a language to describe how Galaxy should choose a job target that is more advanced than just a tool id.
We are exploring writing our own job runner that might front multiple clusters and dispatch based on some simple rules. Where does the logic sit that calls into the appropriate job runner based on the configuration? Is there some documentation around with guidance on how to implement our own runners?
No documentation, unfortunately. The piece you're looking for is in lib/galaxy/jobs/__init__.py, the DefaultJobDispatcher class. This will load runner "plugins" from the runners directory based on filename and a class name in the plugin's __all__ array (loading classes subclassed from BaseJobRunner probably would have been a good idea).
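A skeleton of what such a plugin might look like under that discovery mechanism; the module name, class name, and the exact put() contract here are assumptions to verify against DefaultJobDispatcher before building on them:

    # file: lib/galaxy/jobs/runners/multicluster.py (hypothetical)
    import logging

    from galaxy.jobs.runners import BaseJobRunner

    log = logging.getLogger(__name__)

    class MultiClusterJobRunner(BaseJobRunner):
        """Dispatches each job to one of several clusters by simple rules."""

        def put(self, job_wrapper):
            # DefaultJobDispatcher hands every ready job to put(); a real
            # implementation would pick a cluster here (e.g. by queue
            # depth) and submit via that cluster's DRMAA session.
            raise NotImplementedError('rule-based cluster selection goes here')

    # The dispatcher finds the class via the module filename plus __all__.
    __all__ = ['MultiClusterJobRunner']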
Finally, one other idea to discuss is having multiple Galaxy installations (one per cluster) with a shared database/file storage. Would there be potential for data corruption if there are multiple Galaxy instances, dispatching jobs to their local cluster, and updating the database?
This is safe as long as you do not track jobs in the database and do not enable job recovery. Eventually we plan to have the job dispatcher set each job's owning process, so that multiple runners can coexist with job recovery enabled.

--nate
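In config terms that amounts to something like the following in each instance's universe_wsgi.ini; a sketch - the option names follow the era's sample config, but verify them against universe_wsgi.ini.sample:

    # Every instance points at the same database and file store ...
    database_connection = postgres://galaxy@dbhost/galaxy   # assumed DSN
    file_path = /shared/galaxy/files                        # assumed path

    # ... but must not contend over job state:
    track_jobs_in_database = False
    enable_job_recovery = False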
Participants (4): Ann Black, Luobin Yang, Mike Wallis, Nate Coraor