Scaling and hardware requirements
On Wed, Sep 11, 2013 at 1:03 PM, Nikos Sidiropoulos <nikos.sidiro@gmail.com> wrote:
Hi all,
I have a couple of questions regarding a server setup dedicated to Galaxy.
The idea is to buy a 64-core server with 256 GB of RAM. From my experience I believe that Galaxy will be able to scale up to 64 CPUs, but I would like some more feedback on this. Also, is 4 GB of RAM per CPU core enough for NGS data (including de novo assembly)?
Bests, Nikos
Hi Nikos,
Is this going to be one server both for running Galaxy (which needs fairly low resources) and running jobs for Galaxy, like de novo assemblies (which need high resources)? I.e., you have one big machine only, no cluster?
For de novo assembly the RAM per core/CPU isn't important; it is the total RAM on the machine that matters. How much RAM you need depends on which assembler you use, the organism (both size and complexity), and the volume of data.
What you've described should be fine for bacterial assemblies and smaller eukaryotes - beyond that you'll need to give more details.
Peter
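The distinction between RAM per core and total RAM can be put as a quick check. This is only a minimal sketch: the function names are made up for illustration, and the peak-memory values and the 16 GB reserved for the Galaxy server and operating system are placeholders, not measurements of any assembler.

```python
def ram_per_core_gb(total_ram_gb: float, cores: int) -> float:
    """Average RAM per core; a purchasing metric, not a per-job limit."""
    return total_ram_gb / cores


def job_fits(total_ram_gb: float, job_peak_ram_gb: float,
             reserved_gb: float = 0.0) -> bool:
    """True if one job's peak memory fits in what is left after reserving
    some RAM for the Galaxy server and the operating system."""
    return job_peak_ram_gb <= total_ram_gb - reserved_gb


if __name__ == "__main__":
    total_ram_gb, cores = 256, 64  # the server described in the thread
    print(ram_per_core_gb(total_ram_gb, cores))  # 4.0

    # Placeholder peak-memory values (assumptions), NOT assembler benchmarks.
    for peak_gb in (32, 200, 400):
        print(peak_gb, job_fits(total_ram_gb, peak_gb, reserved_gb=16))
```

What decides whether a single assembly runs is its peak memory against the machine total, which is why the 4 GB per core figure says little on its own.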
Hi Peter,
It's going to be one big machine, running both the Galaxy server and the jobs. It's going to be a multi-process configuration. If that idea is terribly bad, please let me know so I can give back the feedback.
De novo assembly can also be for the human/mouse genome.
Bests, Nikos
On Wed, Sep 11, 2013 at 1:19 PM, Nikos Sidiropoulos <nikos.sidiro@gmail.com> wrote:
Hi Peter
It's going to be one big machine, running both Galaxy server and the jobs. It's going to be a multi-process configuration. If that idea is terribly bad please let me know so I can give back the feedback.
It should be OK for a small number of users, but at some point if usage increases you will have to switch to a multi-machine setup (or just a single even bigger machine). If you do have access to a cluster, you can choose which tools are run locally and which are run on the cluster. Don't forget to budget for enough disk storage as well - Galaxy usage tends to generate a lot of intermediate files which your users may not routinely delete from their histories.
De novo assembly can also be for the human/mouse genome.
You may not have enough RAM; however, I have no personal experience doing de novo assemblies of mouse/human to guide you here. Someone else on the list should be able to help, or since this is a general question, try searching online, e.g. at seqanswers.com.
I'm not sure how easy it would be to set up your Galaxy to only allow one de novo assembly at a time - which would seem a sensible precaution given you may have multiple users (or the same user) trying to run assemblies in parallel.
Peter
I'm not sure how easy it would be to set up your Galaxy to only allow one de novo assembly at a time - which would seem a sensible precaution given you may have multiple users (or the same user) trying to run assemblies in parallel.
I guess I could dedicate a handler to run this specific tool, so that the first job has to complete before the tool can run again.
Thank you for all the help and suggestions!
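The "one assembly at a time" idea can be illustrated outside of Galaxy with a semaphore. This is a minimal sketch of the concept only: it is not Galaxy's job configuration, and `LimitedRunner`, `fake_assembly`, and `submit_all` are hypothetical names invented for the example.

```python
import threading
import time


class LimitedRunner:
    """Runs submitted jobs, allowing at most `max_concurrent` at once."""

    def __init__(self, max_concurrent: int = 1):
        self._slots = threading.Semaphore(max_concurrent)
        self._lock = threading.Lock()
        self._running = 0
        self.peak_running = 0  # highest number of jobs seen running at once

    def run(self, job, *args):
        # Block here until a slot is free, then run the job.
        with self._slots:
            with self._lock:
                self._running += 1
                self.peak_running = max(self.peak_running, self._running)
            try:
                return job(*args)
            finally:
                with self._lock:
                    self._running -= 1


def fake_assembly(name: str) -> None:
    """Stand-in for a long-running assembly (hypothetical placeholder)."""
    time.sleep(0.01)


def submit_all(runner: LimitedRunner, names) -> int:
    """Submit all jobs from separate threads; return the peak concurrency."""
    threads = [threading.Thread(target=runner.run, args=(fake_assembly, n))
               for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return runner.peak_running


if __name__ == "__main__":
    print(submit_all(LimitedRunner(max_concurrent=1), ["a", "b", "c"]))
```

With `max_concurrent=1`, later submissions wait until the running job finishes, which is the behaviour described above for a dedicated handler.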
Can I put in a similar question on top of this: how many resources do you need for re-sequencing of a mammalian genome (assembly and variant detection), one job at a time? E.g., how much RAM etc. if I want the re-sequencing SAM file for 30-fold coverage to be done in 48 hours?
Gerald
Gerald Bothe 32 Plum Hill Road East Lyme, CT 06333 (860) 451 8776
___________________________________________________________
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
On Thu, Sep 12, 2013 at 10:00 AM, Jeremy Goecks <jeremy.goecks@emory.edu> wrote:
This isn't an easy question to answer. Here's why:
* there is significant variation in mammalian genome size; of course, larger genomes require more resources, but the relationship is difficult to quantify;
* assembly can take anywhere from a day to a week depending on software and resource choices;
* variant detection can take anywhere from 1 to 4 days depending on software used;
* completing assembly and variant detection in 48 hours is something that is challenging for even the most advanced genomics labs.
To answer your question, I'd start with 256-512 GB of RAM on a machine and 36-72 compute cores across a cluster. This is simply a guess, of course. Before investing in hardware, you might try your analysis on the cloud (usegalaxy.org/cloud) to get a sense of the resources needed.
Good luck, J.
On Sep 11, 2013, at 8:34 AM, Gerald Bothe wrote:
Can I put in a similar question on top of this: How much resources do you need for re-sequencing of a mammalian genome (assembly and variant detection), one job at a time? E.g. how much RAM etc. if I want the re-sequencing SAM file of a 30-fold coverage be done in 48 hours?
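The data volume behind the 30-fold coverage question can be estimated with simple arithmetic. This is a rough sketch under stated assumptions: a genome size of about 3.1 billion bases (approximately human; other mammals differ), and it only computes how many bases must be processed per hour to meet the deadline. It says nothing about how much RAM or how many cores any particular tool needs.

```python
def bases_for_coverage(genome_size_bp: int, coverage: int) -> int:
    """Total sequenced bases needed for the given fold coverage."""
    return genome_size_bp * coverage


def required_bases_per_hour(total_bases: int, deadline_hours: float) -> float:
    """Average throughput the whole pipeline must sustain to hit the deadline."""
    return total_bases / deadline_hours


if __name__ == "__main__":
    GENOME_SIZE_BP = 3_100_000_000  # assumption: roughly a human-sized genome
    COVERAGE = 30                   # from the question
    DEADLINE_HOURS = 48             # from the question

    total = bases_for_coverage(GENOME_SIZE_BP, COVERAGE)
    print(f"total bases: {total:,}")
    print(f"bases per hour: {required_bases_per_hour(total, DEADLINE_HOURS):,.0f}")
```

Under these assumptions the pipeline has to get through roughly 93 billion bases, or about 1.9 billion bases per hour end to end, which gives a sense of why the 48-hour target is described as challenging.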
Is there any website or wiki that discusses scaling and hardware requirements (RAM, processors, hard drives) in more detail? For example:
- re-sequencing (e.g. align John Doe's sequencing FASTQ files to the standard human genome, make SAM, make BAM, and determine the variants)
- metagenomics (e.g. sequence John Doe's stool to obtain a) species by 16S or b) bacterial genes)
- de novo bacterial genome (e.g. isolate an unknown germ from the stool and assemble its genome; DNA only 90% homologous to any known bacteria, let's say the genome has 4 Mbases)
- all for an isolated instance (one single lab) or a production instance (whole institute or more)
Pointers to anything like that would be great, or could anybody give their own 2 cents?
Gerald
participants (4)
- Gerald Bothe
- Jeremy Goecks
- Nikos Sidiropoulos
- Peter Cock