Michael, I don't have any experience with Condor, but we're finding that the Galaxy framework scales very well - mostly because it doesn't do any of the computationally intense work itself - it hands that off through the job runner.

Our internal Galaxy works fine with very large datasets (6k subjects on Affy 6.0 SNP chips; 9.6k subjects on Affy 5.0 SNP chips...). Tools take a while to run (!), but Galaxy itself is more or less indifferent to file size because it only stores references (e.g. paths) to the disk files in the database - not the actual gigagobs of data. A collection of 100 GB files takes about the same space in the Galaxy database tables as a collection of 1 KB ones, as far as I can tell. A user's experience of tool operation will obviously be affected by physically shuffling large datafiles around for the cluster backend when a tool is run, so for very large datasets I suspect the cluster architecture - and the way datasets are made available to cluster nodes for processing - is a key issue.

On backends, I believe the party line is that both PostgreSQL and MySQL are fully supported. We've used MySQL as our backend for nearly 2 years without any problems with released Galaxy versions - all 3 database backends are now auto-tested before release, AFAIK. Arguably, PostgreSQL might be a better choice technically, and operationally it's what runs the primary Galaxy site, so it's likely to work! My group remains familiar and comfortable with MySQL and doesn't have the energy to swap over. If you were going to swap, do it before you build a large userbase, unless you have a bored DBA available to unload and reload a set of Galaxy history and user tables mid-stream.

On Thu, Aug 21, 2008 at 2:00 AM, <galaxy-user-request@bx.psu.edu> wrote:
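For what it's worth, choosing the backend is a one-line change in Galaxy's config file. The sketch below assumes a roughly current-era `universe_wsgi.ini` and the SQLAlchemy URL format; the hostnames, usernames, and database names are placeholders, so check your own config for the exact option name and syntax:

```ini
# universe_wsgi.ini - database backend selection (sketch; placeholder
# credentials and hostnames - substitute your own)

# Default out-of-the-box SQLite database:
#database_connection = sqlite:///./database/universe.sqlite

# MySQL backend (what we run):
database_connection = mysql://galaxy_user:secret@localhost/galaxy_db

# PostgreSQL alternative (what the public Galaxy site runs):
#database_connection = postgres://galaxy_user:secret@localhost/galaxy_db
```

Whichever you pick, create the database and grant the Galaxy user rights on it before first startup, and Galaxy will build its tables on launch.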
1. newbie questions (Michael Rusch)
----------------------------------------------------------------------
Message: 1
Date: Tue, 19 Aug 2008 16:53:27 -0500
From: Michael Rusch <mcrusch@wisc.edu>
Subject: [galaxy-user] newbie questions
To: galaxy-user@bx.psu.edu
Message-ID: <8085BDD01E3A4A40A0F01E5C73BBA505@gel.local>
Content-Type: text/plain; charset="us-ascii"
We're strongly considering switching to Galaxy from a piece of home-built software that we're in the process of developing. So, I have a couple of newbie questions to see what people's experience is.
How does Galaxy scale? Does anybody have experience with scaling to thousands of datasets, or working with datasets in the hundreds of megabytes?
We have traditionally done most of our work using a MySQL backend. I haven't (yet) received the green light from our sysadmin to install Postgres, and I'm wondering if anybody has any experience running on MySQL. Is it possible? Are there pitfalls?
Has anybody by any chance implemented support for Condor as a job scheduler?
-- python -c "foo = map(None,'moc.liamg@surazal.ssor'); foo.reverse(); print ''.join(foo)"