I've been asked to set up a local Galaxy installation specifically for large datasets (tens of terabytes).
Is there a list of default locations where Galaxy puts data? As an admin, that would be my first question, but it's not obvious from the documentation, and for large datasets, it's important to know.
(During testing, with Galaxy in the default location, /home/galaxy, an attempt to upload and decompress a 50GB file wreaked havoc on the server due to heavy I/O. Changing the tmp directory so it wasn't reading and writing massive amounts of data to the same filesystem at the same time helped, but that would have been nice to know ahead of time.)
Is the database used for datasets at all or just user account data? In other words, if users are crunching terabytes of data, do I need to worry about the amount of space on the filesystem that hosts the database?
What exactly are the disadvantages of using MySQL over PostgreSQL? Several places in the docs state that it is preferred but not why. Is it a bigger problem than installing another database that I have no experience with over one that is already installed and with which I am already familiar?
Is an MPI configuration necessary for getting full use out of a multicore system? The docs seem to indicate that Galaxy will use multiple cores ("Without a cluster, you'll be limited to the number of cores in your server...") but take pains to say that the GIL won't allow more than a single thread ("This means that regardless of the number of cores in your server, Galaxy can only use one" and "having a multi-core system will not improve the Galaxy framework's performance out of the box since Galaxy can use (at most) one core at a time").
We have a 48-core machine, and there will only be two or three users.
Thanks,
--steve
On Jul 23, 2012, at 11:02 AM, Steven Peckins wrote:
I've been asked to set up a local Galaxy installation specifically for large datasets (tens of terabytes).
Is there a list of default locations where Galaxy puts data? As an admin, that would be my first question, but it's not obvious from the documentation, and for large datasets, it's important to know.
(During testing, with Galaxy in the default location, /home/galaxy, an attempt to upload and decompress a 50GB file wreaked havoc on the server due to heavy I/O. Changing the tmp directory so it wasn't reading and writing massive amounts of data to the same filesystem at the same time helped, but that would have been nice to know ahead of time.)
Hi Steve,
I'm glad that you saw the option to change the temporary directory. You may also want to change job_working_directory, which some tools will use for scratch space during execution.
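As a rough way to check whether the temporary directory, the job working directory, and the dataset storage ended up on the same filesystem, something like the following sketch could help. It is not part of Galaxy; the function names and the example paths in the comment are hypothetical, and it only compares the device IDs reported by the OS.

```python
import os
from collections import defaultdict


def group_by_filesystem(named_paths):
    """Group labels by the device ID of the filesystem each path lives on."""
    groups = defaultdict(list)
    for name, path in named_paths.items():
        # st_dev identifies the device (mounted filesystem) holding the path.
        groups[os.stat(path).st_dev].append(name)
    return dict(groups)


def shared_filesystems(named_paths):
    """Return sorted lists of labels that share a filesystem with another label."""
    return [
        sorted(names)
        for names in group_by_filesystem(named_paths).values()
        if len(names) > 1
    ]


# Example call (paths are assumptions, substitute your own):
# shared_filesystems({
#     "tmp": "/scratch/galaxy/tmp",
#     "job_working_directory": "/scratch/galaxy/jobs",
#     "datasets": "/data/galaxy/files",
# })
```

Any group it returns is a set of directories that will compete for I/O on the same filesystem.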
Is the database used for datasets at all or just user account data? In other words, if users are crunching terabytes of data, do I need to worry about the amount of space on the filesystem that hosts the database?
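No, the database isn't used for datasets. The dataset files themselves are stored on disk, so the filesystem that hosts the database doesn't need room for the data being crunched.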
What exactly are the disadvantages of using MySQL over PostgreSQL? Several places in the docs state that it is preferred but not why. Is it a bigger problem than installing another database that I have no experience with over one that is already installed and with which I am already familiar?
We've simply had fewer problems with Postgres, and we use it here for development and in production, so bugs in our code that behave differently between Postgres and MySQL will be fixed much more quickly for Postgres (most likely before they even make it out into the wild).
If you prefer MySQL, you can certainly use it.
Is an MPI configuration necessary for getting full use out of a multicore system? The docs seem to indicate that Galaxy will use multiple cores ("Without a cluster, you'll be limited to the number of cores in your server...") but take pains to say that the GIL won't allow more than a single thread ("This means that regardless of the number of cores in your server, Galaxy can only use one" and "having a multi-core system will not improve the Galaxy framework's performance out of the box since Galaxy can use (at most) one core at a time").
MPI isn't necessary. None of the provided tools make use of it, nor does the Galaxy framework. Galaxy will use multiple cores to run tools - as many as you configure in the local job runner or in the cluster scheduler. The Galaxy server process itself can only use one core, but if you set 'set_metadata_externally = True' in the config, that isn't likely to be a problem with only a few users.
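To illustrate why the GIL doesn't limit tool execution: each tool runs as its own operating-system process, so a single Python process can keep several cores busy just by launching and waiting on children. The sketch below is not Galaxy's job runner; `run_jobs` and its `workers` parameter are hypothetical names used only to show the idea.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_jobs(commands, workers):
    """Run each command as a separate OS process, at most `workers` at a time.

    The threads here only wait on child processes, so the GIL is not a
    bottleneck; the children are free to run on different cores.
    """
    def run_one(command):
        # Discard the child's output; only the exit code is returned.
        return subprocess.run(command, stdout=subprocess.DEVNULL).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of `commands` in the results.
        return list(pool.map(run_one, commands))


# Example call: three trivial "tools", two running concurrently.
# run_jobs([[sys.executable, "-c", "pass"]] * 3, workers=2)
```

The number of workers plays the role of the concurrency limit you would configure for the local job runner.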
--nate
We have a 48-core machine, and there will only be two or three users.
Thanks,
--steve