I've been asked to set up a local Galaxy installation specifically for
large datasets (tens of terabytes).
Is there a list of the default locations where Galaxy stores data? As an
admin, that's my first question, but it isn't obvious from the
documentation, and for large datasets it's important to know.
(During testing, with Galaxy in the default location, /home/galaxy, an
attempt to upload and decompress a 50 GB file wreaked havoc on the server
due to heavy I/O. Pointing the tmp directory at a different filesystem, so
Galaxy wasn't reading and writing massive amounts of data to the same
filesystem at the same time, helped, but that would have been nice to know
ahead of time.)
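For reference, the workaround amounted to moving Galaxy's temp/new-file
location onto a separate filesystem. On my install it looked roughly like
the following (the setting name is from my version's config file and may
differ in yours; the path is just an example):

```ini
; universe_wsgi.ini -- relocate scratch/temp files away from the
; filesystem that holds the datasets themselves
new_file_path = /scratch/galaxy/tmp
```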
Is the database used for datasets at all, or just for user account data?
In other words, if users are crunching terabytes of data, do I need to
worry about the amount of space on the filesystem that hosts the database?
What exactly are the disadvantages of using MySQL instead of PostgreSQL?
Several places in the docs state that PostgreSQL is preferred, but not why.
Is it a big enough problem to justify installing a database I have no
experience with over one that is already installed and that I'm already
familiar with?
Is an MPI configuration necessary for getting full use out of a multicore
system? The docs seem to indicate that Galaxy will use multiple cores
("Without a cluster, you'll be limited to the number of cores in your
server...") but take pains to say that the GIL won't allow more than a
single thread ("This means that regardless of the number of cores in your
server, Galaxy can only use one" and "having a multi-core system will not
improve the Galaxy framework's performance out of the box since Galaxy can
use (at most) one core at a time").
We have a 48-core machine, and there will only be two or three users.
Thanks,
--steve