We are planning to build a data warehouse for a research center that
utilizes multiple high-throughput experimental platforms, e.g. plate-based HTS assays,
microarrays of several different types, ChIP-seq, RNA-seq. We have been thinking of
managing the data in a relational database. Galaxy looks attractive to us for its
workflow management and data provenance features, e.g. to keep track of how raw data are
analyzed to produce normalized & summarized datasets and/or final sets of statistics
such as p-values. We wonder how amenable Galaxy would be to integration with a relational
database.
One possible scenario might be to have Galaxy import a dataset from a relational
database, run a workflow, then submit the results back to the database with the associated
history or link thereto.
This is certainly a reasonable possibility. You could have a Galaxy tool for submitting
data to your database. I would imagine such a tool would produce a Galaxy dataset as
output with whatever unique identifier is necessary to recover exactly that data from the
database for another analysis.
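
To make that idea concrete, here is a minimal sketch of what the core of such a submission tool might look like. It is not an existing Galaxy tool or API: SQLite stands in for whatever relational database you use, and the table names, column names, the helper functions (submit_results, write_receipt, fetch_results) and the example history URL are all hypothetical. The tool stores the result rows, then writes a small "receipt" file that Galaxy could treat as the output dataset; the receipt carries the identifier needed to pull exactly those rows back out later.

```python
import json
import sqlite3
import uuid


def submit_results(conn, rows, history_url):
    """Store (feature, p_value) rows and return the identifier that recovers them.

    Hypothetical schema: one 'submission' row per tool run, many 'result' rows.
    """
    submission_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute(
            "CREATE TABLE IF NOT EXISTS submission "
            "(id TEXT PRIMARY KEY, history_url TEXT)"
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS result "
            "(submission_id TEXT REFERENCES submission(id), "
            "feature TEXT, p_value REAL)"
        )
        conn.execute(
            "INSERT INTO submission VALUES (?, ?)", (submission_id, history_url)
        )
        conn.executemany(
            "INSERT INTO result VALUES (?, ?, ?)",
            [(submission_id, feature, p_value) for feature, p_value in rows],
        )
    return submission_id


def write_receipt(path, submission_id, history_url):
    """Write the file that would become the tool's output dataset in Galaxy."""
    with open(path, "w") as handle:
        json.dump(
            {"submission_id": submission_id, "history_url": history_url}, handle
        )


def fetch_results(conn, submission_id):
    """Recover exactly the rows stored under one submission identifier."""
    cursor = conn.execute(
        "SELECT feature, p_value FROM result "
        "WHERE submission_id = ? ORDER BY feature",
        (submission_id,),
    )
    return cursor.fetchall()
```

A later import tool could read the receipt dataset, pass its submission_id to something like fetch_results, and hand the rows to the next workflow step. Storing the history link alongside the submission is one way to keep the database record tied back to its provenance in Galaxy.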
Another possibility is to forgo the relational database altogether
and do all our data management within Galaxy.
I can only give you our experience from inside Galaxy. After initial analysis we made a
decision to store all data in Galaxy as files on disk, with metadata (data about data,
connections between datasets, workflows, et cetera) in a relational database. We feel this
decision has worked well. For the scale of data we see, as well as the wide variety of
different data types, a relational database did not, and still does not, seem practical to
Department of Biology
Department of Mathematics & Computer Science