Re: [galaxy-user] integrating Galaxy with a relational data warehouse?

31 Aug 2010


Hi Yury,
...
we are planning to build a data warehouse for a research center that utilizes multiple high-throughput experimental platforms, e.g. plate-based HTS assays, microarrays of several different types, ChIP-seq, RNA-seq.  We have been thinking of managing the data in a relational database.  Galaxy looks attractive to us for its workflow management and data provenance features, e.g. to keep track of how raw data are analyzed to produce normalized & summarized datasets and/or final sets of statistics such as p values.  We wonder how amenable Galaxy would be to integration with a relational data store.
One possible scenario might be to have Galaxy import a dataset from a relational database, run a workflow, then submit the results back to the database with the associated history or link thereto.
This is certainly a reasonable possibility. You could have a Galaxy tool for submitting data to your database. I would imagine such a tool would produce a Galaxy dataset as output with whatever unique identifier is necessary to recover exactly that data from the database for another analysis.
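As a rough illustration of that idea, here is a minimal sketch of the script such a tool might wrap. It is not Galaxy's tool API: SQLite stands in for the warehouse, and the table name `submissions`, the function names, and the JSON "receipt" format are all assumptions made up for this example. The script stores the result file in the database and writes a small output file carrying the identifier needed to fetch the same data again.

```python
import json
import sqlite3
import uuid


def init_warehouse(conn):
    # Hypothetical one-table schema; a real warehouse would be far richer.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS submissions ("
        "id TEXT PRIMARY KEY, payload BLOB NOT NULL)"
    )


def submit_to_warehouse(conn, input_path, receipt_path):
    """Store the file's bytes and write a receipt holding the new identifier."""
    with open(input_path, "rb") as fh:
        payload = fh.read()
    record_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO submissions (id, payload) VALUES (?, ?)",
            (record_id, payload),
        )
    # The receipt is what the tool would hand back as its output dataset.
    with open(receipt_path, "w") as fh:
        json.dump({"warehouse_id": record_id}, fh)
    return record_id


def fetch_from_warehouse(conn, receipt_path):
    """Recover exactly the submitted bytes using the identifier in the receipt."""
    with open(receipt_path) as fh:
        record_id = json.load(fh)["warehouse_id"]
    row = conn.execute(
        "SELECT payload FROM submissions WHERE id = ?", (record_id,)
    ).fetchone()
    if row is None:
        raise KeyError(record_id)
    return row[0]
```

A later analysis could then take the receipt as its input and call `fetch_from_warehouse` to pull the same data back out.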
...
Another possibility is to forgo the relational database altogether and do all our data management within Galaxy.
I can only give you our experience from inside Galaxy. After initial analysis we made a decision to store all data in Galaxy as files on disk, with metadata (data about data, connections between datasets, workflows, et cetera) in a relational database. We feel this decision has worked well. For the scale of data we see, as well as the wide variety of different data types, a relational database did not, and still does not, seem practical to us.
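To make the split concrete, here is a minimal sketch of the general pattern, not Galaxy's actual schema or code: the data stay in files on disk, and only the metadata (location, datatype, size, and which dataset each one was derived from) go into a relational table. SQLite, the `dataset` table, and the function names are assumptions chosen for illustration.

```python
import shutil
import sqlite3
from pathlib import Path


def init_metadata(conn):
    # Hypothetical metadata table: one row per file, with a link to its parent.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dataset ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "path TEXT NOT NULL, datatype TEXT NOT NULL, "
        "size_bytes INTEGER NOT NULL, "
        "parent_id INTEGER REFERENCES dataset(id))"
    )


def register_dataset(conn, store_dir, src_path, datatype, parent_id=None):
    """Copy a file into the on-disk store and record its metadata."""
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    with conn:  # one transaction for the insert and the follow-up update
        cur = conn.execute(
            "INSERT INTO dataset (path, datatype, size_bytes, parent_id) "
            "VALUES ('', ?, 0, ?)",
            (datatype, parent_id),
        )
        dataset_id = cur.lastrowid
        # Name the stored file after its row id so the two cannot drift apart.
        dest = store_dir / f"dataset_{dataset_id}.dat"
        shutil.copyfile(src_path, dest)
        conn.execute(
            "UPDATE dataset SET path = ?, size_bytes = ? WHERE id = ?",
            (str(dest), dest.stat().st_size, dataset_id),
        )
    return dataset_id


def lineage(conn, dataset_id):
    """Walk parent links from a dataset back to its original input."""
    chain = []
    current = dataset_id
    while current is not None:
        row = conn.execute(
            "SELECT id, datatype, parent_id FROM dataset WHERE id = ?",
            (current,),
        ).fetchone()
        if row is None:
            raise KeyError(current)
        chain.append((row[0], row[1]))
        current = row[2]
    return chain
```

With this layout the database stays small whatever the file sizes, and a new data type only needs a new `datatype` value rather than a new table design.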

-- jt

James Taylor
Assistant Professor
Department of Biology
Department of Mathematics & Computer Science
Emory University