Re: [galaxy-user] integrating Galaxy with a relational data warehouse?
Thank you, James, for your reply. I wonder if you could elaborate on why storing the bulk of the data in a relational database seems impractical, or point me to a document where this is discussed at greater length.

Yury

On 08/31/10, James Taylor <james@jamestaylor.org> wrote:
Hi Yury,
we are planning to build a data warehouse for a research center that utilizes multiple high-throughput experimental platforms, e.g. plate-based HTS assays, microarrays of several different types, ChIP-seq, RNA-seq. We have been thinking of managing the data in a relational database. Galaxy looks attractive to us for its workflow management and data provenance features, e.g. to keep track of how raw data are analyzed to produce normalized & summarized datasets and/or final sets of statistics such as p values. We wonder how amenable Galaxy would be to integration with a relational data store.
One possible scenario might be to have Galaxy import a dataset from a relational database, run a workflow, then submit the results back to the database with the associated history or link thereto.
This is certainly a reasonable possibility. You could have a Galaxy tool for submitting data to your database. I would imagine such a tool would produce a Galaxy dataset as output with whatever unique identifier is necessary to recover exactly that data from the database for another analysis.
Another possibility is to forgo the relational database altogether and do all our data management within Galaxy.
I can only give you our experience from inside Galaxy. After initial analysis we made a decision to store all data in Galaxy as files on disk, with metadata (data about data, connections between datasets, workflows, et cetera) in a relational database. We feel this decision has worked well. For the scale of data we see, as well as the wide variety of different data types, a relational database did not, and still does not, seem practical to us.
-- jt
James Taylor
Assistant Professor
Department of Biology
Department of Mathematics & Computer Science
Emory University

--
Yury V. Bukhman, Ph.D.
Associate Scientist, Bioinformatics
Great Lakes Bioenergy Research Center
University of Wisconsin - Madison
445 Henry Mall, Rm. 513
Madison, WI 53706, USA
Phone: 608-890-2680
Fax: 608-890-2427
Email: ybukhman@glbrc.wisc.edu
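The arrangement James describes (bulk data as files on disk, with metadata and the connections between datasets in a relational database) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Galaxy's actual schema or code: SQLite stands in for the relational database, and the `dataset` table, its columns, and the `register_dataset` and `lineage` helpers are hypothetical names invented for the example.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical metadata schema: the database holds only small facts
# about each file, never the file contents themselves.
SCHEMA = """
CREATE TABLE IF NOT EXISTS dataset (
    id        INTEGER PRIMARY KEY,
    path      TEXT NOT NULL,                   -- where the bulk data lives on disk
    datatype  TEXT NOT NULL,                   -- e.g. 'tabular', 'fastq'
    size      INTEGER NOT NULL,                -- size in bytes
    sha256    TEXT NOT NULL,                   -- checksum of the file contents
    parent_id INTEGER REFERENCES dataset(id),  -- dataset this one was derived from
    tool      TEXT                             -- name of the step that produced it
);
"""


def register_dataset(conn, path, datatype, parent_id=None, tool=None):
    """Record a file that already exists on disk and return its dataset id."""
    data = Path(path).read_bytes()  # acceptable for a sketch; stream large files
    cur = conn.execute(
        "INSERT INTO dataset (path, datatype, size, sha256, parent_id, tool) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (str(path), datatype, len(data),
         hashlib.sha256(data).hexdigest(), parent_id, tool),
    )
    conn.commit()
    return cur.lastrowid


def lineage(conn, dataset_id):
    """Follow parent links back to the raw data; ids are returned newest first."""
    chain = []
    while dataset_id is not None:
        chain.append(dataset_id)
        row = conn.execute(
            "SELECT parent_id FROM dataset WHERE id = ?", (dataset_id,)
        ).fetchone()
        dataset_id = row[0] if row else None
    return chain


if __name__ == "__main__":
    workdir = Path(tempfile.mkdtemp())
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)

    raw = workdir / "raw.txt"
    raw.write_text("probe1\t10.0\nprobe2\t12.5\n")
    raw_id = register_dataset(conn, raw, "tabular")

    normalized = workdir / "normalized.txt"
    normalized.write_text("probe1\t0.44\nprobe2\t0.56\n")
    norm_id = register_dataset(conn, normalized, "tabular",
                               parent_id=raw_id, tool="normalize")

    print(lineage(conn, norm_id))
```

The point of the split is that the relational side stays small and queryable (which file came from which, produced by what) while the files themselves can be in whatever format suits each data type.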
Good afternoon Yury:

Typical file sizes are currently running in the tens and hundreds of GB for most workflows these days. It isn't practical to try to stuff such large single entities into a database. It is much simpler to compute indexes into the file and store the indexes in the database. We do this all the time at the UCSC Genome Browser.

--Hiram

Yury Bukhman wrote:
Thank you, James, for your reply. I wonder if you could elaborate ...
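Hiram's suggestion of computing indexes into a file and storing only the indexes in the database can be illustrated with a small sketch. It is not the UCSC Genome Browser's implementation: it assumes a tab-separated text file whose first column is the lookup key, uses SQLite as the database, and the `line_index` table and the `build_index` and `fetch` functions are hypothetical names chosen for the example.

```python
import sqlite3


def build_index(conn, data_path):
    """Store the byte offset and length of every line, keyed by its first column."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS line_index "
        "(record_key TEXT, byte_offset INTEGER, byte_length INTEGER)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS line_index_key ON line_index (record_key)"
    )
    offset = 0
    with open(data_path, "rb") as fh:  # binary mode keeps offsets exact
        for line in fh:
            key = line.rstrip(b"\n").split(b"\t", 1)[0].decode()
            conn.execute(
                "INSERT INTO line_index VALUES (?, ?, ?)",
                (key, offset, len(line)),
            )
            offset += len(line)
    conn.commit()


def fetch(conn, data_path, key):
    """Look up offsets in the database, then read only those bytes from the file."""
    rows = conn.execute(
        "SELECT byte_offset, byte_length FROM line_index "
        "WHERE record_key = ? ORDER BY byte_offset",
        (key,),
    ).fetchall()
    records = []
    with open(data_path, "rb") as fh:
        for byte_offset, byte_length in rows:
            fh.seek(byte_offset)
            records.append(fh.read(byte_length).decode().rstrip("\n"))
    return records
```

With this layout the database row for each record is a few dozen bytes regardless of how large the data file grows, and a lookup touches only the parts of the file it needs.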
Exactly. In addition, most relational databases are optimized for data that can change, but the access pattern for our raw data is write-once. We can implement more efficient storage formats and indexes outside the database for this purpose.

On Aug 31, 2010, at 5:17 PM, Hiram Clawson wrote:
Good afternoon Yury:
Typical file sizes are currently running in the tens and hundreds of GB for most workflows these days. It isn't practical to try to stuff such large single entities into a database. It is much simpler to compute indexes into the file and store the indexes in the database. We do this all the time at the UCSC Genome Browser.
--Hiram
Yury Bukhman wrote:
Thank you, James, for your reply. I wonder if you could elaborate ...
galaxy-user mailing list
galaxy-user@lists.bx.psu.edu
http://lists.bx.psu.edu/listinfo/galaxy-user
-- jt

James Taylor
Assistant Professor
Department of Biology
Department of Mathematics & Computer Science
Emory University