Thanks again for your comments. The points about huge file sizes and their "write once" nature are convincing. Are the indexes you are talking about already implemented in Galaxy? Is this how it supports its database-like join and subset operations?

What about summarized downstream data types, such as gene intensities, p-values from statistical tests, etc.? Those would seem to be relatively low-volume and less immutable.

Suppose, as a simple example, I have a gene expression experiment with several samples (be that arrays or RNA-Seq runs), assigned to 2 treatments. I want to set up a workflow that would first summarize the data to get an expression value for each gene (or exon, or transcribed region) in each run, and then do t-tests to discover those that are differentially expressed between the treatments. I'll need to support a project that would perform similarly designed experiments over and over again, e.g. with different cell lines and/or treatments. Although the raw data may remain as flat files in a Galaxy data library, wouldn't it make sense to store the summarized data and t-test p-values in a relational database?

Thanks.
Yury

On 09/01/10, James Taylor <james@jamestaylor.org> wrote:
Exactly. In addition, most relational databases are optimized for data that can change, but the access pattern for our raw data is write once. We can implement more efficient storage formats and indexes outside the database for this purpose.
On Aug 31, 2010, at 5:17 PM, Hiram Clawson wrote:
Good afternoon Yury:
Typical file sizes are running in the tens to hundreds of GB for most workflows these days. It isn't practical to try to stuff such large single entities into a database. It is much simpler to compute indexes into the file and store the indexes in the database. We do this all the time at the UCSC Genome Browser.
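
As a rough illustration of the idea (a minimal sketch only, not the actual UCSC or Galaxy code): the large file stays on disk, and the database holds just the byte offset and length of each record. The table name `line_index`, the helper names `build_index` and `fetch_record`, and the tab-delimited layout with a unique key in the first column are all assumptions made for this example.

```python
import os
import sqlite3
import tempfile


def build_index(data_path, conn):
    """Store the byte offset and length of every line, keyed by its first column.

    Assumes a tab-delimited text file whose first column is a unique key
    (hypothetical layout chosen for this sketch).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS line_index "
        "(key TEXT PRIMARY KEY, offset INTEGER, length INTEGER)"
    )
    offset = 0
    # Binary mode so that offsets are exact byte positions in the file.
    with open(data_path, "rb") as handle:
        for line in handle:
            key = line.split(b"\t", 1)[0].decode("ascii")
            conn.execute(
                "INSERT OR REPLACE INTO line_index VALUES (?, ?, ?)",
                (key, offset, len(line)),
            )
            offset += len(line)
    conn.commit()


def fetch_record(data_path, conn, key):
    """Look up the offset in the database, then seek straight to the record."""
    row = conn.execute(
        "SELECT offset, length FROM line_index WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        return None
    offset, length = row
    with open(data_path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length).decode("ascii").rstrip("\n")


if __name__ == "__main__":
    # Tiny stand-in for a very large flat file.
    with tempfile.NamedTemporaryFile("wb", suffix=".tsv", delete=False) as tmp:
        tmp.write(b"geneA\t1.5\t2.5\ngeneB\t3.0\t4.0\n")
    conn = sqlite3.connect(":memory:")
    build_index(tmp.name, conn)
    print(fetch_record(tmp.name, conn, "geneB"))
    os.unlink(tmp.name)
```

The database only ever holds one small row per record, so it stays compact no matter how large the data file grows, and a lookup costs one indexed query plus one seek.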
--Hiram
Yury Bukhman wrote:
Thank you, James, for your reply. I wonder if you could elaborate ...
galaxy-user mailing list galaxy-user@lists.bx.psu.edu http://lists.bx.psu.edu/listinfo/galaxy-user
-- jt
James Taylor
Assistant Professor
Department of Biology
Department of Mathematics & Computer Science
Emory University

--
Yury V. Bukhman, Ph.D.
Associate Scientist, Bioinformatics
Great Lakes Bioenergy Research Center
University of Wisconsin - Madison
445 Henry Mall, Rm. 513
Madison, WI 53706, USA
Phone: 608-890-2680
Fax: 608-890-2427
Email: ybukhman@glbrc.wisc.edu