Thanks again for your comments. The points about huge file sizes and their "write
once" nature are convincing. Are the indexes you are talking about already
implemented in Galaxy? Is this how it supports its database-like join and subset operations?
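
To check that I understand the index idea, here is a minimal sketch in Python of how I picture it: the large file stays on disk untouched, and only the byte offset of each record goes into a small database table. The table name record_index, the function names, and the assumption that each line starts with a tab-terminated key are all mine, purely for illustration; I do not know how Galaxy or the UCSC browser actually lay this out.

```python
import sqlite3


def build_offset_index(path, conn):
    """Store the starting byte offset of every line of a flat file in SQLite.

    Assumes (hypothetically) that each line begins with a unique key
    followed by a tab, e.g. a gene or feature identifier.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS record_index "
        "(key TEXT PRIMARY KEY, offset INTEGER)"
    )
    with open(path, "rb") as fh:
        while True:
            offset = fh.tell()      # position where the next line starts
            line = fh.readline()
            if not line:            # end of file
                break
            key = line.split(b"\t", 1)[0].decode()
            conn.execute(
                "INSERT OR REPLACE INTO record_index VALUES (?, ?)",
                (key, offset),
            )
    conn.commit()


def fetch_record(path, conn, key):
    """Look up the offset in the database, then seek straight to the record."""
    row = conn.execute(
        "SELECT offset FROM record_index WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        return None
    with open(path, "rb") as fh:
        fh.seek(row[0])
        return fh.readline().decode().rstrip("\n")
```

The file is scanned once when the index is built; after that, a lookup is one small query plus one seek, which seems to fit the write-once access pattern.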
What about summarized downstream data types, such as gene intensities, p values from
statistical tests etc? Those would seem to be relatively low-volume and less immutable.
Suppose, as a simple example, I have a gene expression experiment with several samples (be
that arrays or RNA-Seq runs), assigned to 2 treatments. I want to set up a workflow that
would first summarize the data to get an expression value for each gene (or exon, or
transcribed region) in each run, and then do t tests to discover those that are
differentially expressed between the treatments. I'll need to support a project that
would perform similarly designed experiments over and over again, e.g. with different cell
lines and/or treatments. Although the raw data may remain as flat files in a Galaxy data
library, wouldn't it make sense to store the summarized data and t test p values in a database?
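
To make the question concrete, here is a minimal sketch in Python of the kind of store I have in mind, using SQLite. The table and column names (expression, test_result, experiment, treatment) and the functions are hypothetical, not anything Galaxy provides. The sketch only computes and stores a Welch t statistic per gene; turning that into a p value would need a statistics package and is left out.

```python
import math
import sqlite3
from statistics import mean, variance

SCHEMA = """
CREATE TABLE IF NOT EXISTS expression (
    experiment TEXT, gene TEXT, sample TEXT, treatment TEXT, value REAL,
    PRIMARY KEY (experiment, gene, sample)
);
CREATE TABLE IF NOT EXISTS test_result (
    experiment TEXT, gene TEXT, t_stat REAL,
    PRIMARY KEY (experiment, gene)
);
"""


def open_store(path=":memory:"):
    """Open (or create) the hypothetical summary database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def welch_t(a, b):
    """Welch's t statistic for two groups; each needs at least two values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else None


def run_tests(conn, experiment):
    """Compute one t statistic per gene and store it next to the summaries."""
    rows = conn.execute(
        "SELECT gene, treatment, value FROM expression WHERE experiment = ?",
        (experiment,),
    ).fetchall()
    by_gene = {}
    for gene, treatment, value in rows:
        by_gene.setdefault(gene, {}).setdefault(treatment, []).append(value)
    for gene, groups in by_gene.items():
        # Assumes exactly two treatments; sorted so the sign is reproducible.
        first, second = sorted(groups)
        t = welch_t(groups[first], groups[second])
        conn.execute(
            "INSERT OR REPLACE INTO test_result VALUES (?, ?, ?)",
            (experiment, gene, t),
        )
    conn.commit()
```

With every experiment keyed in the same two tables, comparing a gene across cell lines or treatments becomes a single query rather than a pass over many result files.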
On 09/01/10, James Taylor <james(a)jamestaylor.org> wrote:
Exactly. In addition, most relational databases are optimized for data
that can change, but the access pattern for our raw data is write once. We can implement
more efficient storage formats and indexes outside the database for this purpose.
On Aug 31, 2010, at 5:17 PM, Hiram Clawson wrote:
> Good afternoon Yury:
> Typical file sizes are currently running in the 10s and 100s of Gb for most
> work flows these days.
> It isn't practical to try and stuff such large single entities into a database.
> It is much more simple to compute indexes into the file and store the indexes in
> the database. We do this all the time at the UCSC genome browser.
> Yury Bukhman wrote:
>> Thank you, James, for your reply. I wonder if you could elaborate ...
Yury V. Bukhman, Ph.D.
Associate Scientist, Bioinformatics
Great Lakes Bioenergy Research Center
University of Wisconsin - Madison
445 Henry Mall, Rm. 513
Madison, WI 53706, USA
Phone: 608-890-2680 Fax: 608-890-2427