On Thu, Oct 6, 2011 at 3:48 PM, Duddy, John <jduddy@illumina.com> wrote:
I'd be up for something like that, although I have other tasks in the short term after I finish my parallelism work. I'd rather not have Galaxy do the compression/decompression, because that would not effectively utilize the distributed nature of many filesystems, such as Isilon, that our customers use.
Is that like a compressed filesystem, where there is probably less benefit to storing the data gzipped?
My parallelism work (second phase almost done) handles that by using a block-gzipped format and index files that allow the compute nodes to seek to the blocks they need and extract from there.
How similar is your block-gzipped approach to BGZF used in BAM?
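For reference, here is a minimal sketch of how BGZF random access works, using Biopython's Bio.bgzf module purely for illustration (the filename and record layout are invented, and I'm assuming your index files play much the same role as the virtual offsets recorded below):

    # Minimal sketch of BGZF-style random access via Biopython's Bio.bgzf.
    from Bio import bgzf

    # Write records into a BGZF file: a gzip-compatible series of blocks,
    # each at most 64KB, so any block can be decompressed independently.
    writer = bgzf.BgzfWriter("example.fastq.bgz", "wb")
    for i in range(100000):
        writer.write(b"@read%i\nACGT\n+\n!!!!\n" % i)
    writer.close()

    # Build a small index of virtual offsets (block start << 16 | offset
    # within the decompressed block) for the start of each FASTQ record.
    offsets = []
    reader = bgzf.BgzfReader("example.fastq.bgz", "rb")
    line_number = 0
    while True:
        voffset = reader.tell()  # virtual offset, cheap to record
        line = reader.readline()
        if not line:
            break
        if line_number % 4 == 0:  # first line of a FASTQ record
            offsets.append(voffset)
        line_number += 1
    reader.close()

    # A compute node can now seek straight to, say, record 50000 and
    # decompress only the block(s) it needs, not everything before them.
    reader = bgzf.BgzfReader("example.fastq.bgz", "rb")
    reader.seek(offsets[50000])
    print(reader.readline())
    reader.close()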
Another thing that should probably go along with this is an enhancement to metadata so that it can be fed in from the outside. We upload files by linking to file paths, and at that point we already know everything about the files (index information). So there is no need to decompress a 500GB file and read the whole thing just to count the lines - all you have to do is ask ;-}
I can see how that might be useful. If the index information is already known when the file is linked in, it could be handed to Galaxy directly - something along the lines of the sketch below.
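As a rough illustration only - the sidecar file name and keys here are hypothetical, not an existing Galaxy interface - the pipeline that produced the file could drop its known statistics next to the data, and the upload-by-path step could read them instead of scanning the file:

    # Hypothetical sketch (names and keys invented, not an existing Galaxy
    # API): the producing pipeline already knows the file's statistics, so
    # it writes them beside the data instead of Galaxy rescanning 500GB.
    import json
    import os

    def write_sidecar_metadata(data_path, sequences, data_lines, index_path):
        """Record externally known metadata next to the data file."""
        meta = {
            "sequences": sequences,    # e.g. number of reads
            "data_lines": data_lines,  # line count, no need to recount later
            "index": index_path,       # block-gzip index for parallel jobs
        }
        with open(data_path + ".meta.json", "w") as handle:
            json.dump(meta, handle)

    def load_sidecar_metadata(data_path):
        """An upload-by-path step could simply 'ask' for the metadata."""
        sidecar = data_path + ".meta.json"
        if os.path.exists(sidecar):
            with open(sidecar) as handle:
                return json.load(handle)
        return None  # fall back to scanning the file as happens today

Peter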