On Thu, Oct 6, 2011 at 5:02 PM, Duddy, John <jduddy(a)illumina.com> wrote:
As I understand it, Isilon is built up from "bricks" that have storage
and compute power. They replicate files amongst themselves, so
that for every IO request there are multiple systems that could
respond. They are interconnected by an ultra fast fibre backbone.
So why not use gzipped files on top of that? Smaller chunks of
data to access so should be faster even with the decompression
once it gets to the CPU.
So, depending on your topology, it's possible to get a lot more
throughput by working on different sections of the same file from
different physical computers.
Nice.
I haven't delved into BGZF, so I can't comment. My approach to
block GZIP was just to concatenate multiple GZIP files and keep
a record of the offsets and sequences contained in each. The
advantage is compatibility, in that it decompresses just like it
was one big chunk, yet you can compose subsets of the data
without decompressing/recompressing and (as long as we
actually have to write out the file subsets) can reap the reduced
IO benefits of smaller writes.
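
A minimal sketch of the concatenated-GZIP idea described above, assuming
each chunk is compressed as its own gzip member and the offsets are kept
in a simple in-memory list. The function names (write_blocks, read_block,
subset) and the sample records are hypothetical, chosen only for
illustration; this is not the actual implementation discussed here.

```python
import gzip
import io


def write_blocks(chunks):
    """Compress each chunk as a separate gzip member and concatenate them.

    Returns the concatenated bytes plus an index of
    (start_offset, compressed_length) tuples, one per chunk.
    """
    out = io.BytesIO()
    index = []
    for chunk in chunks:
        member = gzip.compress(chunk)
        index.append((out.tell(), len(member)))
        out.write(member)
    return out.getvalue(), index


def read_block(data, index, i):
    """Decompress only the i-th member, using the recorded offsets."""
    start, length = index[i]
    return gzip.decompress(data[start:start + length])


def subset(data, index, wanted):
    """Compose a new gzip stream from selected members without recompressing."""
    return b"".join(data[s:s + n] for s, n in (index[i] for i in wanted))


# Hypothetical sample records, one per block.
chunks = [b">seq1\nACGT\n", b">seq2\nGGCC\n", b">seq3\nTTAA\n"]
data, index = write_blocks(chunks)
```

The whole stream still decompresses as if it were one big gzip file,
while any single block, or a subset of blocks copied byte for byte, can
be read on its own.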
That sounds VERY similar to BGZF - have a read over the
SAM specification which covers this. Basically they stick
the block size into the gzip headers, and the BAM index files
(BAI) use a 64-bit offset which is split into the BGZF block
offset and the offset within that decompressed block. See:
http://samtools.sourceforge.net/SAM1.pdf
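
As a rough sketch of the offset split described above: the SAM
specification packs the BGZF block's start position in the compressed
file into the upper 48 bits and the position within the decompressed
block into the lower 16 bits. The function names below are hypothetical
and used only for illustration.

```python
def make_virtual_offset(block_start, within_block):
    """Pack a BGZF block start and a within-block offset into 64 bits."""
    if not 0 <= block_start < 2 ** 48:
        raise ValueError("block start must fit in 48 bits")
    if not 0 <= within_block < 2 ** 16:
        raise ValueError("within-block offset must fit in 16 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Recover (block_start, within_block) from a 64-bit virtual offset."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF
```

The 16-bit lower part works because a decompressed BGZF block is at most
64 KiB, so any position inside one block fits in those bits.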
Peter