As I understand it, Isilon is built up from "bricks" that have both storage and
compute power. They replicate files amongst themselves, so for every IO request
there are multiple systems that could respond, and they are interconnected by an
ultra-fast fibre backbone.
So, depending on your topology, it's possible to get a lot more throughput by
working on different sections of the same file from different physical computers.
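To illustrate the idea, here is a rough Python sketch (nothing Isilon-specific;
the mount path and chunk size are just placeholders I made up) of several worker
processes each reading a different byte range of the same large file, so a
clustered filesystem can serve the ranges from different nodes:

    # Sketch: parallel range reads over one large file.
    import os
    from multiprocessing import Pool

    PATH = "/mnt/isilon/big_input.fastq"  # hypothetical mount point

    def read_range(args):
        start, length = args
        with open(PATH, "rb") as handle:
            handle.seek(start)
            return len(handle.read(length))  # stand-in for real per-chunk work

    if __name__ == "__main__":
        size = os.path.getsize(PATH)
        chunk = 64 * 1024 * 1024  # 64 MB slices, arbitrary choice
        ranges = [(offset, min(chunk, size - offset))
                  for offset in range(0, size, chunk)]
        with Pool(processes=8) as pool:
            total = sum(pool.map(read_range, ranges))
        print("bytes processed:", total)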
I haven't delved into BGZF, so I can't comment. My approach to block GZIP was
simply to concatenate multiple GZIP files and keep a record of the offsets and
sequences contained in each. The advantage is compatibility: it decompresses just
as if it were one big chunk, yet you can compose subsets of the data without
decompressing/recompressing and (as long as we actually have to write out the
file subsets) can reap the reduced-IO benefit of smaller writes.
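A toy sketch of that concatenation scheme (the function names and index layout
here are just for illustration, not our actual format):

    # Each batch is written as its own gzip member; byte offsets are recorded,
    # and a subset file is composed by copying raw compressed members without
    # decompressing. The result still reads as ordinary gzip data.
    import gzip
    import io

    def write_members(path, batches):
        """Write each (label, data) batch as a gzip member; return an offset index."""
        index = []
        with open(path, "wb") as out:
            for label, data in batches:
                member = io.BytesIO()
                with gzip.GzipFile(fileobj=member, mode="wb") as gz:
                    gz.write(data)
                blob = member.getvalue()
                index.append((out.tell(), len(blob), label))
                out.write(blob)
        return index

    def compose_subset(src, dst, index, wanted):
        """Copy only the compressed members whose label is in `wanted`."""
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            for offset, length, label in index:
                if label in wanted:
                    fin.seek(offset)
                    fout.write(fin.read(length))

    if __name__ == "__main__":
        batches = [("seq1", b"ACGT\n" * 1000), ("seq2", b"TTTT\n" * 1000)]
        idx = write_members("all.gz", batches)
        compose_subset("all.gz", "subset.gz", idx, {"seq2"})
        print(gzip.open("subset.gz", "rb").read(10))  # decompresses like normal gzip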
John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jduddy(a)illumina.com
-----Original Message-----
From: Peter Cock [mailto:p.j.a.cock@googlemail.com]
Sent: Thursday, October 06, 2011 8:16 AM
To: Duddy, John
Cc: Greg Von Kuster; galaxy-dev(a)lists.bx.psu.edu; Nate Coraor
Subject: Re: [galaxy-dev] Tool shed and datatypes
On Thu, Oct 6, 2011 at 3:48 PM, Duddy, John <jduddy(a)illumina.com> wrote:
> I'd be up for something like that, although I have other tasking in the
> short term after I finish my parallelism work. I'd rather not have Galaxy
> do the compression/decompression, because that will not effectively
> utilize the distributed nature of many filesystems, such as Isilon, that
> our customers use.
Is that like a compressed filesystem, where there is probably less
benefit to storing the data gzipped?
> My parallelism work (second phase almost done) handles that by using a
> block-gzipped format and index files that allow the compute nodes to seek
> to the blocks they need and extract from there.
How similar is your block-gzipped approach to BGZF used in BAM?
> Another thing that should probably go along with this is an enhancement
> to metadata such that it can be fed in from the outside. We upload files
> by linking to file paths, and at that point we know everything about the
> files (index information). So there's no need to decompress a 500GB file
> and read the whole thing just to count the lines - all you have to do is
> ask ;-}
I can see how that might be useful.
Peter