On Thu, Oct 6, 2011 at 5:45 PM, Duddy, John <jduddy(a)illumina.com> wrote:
GZIP files are definitely our plan. I just finished testing the code
that distributes the processing of a FASTQ (or pair for PE) to an
arbitrary number of tasks, where each subtask extracts just the
data it needs without reading any of the file it does not need. It
extracts the blocks of GZIPped data into a standalone GZIP file
just by copying whole blocks and appending them (if the window
is not aligned perfectly, there is additional processing). Since
the entire file does not need to be read, it distributes quite nicely.
I'll be preparing a pull request for it soon.
John Duddy
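As a rough illustration of why copying and appending whole compressed units can work at all: the GZIP format allows several complete GZIP members to be concatenated into one valid file. The sketch below uses only the Python standard library. The FASTQ records are made up, and this is not John's code; his approach may work at a different level of the format.

```python
import gzip

# Hypothetical FASTQ records, each compressed as its own GZIP member.
records = [
    b"@r1\nACGT\n+\nIIII\n",
    b"@r2\nTTGA\n+\nIIII\n",
    b"@r3\nGGCC\n+\nIIII\n",
]
members = [gzip.compress(record) for record in records]

# Byte-for-byte concatenation of whole members, with no recompression.
# The result is itself a valid (multi-member) GZIP stream.
subset = members[1] + members[2]

# Decompressing the subset yields only the records that were copied.
assert gzip.decompress(subset) == records[1] + records[2]
```

A task that knows the byte range of the members it needs can therefore build its own standalone file without reading the rest.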
Hi John,
Is your pull request public yet? I'd like to know more about
your GZIP-based plan (and how it differs from BGZF). It
would seem silly to reinvent something slightly different
if an existing and well-tested mechanism like BGZF (used
in BAM files) would work.
BGZF is based on GZIP, with blocks each up to 64 KiB,
where the block size is recorded in the GZIP block
header. This may be more fine-grained than the block
sizes you are using, but should serve equally well for
distribution of data chunks between machines/cores.
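To make the block header point concrete, here is a minimal Python sketch, assuming the BGZF layout given in the SAM/BAM specification (a GZIP extra subfield tagged "BC" holding the total block size minus one). The helper names make_bgzf_block and iter_bgzf_blocks are hypothetical and exist only for this illustration; real code would normally use an existing BGZF library.

```python
import gzip
import struct
import zlib

def make_bgzf_block(data: bytes) -> bytes:
    """Build one BGZF-style block (hypothetical helper, small inputs only)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no wrapper
    cdata = comp.compress(data) + comp.flush()
    bsize = len(cdata) + 25  # total block length minus 1
    if bsize > 0xFFFF:
        raise ValueError("data too large for a single block")
    # ID1, ID2, CM=deflate, FLG=FEXTRA, MTIME, XFL, OS=unknown, XLEN=6
    header = struct.pack("<BBBBIBBH", 31, 139, 8, 4, 0, 0, 255, 6)
    # Extra subfield: 'B', 'C', SLEN=2, BSIZE
    extra = struct.pack("<BBHH", 66, 67, 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return header + extra + cdata + trailer

def iter_bgzf_blocks(buf: bytes):
    """Yield (start, length) per block using only the headers."""
    offset = 0
    while offset < len(buf):
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack_from(
            "<BBBBIBBH", buf, offset)
        if (id1, id2, cm) != (31, 139, 8) or not flg & 4:
            raise ValueError("not a BGZF block at offset %d" % offset)
        pos, end, bsize = offset + 12, offset + 12 + xlen, None
        while pos < end:  # scan the extra subfields for the 'BC' entry
            si1, si2, slen = struct.unpack_from("<BBH", buf, pos)
            if (si1, si2, slen) == (66, 67, 2):
                bsize = struct.unpack_from("<H", buf, pos + 4)[0]
            pos += 4 + slen
        if bsize is None:
            raise ValueError("missing BC subfield at offset %d" % offset)
        yield offset, bsize + 1
        offset += bsize + 1  # jump to the next block without decompressing

# Made-up records, one per block, purely for demonstration.
chunks = [b"@r1\nACGT\n+\nIIII\n", b"@r2\nTTGA\n+\nIIII\n"]
bgzf = b"".join(make_bgzf_block(chunk) for chunk in chunks)
blocks = list(iter_bgzf_blocks(bgzf))

# Each block is a complete GZIP member, so it decompresses on its own.
start, length = blocks[1]
assert gzip.decompress(bgzf[start:start + length]) == chunks[1]
```

Because the size is in the header, the block boundaries can be found by hopping from header to header, which is what makes handing out chunks to separate machines/cores cheap.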
I appreciate that the SAM/BAM specification, where BGZF
is defined, is quite dry reading, and the broad potential of
this GZIP variant beyond BAM is not articulated clearly.
So I've written a blog post about how BGZF can be used
for efficient random access to sequential files (in the
sense of one self-contained record after another, e.g.
many sequence file formats including FASTA & FASTQ):
http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html
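For a flavour of how the random access is addressed, the SAM/BAM specification uses a 64-bit "virtual offset": the upper 48 bits hold the start of a BGZF block in the compressed file, and the lower 16 bits hold a position within that block's decompressed data. The function names below are hypothetical and the numbers are made up; this is only a sketch of the arithmetic.

```python
def make_virtual_offset(block_start: int, within_block: int) -> int:
    """Pack a compressed block start and an in-block offset into one integer."""
    if not 0 <= within_block < (1 << 16):
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < (1 << 48):
        raise ValueError("block start must fit in 48 bits")
    return (block_start << 16) | within_block

def split_virtual_offset(virtual_offset: int):
    """Recover (block_start, within_block) from a virtual offset."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF

# A record starting 10 bytes into the block that begins at byte 100000.
voffset = make_virtual_offset(100000, 10)
assert split_virtual_offset(voffset) == (100000, 10)
```

An index of one virtual offset per record is then enough to seek to the right block, decompress just that block, and skip to the record.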
I've also added a reference to BGZF on the open
Galaxy feature request for general support of gzipped
data types:
https://bitbucket.org/galaxy/galaxy-central/issue/666/
Regards,
Peter