Ahh - sorry. I finally found the BGZF format description in the SAM format
specification, and it seems that it is 100% GZIP-compatible. There is still the issue of
needing an external index file, since all BGZF seems to give you is the size of the
compressed block, not anything format-specific, like the number of sequences in the
block.

In any case, whether it's GZIP or BGZF, it seems the solutions are very similar, and
porting my work should be pretty simple - I just used larger blocks and put all the data
in the index file and none in the headers.
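
To make the idea concrete, here is a rough Python sketch of the general approach (larger GZIP blocks, with all per-block data kept in an external index). It is only an illustration under assumptions, not the actual implementation: the function names, the index fields (offset, size, n_seqs) and the block size are all made up for the example.

```python
import gzip
import io


def write_blocked_gzip(records, records_per_block):
    """Compress records as concatenated gzip members and build an index.

    Hypothetical layout: one index entry per block, holding the block's
    byte offset, compressed size, and number of sequences.
    """
    out = io.BytesIO()
    index = []
    for start in range(0, len(records), records_per_block):
        chunk = records[start:start + records_per_block]
        member = gzip.compress("".join(chunk).encode("ascii"))
        index.append(
            {"offset": out.tell(), "size": len(member), "n_seqs": len(chunk)}
        )
        out.write(member)
    return out.getvalue(), index


def read_block(data, entry):
    """Decompress a single block using only its index entry."""
    member = data[entry["offset"]:entry["offset"] + entry["size"]]
    return gzip.decompress(member).decode("ascii")


# Ten tiny FASTQ records, four per block (sizes chosen only for the demo).
records = [f"@read{i}\nACGT\n+\nIIII\n" for i in range(10)]
data, index = write_blocked_gzip(records, records_per_block=4)

# Random access to the last block, without touching the first two.
last_block = read_block(data, index[-1])
```

Because each block is a complete gzip member, the whole file still decompresses with any ordinary GZIP reader, and the index is only needed for random access.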
Sr. Staff Software Engineer
9885 Towne Centre Drive
San Diego, CA 92121
From: Peter Cock [mailto:firstname.lastname@example.org]
Sent: Tuesday, November 08, 2011 4:04 PM
To: Duddy, John
Cc: Greg Von Kuster; galaxy-dev(a)lists.bx.psu.edu; Nate Coraor
Subject: Re: [galaxy-dev] Tool shed and datatypes
On Tue, Nov 8, 2011 at 11:45 PM, Duddy, John <jduddy(a)illumina.com> wrote:
> It's not public yet, and it involves a little conundrum - we
> it so we can support large amounts of data efficiently on a variety
> of aligners, including our ELAND from CASAVA. However, ELAND
> does not support unaligned BAM inputs yet, and apparently it
> would be a lot of work to make it so (and another team's area
> of responsibility as well).
OK, so using (unaligned) BAM isn't about to happen.
> So in the near term, BGZF would not meet our needs.
I don't follow you there, BAM != BGZF.
We can use BGZF to compress FASTQ, FASTA, GenBank,
basically anything. You get compression approaching that
of plain GZIP (depending on the characteristics of the data)
plus efficient random access.
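
As a small illustration of why this works, here is a Python sketch that builds BGZF-style blocks by hand and reads them back. It assumes the block layout described in the SAM specification (a gzip member whose extra field carries a "BC" subfield holding the total block size minus one); the helper names make_bgzf_block and bgzf_block_size are invented for this example, and a real file would normally be written with an existing BGZF library.

```python
import gzip
import struct
import zlib


def make_bgzf_block(payload: bytes) -> bytes:
    """Wrap payload (under 64 KiB) in one BGZF-style gzip member."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw DEFLATE stream
    cdata = comp.compress(payload) + comp.flush()
    bsize = 12 + 6 + len(cdata) + 8 - 1  # total block size minus 1
    # ID1, ID2, CM=deflate, FLG=FEXTRA, MTIME, XFL, OS=unknown, XLEN=6
    header = struct.pack("<BBBBIBBH", 31, 139, 8, 4, 0, 0, 255, 6)
    # Extra subfield: 'B', 'C', SLEN=2, BSIZE
    extra = struct.pack("<BBHH", 66, 67, 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + extra + cdata + trailer


def bgzf_block_size(data: bytes, offset: int = 0) -> int:
    """Return the total size of the block starting at offset."""
    id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack_from(
        "<BBBBIBBH", data, offset
    )
    if (id1, id2, cm) != (31, 139, 8) or not flg & 4:
        raise ValueError("not a BGZF block")
    pos = offset + 12
    end = pos + xlen
    while pos < end:
        si1, si2, slen = struct.unpack_from("<BBH", data, pos)
        if (si1, si2) == (66, 67) and slen == 2:
            (bsize,) = struct.unpack_from("<H", data, pos + 4)
            return bsize + 1
        pos += 4 + slen
    raise ValueError("no BC subfield found")


first = b"@r1\nACGT\n+\nIIII\n"
second = b"@r2\nTTTT\n+\nIIII\n"
blocks = make_bgzf_block(first) + make_bgzf_block(second)

# Hop from block to block using only the sizes stored in the headers.
second_offset = bgzf_block_size(blocks, 0)

# An ordinary GZIP reader sees the same bytes as a multi-member gzip file.
plain = gzip.decompress(blocks)
```

The block size in the header lets a reader skip from block to block without decompressing anything, which is the basis for random access. Anything format-specific, such as record counts, still has to live in a separate index.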
> However, work is quite far along on a GZIP-based one
> that works with ELAND and BWA, since they both read
> GZIP FASTQ files, and works/will work with a converter
> to fastq_sanger for other tools.
> I can put you in touch with the engineer doing the work if
> you are interested.
That might be a good idea, or ask them to post here?