
It's not public yet, and it involves a little conundrum - we want it so we can support large amounts of data efficiently on a variety of aligners, including our ELAND from CASAVA. However, ELAND does not support unaligned BAM inputs yet, and apparently it would be a lot of work to make it so (and it is another team's area of responsibility as well). So in the near term, BGZF would not meet our needs.

However, work is quite far along on a GZIP-based approach that works with ELAND and BWA, since they both read GZIP FASTQ files, and works/will work with a converter to fastq_sanger for other tools. I can put you in touch with the engineer doing the work if you are interested.

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jduddy@illumina.com

-----Original Message-----
From: Peter Cock [mailto:p.j.a.cock@googlemail.com]
Sent: Tuesday, November 08, 2011 3:29 PM
To: Duddy, John
Cc: Greg Von Kuster; galaxy-dev@lists.bx.psu.edu; Nate Coraor
Subject: Re: [galaxy-dev] Tool shed and datatypes

On Thu, Oct 6, 2011 at 5:45 PM, Duddy, John <jduddy@illumina.com> wrote:
GZIP files are definitely our plan. I just finished testing the code that distributes the processing of a FASTQ (or pair for PE) to an arbitrary number of tasks, where each subtask extracts just the data it needs without reading any of the file it does not need. It extracts the blocks of GZIPped data into a standalone GZIP file just by copying whole blocks and appending them (if the window is not aligned perfectly, there is additional processing). Since the entire file does not need to be read, it distributes quite nicely.
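To illustrate the idea (this is only a rough sketch, not our actual CASAVA code - the extract_blocks helper and the precomputed (offset, length) spans are assumptions for the example): concatenated GZIP members are themselves a valid GZIP stream, so a subtask that knows its block boundaries can copy them byte-for-byte without decompressing anything:

    # Hypothetical sketch only: given precomputed (offset, length) spans that
    # each cover whole GZIP members assigned to this subtask, copy that slice
    # of the big FASTQ.gz verbatim into a standalone, decompressible .gz file.
    def extract_blocks(src_path, dest_path, block_spans):
        with open(src_path, "rb") as src, open(dest_path, "wb") as dest:
            for offset, length in block_spans:
                src.seek(offset)
                dest.write(src.read(length))  # whole compressed blocks, no decompression

    # The resulting chunk reads like any ordinary gzipped FASTQ, e.g.:
    #   import gzip
    #   with gzip.open("chunk_0001.fastq.gz") as handle:
    #       for line in handle:
    #           ...

Because the blocks are copied as-is, a subtask never has to read or decompress the parts of the file belonging to other tasks, which is what makes the distribution cheap.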
I'll be preparing a pull request for it soon.
John Duddy
Hi John,

Is your pull request public yet? I'd like to know more about your GZIP-based plan (and how it differs from BGZF). It would seem silly to reinvent something slightly different if an existing and well tested mechanism like BGZF (used in BAM files) would work.

BGZF is based on GZIP, with blocks of up to 64kb each, where the block size is recorded in the GZIP block header. This may be more fine grained than the block sizes you are using, but it should serve equally well for distributing chunks of data between machines/cores.

I appreciate that the SAM/BAM specification where BGZF is defined is quite dry reading, and that the broad potential of this GZIP variant beyond BAM is not articulated clearly. So I've written a blog post about how BGZF can be used for efficient random access to sequential files (in the sense of one self-contained record after another, e.g. many sequence file formats including FASTA & FASTQ):

http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html

I've also added a reference to BGZF on the open Galaxy feature request for general support of gzipped data types:

https://bitbucket.org/galaxy/galaxy-central/issue/666/

Regards,

Peter
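P.S. To show how little is needed to exploit those block headers, here is a rough sketch (Python, an illustration only - existing BGZF implementations such as the one in samtools do this properly; the bgzf_block_offsets name is just for this example) that walks a BGZF file and reports each block's offset and size without decompressing anything:

    import struct

    def bgzf_block_offsets(path):
        """Yield (offset, total_block_size) for each BGZF block, using only
        the GZIP headers - no decompression needed."""
        with open(path, "rb") as handle:
            offset = 0
            while True:
                header = handle.read(12)  # fixed part of the GZIP header
                if len(header) < 12:
                    break
                # ID1=0x1f, ID2=0x8b, and FLG must have the FEXTRA bit set
                assert header[0:2] == b"\x1f\x8b" and header[3] & 4
                xlen = struct.unpack("<H", header[10:12])[0]
                extra = handle.read(xlen)
                block_size = None
                pos = 0
                while pos + 4 <= xlen:  # walk the extra subfields
                    si1, si2 = extra[pos], extra[pos + 1]
                    slen = struct.unpack("<H", extra[pos + 2:pos + 4])[0]
                    if si1 == 66 and si2 == 67 and slen == 2:  # the 'BC' subfield
                        # BSIZE is the total block size minus one
                        block_size = struct.unpack("<H", extra[pos + 4:pos + 6])[0] + 1
                    pos += 4 + slen
                if block_size is None:
                    raise ValueError("Missing BC subfield - not a BGZF block?")
                yield offset, block_size
                offset += block_size
                handle.seek(offset)  # jump straight to the next block header

For example, list(bgzf_block_offsets("example.fastq.bgz")) (a hypothetical file name) gives the byte ranges that could be handed out to worker tasks. Since BSIZE is a 16-bit field, no block can exceed 64kb, which is what keeps both random access and chunked distribution cheap.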