On Thu, Oct 20, 2011 at 2:15 PM, Eric Cabot <cabot@biotech.wisc.edu> wrote:
I was not aware of this new naming. It seems like a terrible decision from Illumina because now both reads in a pair technically have the same ID (but a different description).
This is not quite the case. Here are two fastq header lines for a pair of reads produced by Illumina's CASAVA 1.8:
@XYZZY:123:D0ABCDEFG:7:1101:1445:2057 1:N:0:CTTGTA
@XYZZY:123:D0ABCDEFG:7:1101:1445:2057 2:N:0:CTTGTA
Yes, Illumina gives both read 1 and read 2 the same template ID of XYZZY:123:D0ABCDEFG:7:1101:1445:2057 (much like the two reads would have the same ID in a SAM/BAM file).
The two key things to note, relevant to this discussion, are:
1. A space character is used to split the fields into two groups. This is actually a good thing, because that particular character can NEVER appear in either a sequence or a quality line. This makes it easy to detect name lines as those beginning with "@" (a valid quality character) and also having a space. If you are writing a parser for the new Illumina fastq format, please don't break the names on spaces!
Yes, you could use the space as a sanity test for *this* style of Illumina FASTQ, and have a bespoke parser which treats this all specially. But for a generic FASTQ parser you *should* split at the space. The point is that Illumina have changed the meaning of their FASTQ identifier: it used to be the template ID plus a /1 or /2 suffix, but now it is just the common template ID used for both reads.
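To make the point concrete, here is a minimal sketch of what a generic parser does with the two CASAVA 1.8 title lines quoted above. The function name split_title is made up for illustration and is not taken from any particular library; the only assumption is the usual convention that the identifier ends at the first whitespace.

```python
def split_title(title_line):
    """Split a FASTQ title line into (identifier, description).

    Hypothetical helper: the identifier is everything after the
    leading "@" up to the first whitespace; the rest is description.
    """
    text = title_line.rstrip("\n")
    if text.startswith("@"):
        text = text[1:]
    parts = text.split(None, 1)  # split once, at the first whitespace
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


# The two CASAVA 1.8 title lines from the example above.
read1 = "@XYZZY:123:D0ABCDEFG:7:1101:1445:2057 1:N:0:CTTGTA"
read2 = "@XYZZY:123:D0ABCDEFG:7:1101:1445:2057 2:N:0:CTTGTA"

id1, desc1 = split_title(read1)
id2, desc2 = split_title(read2)

# Both reads end up with the same identifier; only the
# descriptions differ, in the read number field.
print(id1 == id2)    # True
print(desc1, desc2)  # 1:N:0:CTTGTA 2:N:0:CTTGTA
```

With the older /1 and /2 naming there is no space in the title, so the same split leaves the suffix inside the identifier and the two reads stay distinct.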
2. Apart from the read number, encoded as the digit immediately following the space, the two lines are identical--as they were with earlier CASAVA versions. Why is this worse than two lines differing by "/1" vs. "/2"?
Because it is a change from the existing, well-established convention, which will require changes to hundreds of scripts and tools (a guessed number, including users' bespoke scripts).
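As a rough illustration of the kind of change involved, a script that relies on the old names could rebuild them from the new title lines along these lines. This is only a sketch: legacy_identifier is a hypothetical helper, and the regular expression assumes the description has the four colon-separated fields seen in the example above (read number, filter flag, control number, index sequence).

```python
import re

# Assumed layout of the CASAVA 1.8 description field:
# read number : filter flag (Y or N) : control number : index sequence
_CASAVA18_DESC = re.compile(r"^(\d+):([YN]):(\d+):(\S*)$")


def legacy_identifier(title_line):
    """Return an old-style identifier with a /1 or /2 suffix.

    Hypothetical helper. Title lines whose description does not look
    like the CASAVA 1.8 layout keep their identifier unchanged.
    """
    text = title_line.rstrip("\n")
    if text.startswith("@"):
        text = text[1:]
    parts = text.split(None, 1)  # identifier ends at the first whitespace
    if not parts:
        return ""
    identifier = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    match = _CASAVA18_DESC.match(description)
    if match is None:
        return identifier  # old-style or unrecognised title line
    return identifier + "/" + match.group(1)


print(legacy_identifier("@XYZZY:123:D0ABCDEFG:7:1101:1445:2057 2:N:0:CTTGTA"))
# XYZZY:123:D0ABCDEFG:7:1101:1445:2057/2
```

Every tool that pairs reads by name needs something equivalent, which is where the effort goes.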
An additional improvement with the new naming convention is that flowcell and run IDs, as well as a flag for not passing filters (where N means the read did pass the filter), are now included.
Yes, that is good.

Peter