Re: [galaxy-user] Patch for better FASTQ description handling

20 Oct 2011

On Thu, Oct 20, 2011 at 2:15 PM, Eric Cabot <cabot@biotech.wisc.edu> wrote:
...
...
I was not aware of this new naming. It seems like a terrible decision from
Illumina because now both reads in a pair technically have the same ID (but
a different description).
This is not quite the case. Here are two fastq header lines for a pair of
reads produced by Illumina's CASAVA 1.8:
@XYZZY:123:D0ABCDEFG:7:1101:1445:2057 1:N:0:CTTGTA
@XYZZY:123:D0ABCDEFG:7:1101:1445:2057 2:N:0:CTTGTA
Yes, Illumina gives both read 1 and read 2 the same template ID
of XYZZY:123:D0ABCDEFG:7:1101:1445:2057 (much like the
two reads would have the same ID in a SAM/BAM file).
...
The two key things to note, relevant to this discussion, are:
1. A space character is used to split the fields into two groups.
This is actually a good thing, because that particular character can NEVER
appear in either a sequence or a quality line. This makes it easy to detect
name lines as those beginning with "@" (a valid quality character) and also
having a space. If you are writing a parser for the new Illumina fastq
format, please don't break the names on spaces!
Yes, you could use the space as a sanity test for *this* style Illumina
FASTQ, and have a bespoke parser which treats this all specially.
But for a generic FASTQ parser you *should* split at the space.
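
As a rough sketch of what "split at the space" means for a generic parser
(Python; the function name split_title is made up for this example, and the
second title in the usage note is an invented old-style name):

```python
def split_title(title_line):
    """Split a FASTQ '@' title line into (identifier, description)."""
    if not title_line.startswith("@"):
        raise ValueError("FASTQ title lines must start with '@'")
    # Drop the leading '@' and the line ending, then split once at the
    # first run of whitespace. Anything after it is the description.
    parts = title_line[1:].rstrip("\r\n").split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description
```

With the CASAVA 1.8 example above this gives the shared template ID as the
identifier and "1:N:0:CTTGTA" as the description. A title with no space, such
as "@XYZZY_0001/1", comes back whole with an empty description.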

The point is Illumina have changed the meaning of their FASTQ
identifier, it used to be the template ID plus a /1 or /2 suffix, but
now it is just the common template ID used for both parts.
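
To illustrate the difference, here is a minimal sketch of how a script might
recover the template ID and read number under either convention. It is only
an illustration: the function name template_and_read is hypothetical, and it
assumes the old style always ends the identifier in /1 or /2 while the new
style starts the description with "1:" or "2:".

```python
def template_and_read(title_line):
    """Return (template_id, read_number) for either naming style.

    read_number is 1, 2, or None when it cannot be determined.
    """
    line = title_line.strip()
    if line.startswith("@"):
        line = line[1:]
    fields = line.split(None, 1)
    identifier = fields[0] if fields else ""
    description = fields[1] if len(fields) > 1 else ""

    # Older style: the identifier itself ends in /1 or /2.
    if identifier.endswith(("/1", "/2")):
        return identifier[:-2], int(identifier[-1])

    # CASAVA 1.8 style: the read number opens the description.
    if description[:2] in ("1:", "2:"):
        return identifier, int(description[0])

    # Unpaired or unrecognised naming.
    return identifier, None
```

Any tool that pairs reads by stripping /1 and /2 from the identifier needs
something like the second branch to keep working with the new files.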
...
2. Apart from the read number, encoded as the digit immediately following
the space, the two lines are identical--as they were with earlier CASAVA
versions.  Why is this worse than two lines differing by "/1" vs. "/2"?
Because it is a change from the existing well established convention,
which will require changes to hundreds of scripts and tools
(a guessed number, including users' bespoke scripts).
...
An additional improvement with the new naming convention is that flowcell
and run IDs, as well as a flag for not passing filters (where N means the
read did pass the filter), are now included.
Yes, that is good.
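
For anyone wanting to get at those extra fields, a bespoke parser for this
style of title could look roughly like the sketch below. The field names are
my own labels, assuming the CASAVA 1.8 layout of
instrument:run:flowcell:lane:tile:x:y followed by a space and
read:filter flag:control number:index. The function name is hypothetical.

```python
from collections import namedtuple

# Field names are illustrative labels, not official Illumina terminology.
Casava18Title = namedtuple(
    "Casava18Title",
    "instrument run flowcell lane tile x y read is_filtered control index",
)


def parse_casava18_title(title_line):
    """Parse a CASAVA 1.8 style FASTQ title line into its fields."""
    line = title_line.strip()
    if line.startswith("@"):
        line = line[1:]
    # The single space separates the template ID from the description.
    identifier, _, description = line.partition(" ")
    id_fields = identifier.split(":")
    desc_fields = description.split(":")
    if len(id_fields) != 7 or len(desc_fields) != 4:
        raise ValueError("Not a CASAVA 1.8 style title: %r" % title_line)
    instrument, run, flowcell, lane, tile, x, y = id_fields
    read, flag, control, index = desc_fields
    if flag not in ("Y", "N"):
        raise ValueError("Unexpected filter flag: %r" % flag)
    # "Y" means the read was filtered out; "N" means it passed the filter.
    return Casava18Title(
        instrument, int(run), flowcell, int(lane), int(tile),
        int(x), int(y), int(read), flag == "Y", int(control), index,
    )
```

For the first example title above, this reports run 123, flowcell D0ABCDEFG,
read 1, and is_filtered as False.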

Peter

Peter Cock