Peter and Daniel, thanks for the comments.
On 19/10/11 23:49, Peter Cock wrote:
> On Wed, Oct 19, 2011 at 2:31 PM, Daniel Blankenberg<dan(a)bx.psu.edu>
>> Hi Florent,
>> Sorry for the delay. I did try the patch out shortly after you sent
>> it, but it caused the functional tests to fail. I was able to fix the
>> issue and
>> allow the existing tests to start passing, but I've been bogged down
>> and haven't been able to perform a more thorough review of the code.
>> If you
>> could provide tests with files (e.g. for the tools affected) that
>> test the
>> new functionality, that would be a great help.
I'll have a look at that.
>> The use of partition removes python compatibility for<2.5, although
>> this is
>> a lesser/non-concern.
> I guess you could use split, but special case on there being no space.
>> Also, I'm not entirely sold on having the "Identifier line" being
>> parsed as
>> "identifier" +<space> + "description" instead a
> That is the normal convention, just like with FASTA.
The Bioperl and Biopython projects use this convention for FASTA and
>> This would mean that identifiers could not themselves contain spaces,
>> but "There is no standardization for identifiers" (so they could
>> have spaces?). Could two different reads be identified as "Read A"
>> and "Read
>> B", but then would no longer be uniquely identifiable as each would
>> then be
>> identified as "Read". If this added functionality were introduced as
>> optional behavior (e.g. a user needs to click a checkbox on the tools to
>> apply the id line splitting), these concerns can be mitigated.
> That is expected, "@Read A" and "@Read B" have the same
>> Peter, Florent, anyone else: I'd be very interested to hear your
>> thoughts on
>> the above, particularly in respect to known real-world data. For now,
>> discount SRA data from this discussion.
> See also the new Illumina 1.8 naming convention where they dropped
> the /1 and /2 and hid it in the description. It should be tested, but
> I think
> Florent's patch will work here (while the current Galaxy behaviour
I was not aware of this new naming. It seems like a terrible decision
from Illumina because now both reads in a pair technically have the same
ID (but a different description).
This is not quite the case. Here are two fastq header lines for a pair of
reads produced by Illumina's CASAVA 1.8:
The two key things to note, relevant to this discussion are:
1. A space character is used to split the fields into two groups.
This is actually a good thing, because that particular character can NEVER
appear in either a sequence or a quality line. This makes it easy to detect
name lines as those beginning with "@" (a valid quality character) and
also having a space. If you are writing a parser for the new Illumina
fastq format, please don't break the names on spaces!
2. Apart from the read number, encoded as the digit immediately following
the space, the two lines are identical--as they were with earlier CASAVA
versions. Why is this worse than two lines differing by "/1" vs.
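To make the two points concrete, here is a rough Python sketch. It is not taken from Florent's patch or from Galaxy; the function names are hypothetical, and any header strings you feed it are your own. It splits a name line at the first space only, using split() rather than partition() so it also runs on Python older than 2.5, and treats two reads as mates when everything except the read digit matches.

```python
def looks_like_new_style_name(line):
    """Heuristic from point 1: starts with '@' and contains a space."""
    return line.startswith("@") and " " in line


def split_name_line(line):
    """Return (first_group, second_group), split at the first space only.

    The second group is "" when the line contains no space.
    """
    if not line.startswith("@"):
        raise ValueError("not a name line: %r" % line)
    parts = line[1:].rstrip("\r\n").split(" ", 1)
    if len(parts) == 1:
        return parts[0], ""
    return parts[0], parts[1]


def are_mates(line1, line2):
    """Point 2: the lines are identical apart from the read digit (1 vs 2)."""
    first1, second1 = split_name_line(line1)
    first2, second2 = split_name_line(line2)
    read_digits = second1[:1] + second2[:1]
    return (first1 == first2
            and read_digits in ("12", "21")
            and second1[1:] == second2[1:])
```

Keeping both groups, instead of discarding everything after the space, is what keeps the two reads of a pair distinguishable.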
An additional improvement with the new naming convention is that flowcell
and run IDs, as well as a flag for not passing filters (where N means the
read does pass the filter), are now included.
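For anyone who wants those extra fields, a minimal Python sketch follows. It assumes the colon-separated field order that Illumina documents for CASAVA 1.8 (instrument, run ID, flowcell ID, lane, tile, x, y, then read number, filter flag, control number, index); the function name and dictionary keys are my own invention, so check the layout against your own files before relying on it.

```python
def parse_casava18_name(line):
    """Parse a CASAVA 1.8 style name line into a dict of fields.

    Assumed layout (verify against your own data):
    @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
    """
    first, second = line[1:].strip().split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = first.split(":")
    read, is_filtered, control, index = second.split(":")
    return {
        "instrument": instrument,
        "run_id": int(run_id),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read": int(read),
        # The flag marks reads that were filtered out, so "N" means passed.
        "passed_filter": is_filtered == "N",
        "control": int(control),
        "index": index,
    }
```

A line that does not match this layout raises ValueError from the tuple unpacking, which is probably what you want for a strict parser.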
Eric L. Cabot
University of Wisconsin-Madison