Split FASTQ file into paired and unpaired reads

23 Nov 2010

Hi all,

I've got a Python script which divides a FASTQ file containing a mixture
of paired, unpaired and orphan reads into valid pairs and single/orphan
reads.

Such a situation can occur after applying quality filtering to a file of
paired FASTQ reads. I've also had raw data supplied in this state from
a sequencing center.

It works by looking at the read names, and understands the /1 and /2
convention used by Illumina, the .f and .r suffixes which I understand
are common, and the Sanger convention as well. The only requirement (so
that only a single pass through the input is needed) is that the input
be sorted such that any pairs come as consecutive entries (forward
then reverse read).
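
To illustrate the single-pass idea described above, here is a minimal
sketch. It is not the actual script: it uses plain Python rather than
Biopython, assumes simple four-line FASTQ records, handles only the
/1, /2 and .f, .r suffixes (the Sanger convention is left out), and
collects results in lists instead of writing output files. The function
names parse_fastq, mate_template and split_pairs are hypothetical.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str, str]  # (title without '@', sequence, quality)

# Suffix pairs named in the post; the Sanger convention is omitted here.
SUFFIXES = (("/1", "/2"), (".f", ".r"))


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from four-line FASTQ text (assumes no line wrapping)."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        if not header.startswith("@"):
            raise ValueError("Expected '@' title line, got %r" % header)
        seq = next(it, "").rstrip("\n")
        next(it, "")  # the '+' separator line is ignored
        qual = next(it, "").rstrip("\n")
        yield header[1:], seq, qual


def mate_template(title: str) -> Tuple[str, Optional[str]]:
    """Return (template name, 'fwd' / 'rev' / None) for a read title."""
    parts = title.split(None, 1)
    read_id = parts[0] if parts else ""
    for fwd, rev in SUFFIXES:
        if read_id.endswith(fwd):
            return read_id[: -len(fwd)], "fwd"
        if read_id.endswith(rev):
            return read_id[: -len(rev)], "rev"
    return read_id, None


def split_pairs(
    records: Iterable[Record],
) -> Tuple[List[Tuple[Record, Record]], List[Record]]:
    """Split records into consecutive forward/reverse pairs and singles."""
    pairs: List[Tuple[Record, Record]] = []
    singles: List[Record] = []
    pending: Optional[Record] = None  # forward read waiting for its mate
    pending_template = ""
    for rec in records:
        template, role = mate_template(rec[0])
        if pending is not None:
            if role == "rev" and template == pending_template:
                pairs.append((pending, rec))
                pending = None
                continue
            singles.append(pending)  # the forward read had no mate
            pending = None
        if role == "fwd":
            pending, pending_template = rec, template
        else:
            singles.append(rec)  # unpaired read or orphan reverse read
    if pending is not None:
        singles.append(pending)
    return pairs, singles
```

Because only one forward read is ever held back, memory use stays
constant apart from the output lists, which is why the input must have
each pair as consecutive entries.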

Is there anything like this in Galaxy already that I have missed?

Would you consider merging such a tool into Galaxy? I haven't
written a wrapper XML file yet, and it also currently uses Biopython
for FASTQ parsing, but I could switch it to use the Galaxy FASTQ
code instead.

Regards,

Peter
