[galaxy-user] preprocessing gDNA illumina paired end data for mapping/snp calling

23 Apr 2013

This question is about pre-processing whole-genome resequencing data before mapping it to a reference yeast strain.

I'm having trouble joining paired end data. I have two files per sample (read1 and read2).

I've successfully uploaded my fastq.gz files into Galaxy using FTP. For each strain I have two FASTQ files, one per read direction, labelled for example:
(for the left hand dir)   130104_7001240_0133_AH0854ADXX.lane_2.CCGTCC.1.fastq
(for the right hand dir)   130104_7001240_0133_AH0854ADXX.lane_2.CCGTCC.2.fastq

After grooming each file with FASTQ Groomer, I try to join them into a single file, but 0% of the reads get joined. So I think the header or read-direction tag is not in the correct format. For example, the groomed reads for the left-hand and right-hand files look like:

(for the left hand dir)   130104_7001240_0133_AH0854ADXX.lane_2.CCGTCC.1.fastq
@HWI-ST1240:133:H0854ADXX:2:1101:2716:1998 1:N:0:CCGTCC
NGTATGGAAGACGTAGAGTGGATGAAAATTTTGTGAAAAAAAAAAGCTTATAGGAACAAAAACATCCTTACATCTTCGGGTATTTCTTCTAGGGTTGAAGT
+
!!!%%%%%)))))**(*(!$()(((***(***')(**********)'))%!!%&%$$$$$####$$$$$$"$!!""!!##!!!!$$%$""!!"#$#!!!!!
@HWI-ST1240:133:H0854ADXX:2:1101:5045:1994 1:N:0:CCGTCC
NCCAGACACAGTTAACGCAACCTGACATGCAACAGTTATCGGGTTCTTGTGGTTTTGCAGGCACTTGGACACCTGCTATTTTCTTCGTTCCGCCGCTAAGC

(for the right hand dir)   130104_7001240_0133_AH0854ADXX.lane_2.CCGTCC.2.fastq
@HWI-ST1240:133:H0854ADXX:2:1101:2716:1998 2:N:0:CCGTCC
GCATAGTTACTTTTTGATCACTAACAACGATATATTATCGTTGAACAATTTACTACGCAAAACAGTTCACGTGATGTACGTCAGATAATTCACTGAAGGTA
+
$$$''''')))))++++++(+++++++++++++++++*+*++*++++++++++*++++*+++++*))))))''''&'&'%&%%%%$%%%'&%%%%%$%%$!
@HWI-ST1240:133:H0854ADXX:2:1101:5045:1994 2:N:0:CCGTCC
ATGTATTATAAGCCCGAATCAGATACTCAAATTTGAAAAAAGATATCTTTCTCCTCCGACATGGCCGAACTCATTTACATAAATAGCATAAATTAAACAGA

According to the wiki, I think the FASTQ headers should look something like this, with /1 and /2 marking the reads from each file of the pair:

@61CC3AAXX100125:7:118:2538:5577/1
GACACCTTTAATGTCTGAAAAGAGACATTCACCATCTATTCTCTTGGAGGGCTACCACCTAAGAGCCTTCATCCCC
+
?>CADFEEEBEDIEHHIDGGGEEEEHFFGIGIIFFIIEFHIIIIHIIFFIIIDEIIGIIIEHFFFIIEHIFA@?==
@61CC3AAXX100125:7:1:17320:13701/1
CTCAGAAGACCCTGAGAACATGTGCCCAAGGTGGTCACAGTGCATCTTAGTTTTGTACATTTTAGGGAGATATGAG
+
?BCAAADBBGGHGIDDDGHFEIFIIIIFGEIFIIFIGIGEFIIGGIIHEFFHHHIHEIFGHHIEFIIEECE?>@89

Any suggestions on how to get the files in the correct format/header to be able to join them?
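
To make concrete what I mean by changing the header, here is a minimal Python sketch (run outside Galaxy) of the rewrite I have in mind. It assumes every record is exactly four lines and that my headers always have the form "@<read id> <read number>:<filter flag>:<control number>:<index>", as in the examples above. The function names are just my own placeholders, and whether the joiner really needs the /1 and /2 form is my guess from the wiki, not something I have confirmed.

```python
import re

# Header of the form "@<read id> <read number>:<filter flag>:<control number>:<index>",
# e.g. "@HWI-ST1240:133:H0854ADXX:2:1101:2716:1998 1:N:0:CCGTCC".
# This pattern is an assumption based on the example reads in this post.
SPACE_TAG_HEADER = re.compile(r"^@(\S+) ([12]):[YN]:\d+:\S*$")


def to_slash_header(header):
    """Return '@<read id>/<read number>' if the header matches the pattern.

    Headers that do not match (for example ones already ending in /1 or /2)
    are returned unchanged.
    """
    header = header.rstrip("\n")
    match = SPACE_TAG_HEADER.match(header)
    if match is None:
        return header
    read_id, read_number = match.groups()
    return "@{}/{}".format(read_id, read_number)


def convert_fastq(lines):
    """Yield FASTQ lines with only the header line of each record rewritten.

    Assumes strict 4-line records, so a quality line that happens to start
    with '@' is never mistaken for a header.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        yield to_slash_header(line) if i % 4 == 0 else line
```

To use it on a file, I would open the groomed FASTQ, pass the file object to convert_fastq, and write each yielded line followed by a newline to a new file, once for the .1.fastq file and once for the .2.fastq file.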

One last question: which tool trims reads based on quality?

Thanks very much gentle people

Tim

Timothy Brennan