New subject: preprocessing gDNA illumina paired end data for mapping/snp calling

24 Apr 2013

This question concerns pre-processing whole-genome resequencing data before mapping it to a reference yeast strain.

I'm having trouble joining paired end data. I have two files per sample (read1 and read2).

I've successfully uploaded my fastq.gz files into Galaxy using FTP. I have two fastq files per strain, one for each read direction, labelled for example:
(for the left hand dir)   130104_7001240_0133_AH0854ADXX.lane_2.CCGTCC.1.fastq
(for the right hand dir)   130104_7001240_0133_AH0854ADXX.lane_2.CCGTCC.2.fastq

Once I groom each file using FASTQ Groomer, I try to join them into a single file, but 0% of the reads are joined. So I think the header or direction tag is not in the correct format. E.g., the groomed reads for the left hand and right hand files look like:

(for the left hand dir)   130104_7001240_0133_AH0854ADXX.lane_2.CCGTCC.1.fastq
@HWI-ST1240:133:H0854ADXX:2:1101:2716:1998 1:N:0:CCGTCC
NGTATGGAAGACGTAGAGTGGATGAAAATTTTGTGAAAAAAAAAAGCTTATAGGAACAAAAACATCCTTACATCTTCGGGTATTTCTTCTAGGGTTGAAGT
+
!!!%%%%%)))))**(*(!$()(((***(***')(**********)'))%!!%&%$$$$$####$$$$$$"$!!""!!##!!!!$$%$""!!"#$#!!!!!
@HWI-ST1240:133:H0854ADXX:2:1101:5045:1994 1:N:0:CCGTCC
NCCAGACACAGTTAACGCAACCTGACATGCAACAGTTATCGGGTTCTTGTGGTTTTGCAGGCACTTGGACACCTGCTATTTTCTTCGTTCCGCCGCTAAGC

(for the right hand dir)   130104_7001240_0133_AH0854ADXX.lane_2.CCGTCC.2.fastq
@HWI-ST1240:133:H0854ADXX:2:1101:2716:1998 2:N:0:CCGTCC
GCATAGTTACTTTTTGATCACTAACAACGATATATTATCGTTGAACAATTTACTACGCAAAACAGTTCACGTGATGTACGTCAGATAATTCACTGAAGGTA
+
$$$''''')))))++++++(+++++++++++++++++*+*++*++++++++++*++++*+++++*))))))''''&'&'%&%%%%$%%%'&%%%%%$%%$!
@HWI-ST1240:133:H0854ADXX:2:1101:5045:1994 2:N:0:CCGTCC
ATGTATTATAAGCCCGAATCAGATACTCAAATTTGAAAAAAGATATCTTTCTCCTCCGACATGGCCGAACTCATTTACATAAATAGCATAAATTAAACAGA

According to the wiki, I think the fastq headers should look something like this, with /1 and /2 identifying the read in each paired file:

@61CC3AAXX100125:7:118:2538:5577/1
GACACCTTTAATGTCTGAAAAGAGACATTCACCATCTATTCTCTTGGAGGGCTACCACCTAAGAGCCTTCATCCCC
+
?>CADFEEEBEDIEHHIDGGGEEEEHFFGIGIIFFIIEFHIIIIHIIFFIIIDEIIGIIIEHFFFIIEHIFA@?==
@61CC3AAXX100125:7:1:17320:13701/1
CTCAGAAGACCCTGAGAACATGTGCCCAAGGTGGTCACAGTGCATCTTAGTTTTGTACATTTTAGGGAGATATGAG
+
?BCAAADBBGGHGIDDDGHFEIFIIIIFGEIFIIFIGIGEFIIGGIIHEFFHHHIHEIFGHHIEFIIEECE?>@89

Any suggestions on how to get the files in the correct format/header to be able to join them?
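
One way to reconcile the two header styles shown above can be sketched in Python. This is only an illustration under several assumptions: the groomed files use the newer header style seen in my data (read ID, a space, then read:filter:control:index), every record is exactly four lines, and the join step really does expect /1 and /2 suffixes as the wiki example suggests. The names to_slash_header and convert_records are made up for this sketch and are not part of Galaxy.

```python
import re

# Assumed header layout: "@<read id> <read number>:<Y|N>:<control>:<index>"
NEW_STYLE_HEADER = re.compile(r"^@(\S+) ([12]):[YN]:\d+:\S*$")


def to_slash_header(header: str) -> str:
    """Rewrite '@<id> <read>:N:0:<index>' as '@<id>/<read>'.

    Headers that do not match the assumed layout are returned unchanged.
    """
    header = header.rstrip("\n")
    match = NEW_STYLE_HEADER.match(header)
    if match is None:
        return header
    read_id, read_number = match.groups()
    return f"@{read_id}/{read_number}"


def convert_records(lines):
    """Yield FASTQ lines with only the header line of each record rewritten.

    The header is located by position (every fourth line) rather than by a
    leading '@', because quality strings may also begin with '@'.
    """
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        yield to_slash_header(line) if index % 4 == 0 else line
```

Running both the read 1 and read 2 files through a conversion like this would leave the two mates of a pair with identical IDs apart from the trailing /1 or /2.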

Last question: which tool trims reads based on quality?
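
To make clear what I mean by quality trimming, here is a minimal sketch of the idea, not any particular Galaxy tool. It assumes Sanger-encoded qualities (ASCII offset 33, which is what groomed files should contain) and simply drops bases from the 3' end while their score is below a cutoff. The function name trim_3prime and the default cutoff of 20 are illustrative choices only.

```python
def trim_3prime(sequence: str, quality: str, min_phred: int = 20, offset: int = 33):
    """Remove trailing bases whose Phred score is below min_phred.

    Returns the trimmed (sequence, quality) pair; both may be empty if
    every base falls below the cutoff.
    """
    end = len(quality)
    # Walk back from the 3' end until a base meets the cutoff.
    while end > 0 and ord(quality[end - 1]) - offset < min_phred:
        end -= 1
    return sequence[:end], quality[:end]
```

A real trimming tool would typically offer more options, such as sliding windows or trimming from both ends.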

Thanks very much gentle people

Tim


Timothy Brennan

Jennifer Jackson
