Preprocessing gDNA Illumina paired-end data for mapping/SNP calling
This question is about pre-processing whole-genome resequencing data for mapping to a reference yeast strain.
I'm having trouble joining paired-end data. I have two files per sample (read1 and read2).
I've successfully uploaded my fastq.gz files into Galaxy using FTP. I have two FASTQ files per strain, one for each direction, labelled for example:
(for the left hand dir) 130104_7001240_0133_AH0854ADXX.lane_2.CCGTCC.1.fastq
(for the right hand dir) 130104_7001240_0133_AH0854ADXX.lane_2.CCGTCC.2.fastq
Once I groom each file using FASTQ Groomer, I try to join them to get a single file, and 0% of the reads are joined. So I think the header or directory is not in the correct format. E.g., the raw groomed reads for the left hand and right hand look like:
(for the left hand dir) 130104_7001240_0133_AH0854ADXX.lane_2.CCGTCC.1.fastq
@HWI-ST1240:133:H0854ADXX:2:1101:2716:1998 1:N:0:CCGTCC
NGTATGGAAGACGTAGAGTGGATGAAAATTTTGTGAAAAAAAAAAGCTTATAGGAACAAAAACATCCTTACATCTTCGGGTATTTCTTCTAGGGTTGAAGT
+
!!!%%%%%)))))**(*(!$()(((***(***')(**********)'))%!!%&%$$$$$####$$$$$$"$!!""!!##!!!!$$%$""!!"#$#!!!!!
@HWI-ST1240:133:H0854ADXX:2:1101:5045:1994 1:N:0:CCGTCC
NCCAGACACAGTTAACGCAACCTGACATGCAACAGTTATCGGGTTCTTGTGGTTTTGCAGGCACTTGGACACCTGCTATTTTCTTCGTTCCGCCGCTAAGC
(for the right hand dir) 130104_7001240_0133_AH0854ADXX.lane_2.CCGTCC.2.fastq
@HWI-ST1240:133:H0854ADXX:2:1101:2716:1998 2:N:0:CCGTCC
GCATAGTTACTTTTTGATCACTAACAACGATATATTATCGTTGAACAATTTACTACGCAAAACAGTTCACGTGATGTACGTCAGATAATTCACTGAAGGTA
+
$$$''''')))))++++++(+++++++++++++++++*+*++*++++++++++*++++*+++++*))))))''''&'&'%&%%%%$%%%'&%%%%%$%%$!
@HWI-ST1240:133:H0854ADXX:2:1101:5045:1994 2:N:0:CCGTCC
ATGTATTATAAGCCCGAATCAGATACTCAAATTTGAAAAAAGATATCTTTCTCCTCCGACATGGCCGAACTCATTTACATAAATAGCATAAATTAAACAGA
According to the wiki, I think the FASTQ format should look something like this, with /1 and /2 corresponding to each paired file:
@61CC3AAXX100125:7:118:2538:5577/1
GACACCTTTAATGTCTGAAAAGAGACATTCACCATCTATTCTCTTGGAGGGCTACCACCTAAGAGCCTTCATCCCC
+
?>CADFEEEBEDIEHHIDGGGEEEEHFFGIGIIFFIIEFHIIIIHIIFFIIIDEIIGIIIEHFFFIIEHIFA@?==
@61CC3AAXX100125:7:1:17320:13701/1
CTCAGAAGACCCTGAGAACATGTGCCCAAGGTGGTCACAGTGCATCTTAGTTTTGTACATTTTAGGGAGATATGAG
+
?BCAAADBBGGHGIDDDGHFEIFIIIIFGEIFIIFIGIGEFIIGGIIHEFFHHHIHEIFGHHIEFIIEECE?>@89
Any suggestions on how to get the files into the correct format/header so that I can join them?
Last question: what is the tool to trim reads based on quality again?
Thanks very much, gentle people
Tim
Hello Tim,

Yes, this is a known issue: a few of the tools do not work with sequences that use the newer identifier format. There is a bare-bones Trello ticket for the upgrade, but to be honest, it hasn't been prioritized because there are so few use cases: https://trello.com/c/bhurghHk

After running "FASTQ Groomer", it is a good idea to do some QC with "FastQC"; then "FASTQ Trimmer" or "FASTQ Quality Trimmer" are options for trimming. From there, just proceed with mapping. You can always filter the mapping results afterwards to restrict hits to those from properly paired data (see "NGS: SAM Tools -> Filter SAM"), or certain other tools offer this as an option on the downstream tool form itself.

So you don't need to filter twice with respect to pairs: once at the end is generally considered OK, as long as the input data all meet some minimum quality thresholds (to map optimally with the given tool/parameters and avoid processing problems).

Hopefully this helps,

Jen
Galaxy team

(In reply to Timothy Brennan's message of 4/23/13, 8:01 PM.)
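For readers who still want to try the join step, one possible workaround for the identifier mismatch described above is to rewrite the newer headers (`@NAME 1:N:0:CCGTCC`) into the older `/1` and `/2` style shown in the wiki example. The following is a minimal sketch only, not a Galaxy tool: the function names `to_legacy_header` and `convert_fastq` are hypothetical, it assumes strict four-line FASTQ records (no wrapped sequence lines), and whether the joiner then accepts the result is an assumption based on the wiki example quoted in the question.

```python
import re

# Newer Illumina header: "@<name> <read number>:<filtered Y/N>:<control bits>:<index>"
NEW_STYLE = re.compile(r"^@(\S+) ([12]):[YN]:\d+:(\S*)$")


def to_legacy_header(header: str) -> str:
    """Rewrite '@NAME 1:N:0:INDEX' as '@NAME/1'; return other headers unchanged."""
    header = header.rstrip("\n")
    match = NEW_STYLE.match(header)
    if match is None:
        return header
    name, read_number, _index = match.groups()
    return f"@{name}/{read_number}"


def convert_fastq(lines):
    """Yield FASTQ lines (newline stripped) with each record's header rewritten.

    Assumes strict 4-line records. The header is located by position rather
    than by a leading '@', because quality lines may also start with '@'.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        yield to_legacy_header(line) if i % 4 == 0 else line
```

Applied to the first read in the question, `to_legacy_header` would turn `@HWI-ST1240:133:H0854ADXX:2:1101:2716:1998 1:N:0:CCGTCC` into `@HWI-ST1240:133:H0854ADXX:2:1101:2716:1998/1`. Note that the index sequence is dropped from the header in this sketch.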
-- Jennifer Hillman-Jackson Galaxy Support and Training http://galaxyproject.org
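
The reply suggests restricting mapping results to properly paired reads after mapping. As a rough illustration of what such a filter checks, the sketch below keeps SAM header lines and only those alignment lines whose FLAG field has the 0x2 bit set, which the SAM specification defines as "each segment properly aligned according to the aligner". This is not the implementation of Galaxy's "Filter SAM" tool; the function names are hypothetical, and the input is assumed to be uncompressed, tab-delimited SAM text.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: each segment properly aligned according to the aligner


def is_properly_paired(sam_line: str) -> bool:
    """Return True if an alignment line has the 'properly paired' FLAG bit set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # FLAG is the second mandatory SAM column
    return bool(flag & PROPER_PAIR)


def filter_proper_pairs(lines):
    """Yield SAM header lines unchanged, plus properly paired alignment lines."""
    for line in lines:
        # Header lines start with '@'; read names (QNAME) may not.
        if line.startswith("@") or is_properly_paired(line):
            yield line
```

On the command line, `samtools view -f 2` applies the same FLAG test.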