FASTQ to FASTQSanger using Groomer question
Hello!

I am fairly new to using Galaxy and have a question about the FASTQ Groomer feature. I have 4 RNA-Seq raw data files that were just recently generated from Illumina's NGS instruments. I am aware that the first step to perform in Galaxy is FASTQ Groomer to convert the format to FASTQ Sanger. I presume that I would choose Illumina 1.3+ in the "Input FASTQ quality scores type" box. However, if I look at the raw data reads, I notice that Line 4 (which encodes the quality values for the sequence in Line 2) has values outside of the Illumina 1.3+ range (some of them fall into the Sanger range). I am enclosing the Quality Score Comparison figure along with some of the raw RNA-Seq data:

Quality Score Comparison

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126

S - Sanger        Phred+33,  93 values  (0, 93)   (0 to 60 expected in raw reads)
I - Illumina 1.3  Phred+64,  62 values  (0, 62)   (0 to 40 expected in raw reads)
X - Solexa        Solexa+64, 67 values  (-5, 62)  (-5 to 40 expected in raw reads)

Diagram adapted from http://en.wikipedia.org/wiki/FASTQ_format

RNA-Seq raw data

@HWI-ST156_294:7:1:1058:2165:0/1
CACCAACTCACAGCCACTCCGTGAGGCCAGCAAGGCAAGAACATTCATCTC
+
HHHHFGGHHHGFHHFHHEGHC<GGGEB.EE9D?DDEEEE4FFFCBB/.C=D
@HWI-ST156_294:7:1:1184:2191:0/1
CGTAAATCCATGTCTGACTTCTGGATAGCAAACACCAGCACCGCGTGGATG
+
EE;E=ECEEBE@EEEE=GBFGF/GFFC<FA;:@<8AEABB>A#########
@HWI-ST156_294:7:1:1018:2200:0/1
NCTGATTAAGGATAATGAGTTTTTAGTAGAACTAATGATGTTATTCCTTGG
+
###################################################
@HWI-ST156_294:7:1:1225:2217:0/1
GTTTTTGACTACACAAAGCACCCTTCTAAACCAGACCATTCTGGAGAATGA
+
FFCEFFFE?FEBDC?987::,3:<-9145,DA<:C9;+?############

As a test in FASTQ Groomer, I chose either Sanger or Illumina 1.3+ as the input quality scores type, and these are the results I got:

FASTQ Groomer on tn-read1 (using Sanger as input)
6.1 Gb
format: fastqsanger, database: mm9
Info: Groomed 45868679 sanger reads into sanger reads.
Based upon quality and sequence, the input data is valid for: sanger
Input ASCII range: '#'(35) - 'I'(73)
Input decimal range: 2 - 40

FASTQ Groomer on tn-read1 (using Illumina1.3+ as input)
6.1 Gb
format: fastqsanger, database: mm9
Info: Groomed 45868679 illumina reads into sanger reads.
Based upon quality and sequence, the input data is valid for: sanger
Input ASCII range: '#'(35) - 'I'(73)
Input decimal range: -29 - 9

Which one is right (I presume the Illumina 1.3+ one, but I can't find any sort of explanation)? I noticed that the "input decimal range" had different values (although they spanned the same length) depending on which input was chosen. What would happen downstream in TopHat if Sanger was used instead of Illumina 1.3+ for these files? Is there any other reading material, website, etc. that might help me better understand the quality scores and which to use? Any info/help would be greatly appreciated.

Thanks,
David

David K. Crossman, Ph.D.
Systems Biologist/Analyst/Statistician
Heflin Center for Genomic Science
University of Alabama at Birmingham
720 20th Street South
Kaul Room 420
Birmingham, AL 35294-0024
(205) 996-4045
(205) 996-4056 (fax)
dkcrossm@uab.edu
http://www.heflingenetics.uab.edu/
Hi David,

Your files appear to be of the Sanger FASTQ variant. As you have noticed, the info blurb provided by the Grooming tool provides information that should be used to confirm input types. While the 'Illumina 1.3+' FASTQ format does encode scores using a different ASCII range, it is my understanding that the scripts provided by the manufacturer to create FASTQ formatted files were enhanced to write out Sanger encoded quality scores. The correct Grooming path for your data is Sanger --> Sanger.

Please let us know if we can provide further assistance.

Thanks for using Galaxy,

Dan
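For readers who want to check the numbers in this thread by hand, here is a minimal Python sketch of the reasoning behind the "Input ASCII range" and "Input decimal range" lines. It is not the FASTQ Groomer's actual code. The function names are hypothetical, and the cutoffs 59 and 64 are taken from the Quality Score Comparison diagram in the original post (lowest Solexa+64 and lowest Phred+64 characters, respectively).

```python
# Quality strings copied from the four sample reads in the original post.
SAMPLE_QUALS = [
    "HHHHFGGHHHGFHHFHHEGHC<GGGEB.EE9D?DDEEEE4FFFCBB/.C=D",
    "EE;E=ECEEBE@EEEE=GBFGF/GFFC<FA;:@<8AEABB>A#########",
    "###################################################",
    "FFCEFFFE?FEBDC?987::,3:<-9145,DA<:C9;+?############",
]


def ascii_range(quals):
    """Return (lowest, highest) ASCII code seen across all quality strings."""
    codes = [ord(ch) for q in quals for ch in q]
    return min(codes), max(codes)


def decoded_range(quals, offset):
    """Return the (lowest, highest) score if the strings are read with `offset`."""
    lo, hi = ascii_range(quals)
    return lo - offset, hi - offset


def plausible_encodings(quals):
    """List the encodings whose ASCII range can contain the lowest character.

    Assumption: cutoffs follow the diagram, where Solexa+64 starts at
    ASCII 59 and Illumina 1.3+ (Phred+64) starts at ASCII 64.
    """
    lo, _ = ascii_range(quals)
    if lo < 59:
        return ["sanger"]
    if lo < 64:
        return ["sanger", "solexa"]
    return ["sanger", "solexa", "illumina1.3+"]


if __name__ == "__main__":
    print("ASCII range:", ascii_range(SAMPLE_QUALS))
    print("As Phred+33:", decoded_range(SAMPLE_QUALS, 33))
    print("As Phred+64:", decoded_range(SAMPLE_QUALS, 64))
    print("Plausible:", plausible_encodings(SAMPLE_QUALS))
```

On the four sample reads the lowest character is '#' (35) and the highest is 'H' (72), so reading them as Phred+33 gives scores 2 to 39, while reading them as Phred+64 gives -29 to 8. The full file also contained 'I' (73), which accounts for the 2 - 40 and -29 - 9 ranges in the two Groomer reports. Negative Phred scores are not meaningful, which is consistent with the data being Sanger encoded.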
___________________________________________________________ The Galaxy User list should be used for the discussion of Galaxy analysis and other features on the public server at usegalaxy.org. Please keep all replies on the list by using "reply all" in your mail client. For discussion of local Galaxy instances and the Galaxy source code, please use the Galaxy Development list:
http://lists.bx.psu.edu/listinfo/galaxy-dev
To manage your subscriptions to this and other Galaxy lists, please use the interface at:
On Mon, Mar 21, 2011 at 1:42 PM, David K Crossman <dkcrossm@uab.edu> wrote:
Hello!
I am fairly new to using Galaxy and have a question about the FASTQ Groomer feature. I have 4 RNA-Seq raw data files that were just recently generated from Illumina’s NGS instruments.
Very recently? If they are already using Illumina's CASAVA v1.8 pipeline then the FASTQ files will already be in the Sanger FASTQ format:

http://seqanswers.com/forums/showthread.php?t=8895

Peter
Illumina's technical support team told me two weeks ago that CASAVA 1.8 will not be released for at least six weeks. That makes it at least a month from now.

Do you know anyone outside the company who has used it yet? Beyond the moaning within the cited seqanswers thread, I'd be interested in hearing any first-hand impressions.

Eric
participants (4)
- Daniel Blankenberg
- David K Crossman
- Eric Cabot
- Peter Cock