Hello,

I'd like to share a couple of first impressions after testing the updated NGS version.

===============
The biggest issue for me is the support for Solexa formats. (Obviously, I'm biased, and think that Solexa machines are much more common/useful than other types).

Uploading a FASTQ file automatically detects the type (fastqsolexa or fastqsanger). That's a good start, but there's a catch: if I'm not aware that there are multiple types of FASTQ files, I'm stuck.

The Bowtie and BWA tools don't display my FASTQ file as a possible input dataset. A novice user will need to notice two things:
1. the small warning on the tools' pages: "BWA accepts files in Sanger FASTQ format." "Must have Sanger-scaled quality values with ASCII offset 33"
2. the fact that the file type is "fastqsolexa" (on the expanded green dataset rectangle).

Even if he does notice, there's no way for him to know what to do...

The easier (and wrong) solution is to click the pencil icon and change the type to "fastqsanger". It's wrong because the quality scores are not re-scaled, but ironically it kind of works and bowtie even returns pretty good results (just to show you how much those quality scores are worth).

Another way (which is also not perfect) is to use the "Quality Format Converter" twice: first from ASCII(64) to numeric, and then from numeric to ASCII(33). This still doesn't re-scale the quality scores, but at least they now have offset 33. As the guy who actually wrote the "Quality Format Converter" tool, let me tell you that the tool's interface is not understandable at all (that's my fault, of course). It is not clear whether the "FASTQ ASCII OFFSET" input parameter refers to the input format or the output format.

What another user tried to do (it seemed very logical from his POV) is to run the tool just ONCE: select a 'fastqsolexa' file (with ASCII 64) as the input dataset, and select ASCII and 33 as the tool's parameters (hoping to convert ASCII-64 to ASCII-33).
This obviously doesn't work (the tool returns an error).

Basically, if you've uploaded a solexa FASTQ file, you're stuck - but it's worse than just being stuck - you don't know why you're stuck or how to fix it.

At least for bowtie, there's a command line argument to accept different scales - so you could build an XML file that adds those command line arguments based on the file type, e.g.:
=================
<command>
bowtie_wrapper.py [ gzillion other options ]
#if $input.ext == "fastqsolexa"
  --solexa-quals
#else
  --phred33-quals
#end if
</command>
================

This will not solve the 33/64 issue, or the 1.3 vs 1.4 quality scores, but might still be better.

Since I don't trust the quality scores at all, and most of the time prefer to discard them, the thing I'd most like to do is to be able to map FASTA (not FASTQ) files. bowtie supports this (with "-f"), and with another "if" in the XML this can be done (I think).

===============
Another issue is selecting the organism. Galaxy supports so many of them that selecting the one you want becomes a bit annoying. The "Upload File" list starts with "--Additional Species are Below --" selected (half-way down the list). I need to scroll up if my organism is one of the common ones, or down if it's an "additional" one.

Another rant about forgotten usability in the Web2.0 era: on every normal operating system, typing the first couple of letters will scroll the list down to a line starting with those letters. But since we start half-way down the list, typing "hu" (for human) gets me to "Human Herpesvirus 1" instead of "human (hg18)".

Right now, the easiest way I found to select 'human' is:
1. move focus to the list box (with TAB, not by clicking the mouse, of course)
2. press the "home" key on the keyboard (if you use a Mac, I don't know what key you need for that...)
3. press "h", then "u" - I get to the "human GRCh37 (hg19)"
4. press "down arrow" once to get to the 'right' human (hg18).
And believe me, all you mice-loving people - my keyboard way is faster ;)

And after spending so much effort on selecting the right organism at upload time, and with most of the other Galaxy tools respecting (and even requiring) the dbkey to be set - having a stand-alone list-box in the Bowtie/BWA tools to select a reference organism database is downright insulting.

Also, the default for BWA is "hg18" and the default for Bowtie is "hg19" - a bit confusing. Note that even the UCSC Genome Browser still has "hg18" as its default (not "hg19").

My greedy suggestion: an edit-box with auto-completion. And when you auto-complete, complete with the short names (e.g. "hg" -> "hg19/hg18/hg17"), with full names (e.g. "hu" -> "human"), and with mid-string names (e.g. "viri" -> "Dro. virilis"), AND expand the genus name too (because I want to be able to write "dro" and get to the Drosophilas, I don't want to write "D. "). Better yet, also add a couple of quick selections (one button per organism) for the most common ones.

=================
Another issue is the index database file used for bowtie and bwa.

I don't know if you've downloaded the pre-built one or built it yourself, but the index file you have contains useless (IMHO) identifiers. Here's a sample of a bowtie mapping result:
gi|51511727|ref|NC_000011.8|NC_000011 125559911 25

In short: WTF? Yes, I learned (after a quick search) that "NC_000011.8" is simply "chr11". But please, in the name of all that is holy, rebuild the indexes with the commonly known chromosome names.

Same goes for BWA, which is only slightly better:
chr17_random.nib:1-2617613 492542 0

Hurrah, my sequence mapped to "chr17_random.nib:1-2617613" ....
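Until the indexes are rebuilt, the two identifier styles quoted above could be mapped back to UCSC-style names after the fact. What follows is only a sketch under stated assumptions: the function name `to_ucsc_name` is made up for illustration, and the RefSeq branch assumes human chromosome accessions NC_000001 to NC_000024 (with 23 and 24 standing for X and Y). Other organisms would need their own lookup table.

```python
import re

# Assumption: human RefSeq chromosome accessions only (NC_000001..NC_000024).
REFSEQ_RE = re.compile(r"\|ref\|NC_0000(\d{2})\.\d+\|")
# Matches names such as "chr17_random.nib:1-2617613".
NIB_RE = re.compile(r"^(chr[^.\s:]+)\.nib:\d+-\d+$")


def to_ucsc_name(ref_name):
    """Return a UCSC-style chromosome name, or the input unchanged."""
    m = NIB_RE.match(ref_name)
    if m:
        # Drop the ".nib:start-end" suffix.
        return m.group(1)
    m = REFSEQ_RE.search(ref_name)
    if m:
        number = int(m.group(1))
        if 1 <= number <= 22:
            return "chr%d" % number
        if number == 23:
            return "chrX"
        if number == 24:
            return "chrY"
    # Unknown style: leave it alone rather than guess.
    return ref_name
```

Applied to the reference-name column of a mapping result, this turns both samples above into names a genome browser understands, and leaves anything it does not recognize untouched.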
This is not just me being spoiled - when I use the "SAM to Interval" tool to get normal mapping results[*], I get:
============
chr17_random.nib:1-2617613 510910 510946 +
chr17_random.nib:1-2617613 512402 512438 +
chr13.nib:1-114142980 45846690 45846726 -
chr17_random.nib:1-2617613 492541 492577 -
===========
see http://main.g2.bx.psu.edu/datasets/841134/display/index

This will never work with the UCSC Genome Browser (or any other sane browser, for that matter), so the interval file is useless.

===========================
Another issue is the accuracy of the BWA/Bowtie results.

Take a look at the following history: http://main.g2.bx.psu.edu/history/imp?id=72116af6496fcec3

It contains just one sequence in fastqsolexa format (dataset 2).

It's a real sequence from a real FASTQ file (except for the machine name and flowcell number, which I've changed). This is the sequence:
ATCGCTTCTCGGCCTTTTGGCNAAGATCAAGTGTAG

Dataset 3 is the conversion to numeric values. Dataset 8 is the conversion to ASCII-33 values.

Dataset 9 is the BWA mapping, just one result (chrom 17) with one mismatch:
FOOBAR_27_FC42AGAAXX:1:1:2:435 0 chr17_random.nib:1-2617613 510911 0 36M

Dataset 10 is the Bowtie mapping, just one result (chrom 9) with one mismatch:
FOOBAR_27_FC42AGAAXX:1:1:2:435 16 gi|89161216|ref|NC_000009.10|NC_000009 18530316 0

As a sanity check, run it through BLAT on the UCSC Genome Browser (with hg18): you'll get 25 (!) results with 1 mismatch, and many more with more mismatches. The blat results include the chr9 hit (from Bowtie) and the chr17 hit (from BWA). There's no way that the quality scores in the input FASTQ affected the mapping so much that BWA and Bowtie can decide to ignore all coordinates except one (and not even the same one).

I've written it before - running bowtie and BWA with the default parameters and without understanding the complexities of those programs is downright misleading (IMHO).
A novice user who just wants to get results fast will get partial results at best (especially when looking for "unique mappers").

The "commonly used" parameters for BWA and Bowtie are not right for everyone (not right for most, even). Offering them as the default without a big flashing red warning is a disservice.

=======
Thanks for reading so far. Regardless, galaxy is still the best thing since sliced bread, many people here use it daily and it makes everyone's life a lot easier - so don't take the above too hard.

-gordon

[*] not sure if it's obvious, but I despise the "SAM" format.
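For the Solexa quality issue in the first section, here is a minimal Python sketch of the re-scaling step that neither workaround performs. It assumes the input uses the old Solexa scale at ASCII offset 64 and converts it to Phred scores at ASCII offset 33, using the standard relation Q_phred = 10 * log10(10^(Q_solexa / 10) + 1). The function names are made up for illustration, and this is not the code of the "Quality Format Converter" or of any other Galaxy tool.

```python
import math

SOLEXA_OFFSET = 64  # ASCII offset of the fastqsolexa quality line
SANGER_OFFSET = 33  # ASCII offset expected by the Sanger-only tools


def solexa_to_phred(q_solexa):
    """Re-scale one Solexa score to the nearest integer Phred score."""
    return int(round(10 * math.log10(10 ** (q_solexa / 10.0) + 1)))


def solexa_qual_to_sanger(qual_line):
    """Convert a Solexa (offset 64) quality string to Sanger (offset 33)."""
    out = []
    for ch in qual_line:
        q_solexa = ord(ch) - SOLEXA_OFFSET
        out.append(chr(solexa_to_phred(q_solexa) + SANGER_OFFSET))
    return "".join(out)
```

High scores barely change (Solexa 40 stays Phred 40), while low ones do (Solexa 0 becomes Phred 3, Solexa -5 becomes Phred 1), which is why simply relabeling the file "kind of works". Files that already hold Phred scores at offset 64 would need only the offset shift, not the re-scaling.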
In short: There will be only two formats from now on: fastq and fastqsanger.

All user input will be required to go through the fastq groomer:
http://bitbucket.org/galaxy/galaxy-central/src/tip/tools/next_gen_conversion...
so only fastqsanger will be "workable" by downstream tools. This will be on test shortly. A number of sequencing providers are pushing for fastqsanger.

----
The issues with the bowtie and bwa indices will be fixed. Speaking of the complexities of the bowtie/bwa programs: you kinda need to know what you're doing, and Galaxy will never be able to help with that. But blat is definitely NOT the way to map reads.

a.

On Oct 6, 2009, at 11:54 PM, Gordon, Assaf wrote:
_______________________________________________
galaxy-dev mailing list
galaxy-dev@bx.psu.edu
http://mail.bx.psu.edu/cgi-bin/mailman/listinfo/galaxy-dev
Anton Nekrutenko http://nekrut.bx.psu.edu http://galaxyproject.org
Hello Anton,

Thank you for your answer.

Anton Nekrutenko wrote, On 10/07/2009 11:11 AM:
> All user input will be required to go through fastq groomer:
> http://bitbucket.org/galaxy/galaxy-central/src/tip/tools/next_gen_conversion...

Looks very useful. I'm looking forward to 'pull' it...

> the issues with bowtie and bwa indices will be fixed, speaking of complexities of bowtie/bwa program = you kinda need to know what you're doing and Galaxy will never be able to help with that,
I most definitely agree - you always need to know what you're doing. But the BWA and Bowtie tools (as they appear now in Galaxy) provide a way to use them without realizing you don't know what you're doing.

I'll just give one example which is critical in our lab: "unique mappers". These are short reads that map only once to the reference genome (within an acceptable alignment-score/mismatches range). If you look for "uniquely mapped" in Google Scholar you'll see many papers that make use of those.

Here's an example of a unique mapper:
AACACCTTTGGGTGGTATGACTGGTTTCCACATGCAAACTGAAGATCGAA
It maps once (to one location) in the human genome (hg18) perfectly, without mismatches. It maps to other locations, but with many more mismatches.

There's no ambiguity about the above sequence, and I'm sure BWA and Bowtie would return the same result.

My issue is with non-unique mappers. BWA and Bowtie (with the "common" parameters) return what they consider the "best" match - but only one result. This gives the false impression that the sequence mapped once (=uniquely) - unless you know a priori that the default parameters will choose just one location and ignore the others.

The naive way of finding unique mappers is to count how many times they appear in a mapping result file - this will not work with BWA & Bowtie and their default parameters.

This is getting philosophical, and I'll try to give it a rest...
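To make the "naive way" concrete, here is a small Python sketch that counts how often each read name appears among the mapped records of a SAM file and reports the names seen exactly once. It assumes tab-separated SAM lines with the read name in column 1 and the FLAG in column 2 (bit 4 meaning "unmapped"). The function names are hypothetical and not part of any Galaxy tool.

```python
from collections import Counter


def count_alignments(sam_lines):
    """Count mapped records per read name in an iterable of SAM lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 4:
            continue  # read is flagged as unmapped
        counts[fields[0]] += 1
    return counts


def naive_unique_mappers(sam_lines):
    """Read names that appear exactly once among the mapped records."""
    counts = count_alignments(sam_lines)
    return sorted(name for name, n in counts.items() if n == 1)
```

When the aligner reports only its single "best" hit per read, every mapped read has a count of 1 here, so this list contains the true unique mappers and the repeats alike. The count is only meaningful if the aligner was asked to report all (or at least several) valid alignments.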
> but blat is definitely NOT the way to map reads.
Not to map millions of reads at once - true (and also not for paired-end reads). But to map one (or a handful) of reads to quickly see where they map - I think it does a very good job.

I have used it many times as a control - after mapping and intersecting and annotating and filtering, you end up with not-too-many sequences that you think are important - load them into the UCSC Genome Browser and see how they look and where they map to. If blat's results are consistent with your other program's results - you're OK. We also sometimes use an exhaustive search program on the whole genome, but that takes a relatively long time.

regards,
-gordon.
Assaf: the bowtie and bwa indices are updated = they now have correct chr ids.

a.

On Oct 7, 2009, at 12:06 PM, Assaf Gordon wrote:
participants (3)
- Anton Nekrutenko
- Assaf Gordon
- Gordon, Assaf