NGS impressions (long)

6 Oct 2009

      Hello,

I'd like to share a couple of first impressions after testing the updated NGS version.

===============
Biggest issue for me is the support for Solexa formats.
(Obviously, I'm biased, and think that Solexa machines are much more common/useful than other types).

Uploading a FASTQ file automatically detects the type (fastqsolexa or fastqsanger).
That's a good start, but there's a catch:
If I'm not aware that there are multiple types of FASTQ files - I'm stuck.

Bowtie & BWA tools don't display my FASTQ file as a possible input dataset.
A novice user will need to notice two things:
1. the small warning on the tools' pages: 
       "BWA accepts files in Sanger FASTQ format."
       "Must have Sanger-scaled quality values with ASCII offset 33 "

2. the fact that the file type is "fastqsolexa" (on the expanded green dataset rectangle).

Even if he does notice, there's no way for him to know what to do...

The easier (and wrong) solution is to click the pencil icon and change the type to "fastqsanger".
It's wrong because the quality scores are not re-scaled, but ironically it kind of works and bowtie even returns pretty good results (just to show you how much those quality scores are worth).

Another way (which is also not perfect) is to use the "Quality Format Converter" twice:
first from ASCII(64) to Numeric, and then from numeric to ASCII(33).
This still doesn't re-scale the quality scores, but at least they are now with offset 33.
As the guy who actually wrote the "Quality Format Converter" tool, let me tell you that the tool's interface is not understandable at all (that's my fault, of course).
It is not clear if the "FASTQ ASCII OFFSET" input parameter refers to the input format or output format.
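For reference, a single-step conversion that both shifts the offset and re-scales the scores could look roughly like the sketch below. This is not what the "Quality Format Converter" currently does. It assumes the input really is Solexa-scaled with ASCII offset 64 (newer Illumina pipelines use Phred-scaled scores with offset 64, which only need the offset shift), and the function names are made up for illustration.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert one Solexa-scaled score to the nearest Phred-scaled score.

    Uses the standard relation Q_phred = 10 * log10(10^(Q_solexa / 10) + 1).
    """
    return int(round(10 * math.log10(10 ** (q_solexa / 10.0) + 1)))


def solexa64_to_sanger33(quality_line):
    """Re-encode one FASTQ quality line: Solexa scale / offset 64 in,
    Phred scale / offset 33 (Sanger) out."""
    return "".join(
        chr(solexa_to_phred(ord(char) - 64) + 33) for char in quality_line
    )


if __name__ == "__main__":
    # 'h' encodes Solexa score 40, '@' encodes Solexa score 0.
    print(solexa64_to_sanger33("h@"))
```

High scores are almost unchanged by the re-scaling; the two scales only differ noticeably for low-quality bases, which is probably why simply relabelling the file as "fastqsanger" appears to work.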

What another user tried to do (seemed very logical from his POV), is to run the tool just ONCE:
Select a 'fastqsolexa' file (with ASCII 64) as input dataset, and select ASCII and 33 as the tool's parameters (hoping to convert ASCII-64 to ASCII-33).
This obviously doesn't work (the tool returns an error).

Basically, if you've uploaded a solexa FASTQ file, you're stuck - but it's worse than just being stuck - you don't know why you're stuck or how to fix it.

At least for bowtie, there's a command line argument to accept different scales - so you could build an XML file that adds those command line arguments based on the file type, e.g.:
=================
<command>
  bowtie_wrapper.py
  [ gzillion other options ]
#if $input.ext == "fastqsolexa"
	--solexa-quals
#else
	--phred33-quals
#endif
</command>
================

This will not solve the 33/64 issue, or the 1.3 vs 1.4 quality scores, but might still be better.

Since I don't trust the quality scores at all, and most of the time prefer to discard them,
the thing I'd most like to do is be able to map FASTA (not FASTQ) files.
bowtie supports this (with "-f"), and with another "if" in the XML this can be done (I think).
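The same decision could also live in the wrapper script instead of the XML. The sketch below maps a Galaxy datatype extension to bowtie input flags. The function name is hypothetical, and which datatypes the tool should accept is my assumption, not something the current wrapper does.

```python
def bowtie_input_flags(ext):
    """Return the bowtie flags describing the input format for a Galaxy
    datatype extension (hypothetical helper, not part of the wrapper)."""
    flags_by_ext = {
        # FASTA input: no quality scores at all.
        "fasta": ["-f"],
        # FASTQ with Phred-scaled qualities, ASCII offset 33.
        "fastqsanger": ["-q", "--phred33-quals"],
        # FASTQ with Solexa-scaled qualities, ASCII offset 64.
        "fastqsolexa": ["-q", "--solexa-quals"],
    }
    if ext not in flags_by_ext:
        raise ValueError("unsupported input datatype: %s" % ext)
    return flags_by_ext[ext]


if __name__ == "__main__":
    for ext in ("fasta", "fastqsanger", "fastqsolexa"):
        print(ext, " ".join(bowtie_input_flags(ext)))
```

Raising an error for unknown types is deliberate: silently falling back to one quality scale is exactly the kind of behaviour that leaves the user guessing.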

===============
Another issue is selecting the organism.
Galaxy supports so many of them, that selecting the one you want becomes a bit annoying.
The "Upload File" starts with "--Additional Species are Below --" selected (half-way down the list).
I need to scroll up if my organism is one of the common ones, or down if it's an "additional".

Another rant about forgotten usability in the Web2.0 era:
on every normal operating system, typing the first couple of letters will scroll the list down to a line starting with those letters.
But since we start half-way down the list, typing "hu" (for human) gets me to "Human Herpesvirus 1" instead of "human (hg18)".

Right now, the easiest way I found to select 'human' is:
 1. move focus to the list box (with TAB, not by clicking the mouse, of course)
 2. click the "home" key on the keyboard (if you use a Mac, I don't know what key you need for that...)
 3. click "h", then "u" - I get to the "human GRCh37 (hg19)"
 4. click "down arrow" once to get to the 'right' human (hg18).

And believe me, all you mice-loving people - my keyboard way is faster ;)

And after spending so much effort on selecting the right organism at upload time,
and with most of the other Galaxy tools respecting (and even requiring) the dbkey to be set -
Having a stand-alone list-box in the Bowtie/BWA tools to select a reference organism database is downright insulting.

Also, the default for BWA is "hg18" and the default for Bowtie is "hg19" - a bit confusing.
Note that even the UCSC Genome Browser still has "hg18" as default (not "hg19").

My greedy suggestion: 
An edit-box with auto completion.
And when you auto-complete, complete both with the short names (e.g. "hg" -> "hg19/hg18/hg17") and with full names (e.g. "hu" -> "human") and with mid-string names (e.g. "viri" -> "Dro. virilis")
AND expand the genus name too (because I want to be able to write "dro" and get to the Drosophilas, I don't want to write "D. ").
Better yet, also add a couple of quick selections (one button per organism) for the most common ones.
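The matching rules I have in mind could be sketched like this. The genome list is a made-up sample (the "droVir" key in particular is only illustrative), and the ranking order (short-name prefix first, then word prefix, then mid-string) is just one reasonable choice. It assumes the full genus name is stored in the description, so that "dro" works without typing "D. ".

```python
# Illustrative sample only; a real list would come from Galaxy's build table.
SAMPLE_GENOMES = [
    ("hg18", "human (hg18)"),
    ("hg19", "human GRCh37 (hg19)"),
    ("droVir", "Drosophila virilis"),
]


def suggest(query, genomes):
    """Return dbkeys matching the query, best matches first.

    Rank 0: the short name (dbkey) starts with the query.
    Rank 1: some word of the full name starts with the query.
    Rank 2: the query appears anywhere inside the full name.
    """
    q = query.strip().lower()
    if not q:
        return []
    ranked = []
    for index, (dbkey, name) in enumerate(genomes):
        key, label = dbkey.lower(), name.lower()
        if key.startswith(q):
            rank = 0
        elif any(word.startswith(q) for word in label.split()):
            rank = 1
        elif q in label:
            rank = 2
        else:
            continue
        # Keep the original list order among equally ranked entries.
        ranked.append((rank, index, dbkey))
    return [dbkey for _, _, dbkey in sorted(ranked)]


if __name__ == "__main__":
    for text in ("hg", "hu", "viri", "dro"):
        print(text, "->", suggest(text, SAMPLE_GENOMES))
```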

=================
Another issue is the index database file used for bowtie and bwa.

I don't know if you've downloaded the pre-built one or built it yourself,
but the index file you have contains useless (IMHO) identifiers.
Here's a sample of bowtie mapping result:
   gi|51511727|ref|NC_000011.8|NC_000011	125559911	25

In short: WTF?
Yes, I learned (after a quick search) that "NC_000011.8" is simply "chr11".
But please, in the name of all that is holy, rebuild the indexes with the commonly known chromosome names.

Same goes for BWA, which is only slightly better:
   chr17_random.nib:1-2617613	492542	0

Hurrah, my sequence mapped to "chr17_random.nib:1-2617613" ....

This is not just me being spoiled - when I use the "SAM to Interval" tool to get normal mapping results[*], I get:
============
chr17_random.nib:1-2617613	510910	510946	+
chr17_random.nib:1-2617613	512402	512438	+
chr13.nib:1-114142980	45846690	45846726	-
chr17_random.nib:1-2617613	492541	492577	-
===========
see http://main.g2.bx.psu.edu/datasets/841134/display/index

This will never work with the UCSC Genome browser (or any other sane browser, for that matter), so the interval file is useless.
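Until the indexes are rebuilt, a post-processing step could translate the names. The sketch below handles only the two patterns shown above, and only for human: it assumes RefSeq accessions NC_000001 to NC_000022 are chr1 to chr22, with NC_000023 and NC_000024 being chrX and chrY. The function name is made up, and anything it does not recognise is returned unchanged.

```python
import re

# e.g. "gi|51511727|ref|NC_000011.8|NC_000011" (name in the bowtie index)
_REFSEQ_HUMAN = re.compile(r"NC_0000(\d{2})\.\d+")
# e.g. "chr17_random.nib:1-2617613" (name in the BWA index)
_NIB_RANGE = re.compile(r"^(chr[^.\s:]+)\.nib:\d+-\d+$")


def to_ucsc_name(ref_name):
    """Translate an index reference name to a UCSC-style chromosome name."""
    match = _NIB_RANGE.match(ref_name)
    if match:
        return match.group(1)
    match = _REFSEQ_HUMAN.search(ref_name)
    if match:
        number = int(match.group(1))
        if 1 <= number <= 22:
            return "chr%d" % number
        if number == 23:
            return "chrX"
        if number == 24:
            return "chrY"
    # Unknown pattern: leave it alone rather than guess.
    return ref_name


if __name__ == "__main__":
    for name in ("gi|51511727|ref|NC_000011.8|NC_000011",
                 "chr17_random.nib:1-2617613"):
        print(name, "->", to_ucsc_name(name))
```

Rebuilding the indexes with proper names is still the better fix, since it spares every downstream tool from having to know about this translation.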

===========================
Another issue is the accuracy of BWA/Bowtie results.

Take a look at the following history:
http://main.g2.bx.psu.edu/history/imp?id=72116af6496fcec3

It contains just one sequence in fastqsolexa format (dataset 2).

It's a real sequence from a real FASTQ file (except for the machine name and flowcell number, which I've changed).
This is the sequence: ATCGCTTCTCGGCCTTTTGGCNAAGATCAAGTGTAG

Dataset 3 is conversion to numeric values.
Dataset 8 is conversion to ASCII-33 values.

Dataset 9 is BWA mapping, just one result (chrom 17) with one mismatch:
FOOBAR_27_FC42AGAAXX:1:1:2:435	0	chr17_random.nib:1-2617613	510911	0	36M

Dataset 10 is Bowtie mapping, just one result (chrom 9) with one mismatch:
FOOBAR_27_FC42AGAAXX:1:1:2:435	16	gi|89161216|ref|NC_000009.10|NC_000009	18530316	0

As a sanity check, run it through BLAT on the UCSC Genome browser (with hg18):
You'll get 25 (!) results with 1 mismatch, and many more with more mismatches.
The BLAT results include the chr17 hit (from BWA) and the chr9 hit (from Bowtie).
There's no way that the quality scores in the input FASTQ affected the mapping so much that BWA and Bowtie can decide to ignore all coordinates except one (and not even the same one).

I've written it before - running bowtie and BWA with the default parameters and without understanding the complexities of those programs is downright misleading (IMHO).

A novice user who just wants to get results fast will get partial results at best (especially when looking for "unique mappers").

The "commonly used" parameters for BWA and Bowtie are not right for everyone (not right for most, even).
Offering them as the default without a big flashing red warning is a disservice.
=======

Thanks for reading so far.
Regardless, Galaxy is still the best thing since sliced bread, many people here use it daily and it makes everyone's life a lot easier - so don't take the above too hard.
  -gordon

[*] not sure if it's obvious, but I despise the "SAM" format.

Gordon, Assaf
