Jing et al.,
Thank you for the offer to write some code to help advance the
metagenomics arena. It is certainly needed.
The problem with megablast and shotgun metagenomics is well
known: without proper understanding and the correct software,
the approach yields very misleading and, in many cases,
incorrect data. Those of us who, for specific reasons, do NOT
wish to move to a protein-level comparison are stuck.
The Problem:
If I megablast 50 million sequences from a HiSeq run, millions
of rRNA sequences will have a 99% match to the rRNA GenBank
deposits of all microbes. This is not surprising, since rRNA is
highly conserved. The difference between E. coli and Shigella is
1 to 2 bases across the full 1540 bp 16S. So 16S is not useful
at the genus level, and certainly not at the species level.
So what happens:
The returned matches will include many hits to whatever model
organism is in GenBank. For example, E. coli has 13000 rRNA
entries and Sphaerotilus has 3. If the blasted sequence matches
both, the results will mislead the investigator into thinking
they have 13000 hits to E. coli, EVEN if the microbe is
Sphaerotilus.
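
To make the database-size bias concrete, here is a minimal Python
sketch. It assumes the megablast results have already been reduced
to (read ID, organism) pairs; that representation and the function
names are hypothetical, chosen only for illustration. It compares
raw hit counts with counts in which each read is counted once per
organism.

```python
from collections import Counter

def raw_hit_counts(hits):
    """Count every database hit, so heavily deposited organisms dominate."""
    return Counter(organism for _, organism in hits)

def read_counts(hits):
    """Count each read at most once per organism it matches."""
    unique_pairs = {(read_id, organism) for read_id, organism in hits}
    return Counter(organism for _, organism in unique_pairs)

# One conserved rRNA read that matches every deposit of both organisms,
# using the entry counts quoted above (13000 and 3).
hits = ([("read1", "Escherichia coli")] * 13000
        + [("read1", "Sphaerotilus")] * 3)

raw = raw_hit_counts(hits)
per_read = read_counts(hits)
print(raw["Escherichia coli"], raw["Sphaerotilus"])            # 13000 3
print(per_read["Escherichia coli"], per_read["Sphaerotilus"])  # 1 1
```

Counting per read removes the skew from the number of deposits, but
it still cannot say which organism the read came from.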
The cure?:
What if there were a way to filter out/remove all such hits?
Let's say, for example, that a result has a first match (say
E. coli) at >99%, a second match (say Pseudomonas) at >99%, and
third, fourth, and fifth matches at >99% to three other
organisms. This sequence must be discarded because it is a
conserved sequence.
Basically, conserved sequence is the enemy and invalidates the
entire result.
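
One possible shape for such a filter, as a rough Python sketch
rather than a finished tool: it assumes each hit has already been
parsed into a (read ID, organism, percent identity) tuple, for
example from tabular BLAST output plus some accession-to-organism
lookup that is not shown here. The function names, the 99.0 cutoff,
and the "more than one organism" rule are all assumptions to be
tuned.

```python
from collections import defaultdict

def find_conserved_reads(hits, min_identity=99.0, max_organisms=1):
    """Return IDs of reads that match more than `max_organisms` distinct
    organisms at or above `min_identity` percent identity."""
    organisms_per_read = defaultdict(set)
    for read_id, organism, pident in hits:
        if pident >= min_identity:
            organisms_per_read[read_id].add(organism)
    return {read_id for read_id, orgs in organisms_per_read.items()
            if len(orgs) > max_organisms}

def drop_conserved(hits, min_identity=99.0, max_organisms=1):
    """Discard every hit that belongs to a read flagged as conserved."""
    hits = list(hits)
    conserved = find_conserved_reads(hits, min_identity, max_organisms)
    return [hit for hit in hits if hit[0] not in conserved]

example_hits = [
    ("r1", "Escherichia", 99.5),   # r1 matches two organisms at >=99%
    ("r1", "Pseudomonas", 99.2),
    ("r2", "Sphaerotilus", 99.8),  # r2 matches a single organism
    ("r2", "Sphaerotilus", 99.9),
    ("r3", "Escherichia", 99.6),   # r3: the second hit is below the cutoff
    ("r3", "Bacillus", 90.0),
]

print(find_conserved_reads(example_hits))  # {'r1'}
print(len(drop_conserved(example_hits)))   # 4
```

Whether "organism" should mean genus or species, and how many
organisms a read may match before it is thrown out, are exactly the
choices that would need testing against the reference sample.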
Another problem:
If you have a reference sample with 19 non-model microbes, run
it through HiSeq shotgun metagenomics, and then megablast the
reads, what do you think you get? If E. coli is not in the
reference sample, how many hits to it do you think you get? Yes,
tens of thousands. So without removing conserved sequences,
your data is wrong, and you are much better served by culturing
and running a Biolog metabolic panel and comparing that to the
sequence result.
So where do we start? I have some shotgun metagenomics data from
the reference sample which included the 19 microbes. That was
data from a MiSeq.
Scott
Scott Tighe
Senior Core Laboratory Research Staff
Advanced Genome Technologies Core
University of Vermont
Vermont Cancer Center
149 Beaumont Ave
Health Science Research Facility 303/305
Burlington Vermont 05405
802-656-2557
On 9/20/2013 9:17 PM, Jing Yu wrote: