Jing et al.,
Thank you for the offer to write some code to help
advance the metagenomics arena. It is certainly needed.
The problem with megablast and shotgun metagenomics is
well known: without proper understanding and the correct
software, it will yield very misleading and in many
cases incorrect data. For those of us who wish NOT to
move to a protein level of comparison for specific
reasons, we are stuck.
The Problem:
If I megablast 50 million sequences from a HiSeq run,
millions of rRNA sequences will have a 99% match to all
microbes' rRNA GenBank deposits. Not surprising, since
rRNA is highly conserved. The difference between E. coli
and Shigella is 1 to 2 bases across the full 1540 bp 16S.
So 16S is not useful at the Genus level, and certainly
not at the Species level.
So what happens:
The returned matches will have many hits to whatever
model organism is in GenBank. For example, E. coli has
13000 entries for rRNA and Sphaerotilus has 3 entries
for rRNA. If the blasted sequence matches both, the
results will mislead the investigator into thinking they
have 13000 hits to E. coli, EVEN if the microbe is
Sphaerotilus.
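To make the arithmetic concrete, here is a small Python
sketch. It assumes the hits have already been reduced to
(read, organism) pairs; the function name tally_hits and
the input layout are made up for illustration and are not
part of megablast. It only contrasts counting raw hits
with counting distinct reads per organism.

```python
from collections import Counter

def tally_hits(hits):
    """Count raw hits and distinct reads per organism.

    hits: iterable of (read_id, organism) pairs, one pair per
    database entry matched (hypothetical layout, for illustration).
    """
    raw_hits = Counter()
    reads_by_organism = {}
    for read_id, organism in hits:
        raw_hits[organism] += 1
        reads_by_organism.setdefault(organism, set()).add(read_id)
    distinct_reads = {org: len(reads) for org, reads in reads_by_organism.items()}
    return dict(raw_hits), distinct_reads

# One rRNA read that matches every E. coli and Sphaerotilus rRNA entry.
hits = [("read1", "E. coli")] * 13000 + [("read1", "Sphaerotilus")] * 3
raw, distinct = tally_hits(hits)
# raw counts follow database size: 13000 vs 3
# distinct-read counts are 1 vs 1, so the read says nothing either way
```

Counting distinct reads removes the database-size bias,
but it does not tell you which organism the read came
from. That still needs the conserved reads thrown out.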
The cure?:
Is there a way to filter out or remove all such hits?
Let's say, for example, that a result has a first match
(say E. coli) at >99%, a second match (say Pseudomonas)
at >99%, and a third, fourth and fifth match at >99% for
three other organisms. This sequence must be discarded
because it is a conserved sequence. Basically, conserved
sequence is the enemy and invalidates the entire result.
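A rough sketch of that filter in Python, as a starting
point only. It assumes each hit has already been reduced
to (read, organism, percent identity), for example from
tabular BLAST output plus some subject-to-organism lookup
that is not shown here. The function name, the parameter
names, and the default of 2 organisms are my assumptions,
not anything megablast provides.

```python
from collections import defaultdict

def drop_conserved_reads(hits, min_identity=99.0, min_organisms=2):
    """Discard reads that match several organisms at high identity.

    hits: iterable of (read_id, organism, percent_identity) tuples
    (hypothetical layout). A read is called conserved when it has
    hits above min_identity to at least min_organisms distinct
    organisms; both thresholds are assumptions to be tuned.
    """
    hits = list(hits)
    organisms_by_read = defaultdict(set)
    for read_id, organism, pident in hits:
        if pident > min_identity:
            organisms_by_read[read_id].add(organism)
    conserved = {
        read_id
        for read_id, organisms in organisms_by_read.items()
        if len(organisms) >= min_organisms
    }
    # Drop every hit belonging to a conserved read.
    kept = [hit for hit in hits if hit[0] not in conserved]
    return kept, conserved

example_hits = [
    ("readA", "E. coli", 99.6),
    ("readA", "Pseudomonas", 99.4),
    ("readA", "Sphaerotilus", 99.3),
    ("readB", "Sphaerotilus", 99.8),
    ("readB", "E. coli", 91.0),
]
kept, conserved = drop_conserved_reads(example_hits)
# readA is discarded (three organisms above 99%); readB is kept.
```

Whether "organism" should mean genus, species, or
something else, and how many organisms it takes to call a
read conserved, are open choices here.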
Another problem:
If you have a reference sample with 19 non-model
microbes, and you run that by HiSeq shotgun
metagenomics and then megablast, what do you think you
get? If E. coli is not in the reference sample, how many
E. coli hits do you think you get? Yes, tens of
thousands. So without removing conserved sequences, your
data is wrong and you are much better served by culturing
and running a Biolog metabolic panel and comparing to the
sequence result.
So where do we start? I have some shotgun metagenomics
data from the reference sample which included the 19
microbes. That was data from a MiSeq.
Scott
Scott Tighe
Senior Core Laboratory Research Staff
Advanced Genome Technologies Core
University of Vermont
Vermont Cancer Center
149 Beaumont ave
Health Science Research Facility 303/305
Burlington Vermont 05405
802-656-2557
On 9/20/2013 9:17 PM, Jing Yu wrote: