1. Remove rDNA sequences (and/or other well-known, highly conserved
sequences) to reduce the workload in step 2.
2. BLAST, then remove sequences with a > (say) 99% match to > (say) 5
genera. (Optional if step 1 is already good enough.)
For step 1:
Build a FASTA file of the chosen highly conserved sequences, and
use it as the query set to BLAST against your MiSeq reads.
Remove the reads that give positive hits.
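
A minimal Python sketch of the removal part of step 1, under some assumptions: the BLAST run was saved as tab-separated text with the read ID in the second column (the reads being the BLAST subject), and the reads are in a plain FASTA file. The function names (`read_hit_ids`, `parse_fasta`, `remove_conserved`) and the example IDs are made up for illustration.

```python
import io

def read_hit_ids(blast_tab, id_column=1):
    """Collect the IDs found in one column of tab-separated BLAST output.

    id_column=1 assumes the reads were the BLAST subject (second column).
    """
    ids = set()
    for line in blast_tab:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        ids.add(line.split("\t")[id_column])
    return ids

def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def remove_conserved(fasta_handle, blast_handle):
    """Return the FASTA records whose read ID had no hit."""
    drop = read_hit_ids(blast_handle)
    kept = []
    for header, seq in parse_fasta(fasta_handle):
        words = header.split()
        read_id = words[0] if words else ""
        if read_id not in drop:
            kept.append((header, seq))
    return kept

# Tiny made-up example: read r2 matched a conserved sequence.
reads = io.StringIO(">r1\nACGT\n>r2\nGGCC\n")
hits = io.StringIO("rDNA_1\tr2\t99.2\n")
print(remove_conserved(reads, hits))
```

If the reads were used as the query instead, `id_column=0` would be the column to read.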
For step 2:
BLAST the remaining MiSeq sequences against the NCBI (or whatever)
database. Remove a sequence if it hits more than n genera.
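
The step 2 rule could be sketched in Python roughly as follows. This assumes each hit has already been reduced to three tab-separated values (read ID, percent identity, genus of the subject); how the genus is looked up for each hit is outside the sketch, and the names `parse_hits` and `promiscuous_reads` are hypothetical.

```python
from collections import defaultdict

def parse_hits(lines):
    """Yield (read_id, percent_identity, genus) from tab-separated lines.

    The three-column layout is an assumption of this sketch.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        read_id, pident, genus = line.split("\t")[:3]
        yield read_id, float(pident), genus

def promiscuous_reads(hits, min_identity=99.0, max_genera=5):
    """Return the IDs of reads that match more than max_genera genera
    at more than min_identity percent identity."""
    genera = defaultdict(set)
    for read_id, pident, genus in hits:
        if pident > min_identity:
            genera[read_id].add(genus)
    return {read_id for read_id, seen in genera.items()
            if len(seen) > max_genera}
```

The reads returned by `promiscuous_reads` are the ones to remove; the defaults simply mirror the "say 99%" and "say 5" figures above and are meant to be tuned.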
Jing
On 24 Sep 2013, at 22:17, Scott Tighe <scott.tighe@uvm.edu> wrote:
Jing et al
Thank you for the offer to write some code to help advance the
metagenomics arena. It is certainly needed.
The problem with megablast and shotgun metagenomics is well known:
without proper understanding and the correct software, it will
yield very misleading and in many cases incorrect data. Those of us
who, for specific reasons, wish NOT to move to a protein-level
comparison are stuck.
*The Problem:*
If I megablast 50 million sequences from a HiSeq run, millions of
rRNA sequences will have a 99% match to the rRNA GenBank deposits
of all microbes. Not surprising, since rRNA is highly conserved. The
difference between E. coli and Shigella is 1 to 2 bases over the full
1540 bp 16S. So 16S is not useful at genus level, and certainly
not at species level.
*So what happens:*
The returned matches will have many hits to whatever model organism
is in GenBank. For example, E. coli has 13000 entries for rRNA and
Sphaerotilus has 3 entries for rRNA. If the blasted sequence
matches both, the results will mislead the investigator to think
they have 13000 hits to E. coli, EVEN if the microbe is Sphaerotilus.
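
To make the database-size effect concrete, here is a small Python illustration with synthetic data built from the entry counts quoted above (13000 and 3). It compares counting every hit row with counting each read once per genus. The function names and the read ID are hypothetical, and this is only a sketch of the counting issue, not a proposed fix.

```python
from collections import Counter

def raw_hit_counts(hits):
    """Count every hit row per genus (what a naive tally reports)."""
    return Counter(genus for _read, genus in hits)

def reads_per_genus(hits):
    """Count each read at most once per genus it matches."""
    return Counter(genus for _read, genus in set(hits))

# Synthetic example: ONE read matching every rRNA entry of both genera.
hits = ([("read1", "Escherichia")] * 13000
        + [("read1", "Sphaerotilus")] * 3)

print(raw_hit_counts(hits))   # dominated by the number of database entries
print(reads_per_genus(hits))  # the single read counts once for each genus
```

Even after collapsing, the read still matches both genera equally, so it says nothing about which organism is actually present.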
*The cure?:*
Is there a way to filter/remove all such hits? Let's say, for
example, that a result has a first match (say E. coli) at >99%, a
second match (say Pseudomonas) at >99%, and third, fourth and
fifth matches at >99% to three other organisms. This sequence _must_
be discarded because it is a conserved sequence.
Basically, conserved sequence is the enemy and invalidates the
entire result.
*Another problem:*
If you have a reference sample with 19 non-model microbes, and you
run that by HiSeq Shotgun for metagenomics and then megablast, what
do you think you get? If E. coli is not in the reference sample,
how many hits to it do you think you get? Yes, tens of thousands. So
without removing conserved sequences, your data is wrong and you
are much better served by culturing and running a Biolog metabolic
panel and comparing to the sequence result.
So where do we start? I have some shotgun metagenomics data from
the reference sample which included the 19 microbes. That was data
from a MiSeq.
Scott
Scott Tighe
Senior Core Laboratory Research Staff
Advanced Genome Technologies Core
University of Vermont
Vermont Cancer Center
149 Beaumont ave
Health Science Research Facility 303/305
Burlington Vermont 05405
802-656-2557
On 9/20/2013 9:17 PM, Jing Yu wrote:
Hi Scott,
I can do some Perl programming, such as local/remote BLASTing. Can
you specify your problem a little more clearly, so that maybe I can
write a program to do just that?
Regards,
Jing
Gerald
16S is basically useless for identification to genus. Since I
started sequencing 16S in 1992, I have come to realize that
without sequencing the full 1540 bases, it is generally
misleading, and even then, it is not accurate enough to nail the
genus in more than half the cases. However, what is your feeling on
ITS and gyrase? They seem to be far more discriminating, but those
databases were decommissioned some time ago.
The desirable thing would be for Galaxy or NCBI to add a "filter
conserved genes" option [i.e. any hit with a second choice greater
than 3% distance]. Something like that.
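
One possible reading of such a filter, sketched in Python: keep a read only when its best hit from a different genus is more than 3 percentage points of identity below its top hit. Both this interpretation of "3% distance" and the name `is_discriminating` are assumptions, and the input is taken to be a list of (genus, percent identity) pairs for a single read.

```python
def is_discriminating(hits, min_gap=3.0):
    """Return True if the top hit is separated from every other genus.

    hits: list of (genus, percent_identity) pairs for ONE read.
    min_gap: required identity gap, in percentage points, between the
             best hit and the best hit to any other genus.
    """
    if not hits:
        return False  # nothing matched, so nothing to assign
    best_genus, best_pident = max(hits, key=lambda h: h[1])
    others = [pident for genus, pident in hits if genus != best_genus]
    if not others:
        return True  # every hit falls in the same genus
    return best_pident - max(others) > min_gap

# Made-up identities: a conserved read and a more informative one.
print(is_discriminating([("Escherichia", 99.8), ("Shigella", 99.6)]))
print(is_discriminating([("Sphaerotilus", 99.0), ("Leptothrix", 94.0)]))
```

Reads for which this returns False would be treated as conserved and dropped before any counting.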
If you (or others) are aware of such a thing, I'd love to hear
about it.
Sincerely
Scott