Re: [galaxy-user] Metagenomic filtering

24 Sep 2013

Jing et al,

Thank you for the offer to write some code to help advance the 
metagenomics arena. It is certainly needed.

The problem with megablast and shotgun metagenomics is well known: 
without proper understanding and the correct software, it will yield 
very misleading and in many cases incorrect data. Those of us who wish 
NOT to move to a protein-level comparison, for specific reasons, are 
stuck.

*The Problem:*

If I megablast 50 million sequences from a HiSeq run, millions of rRNA 
sequences will have a 99% match to all microbes' rRNA GenBank deposits. 
Not surprising, since rRNA is highly conserved. The difference 
between E. coli and Shigella is 1 to 2 bases over the full 1540 bp 16S. 
So 16S is not useful at the genus level, and certainly not at the 
species level.

*So what happens:*

The returned matches will have many hits to whatever model organism is 
in GenBank. For example, E. coli has 13000 entries for rRNA and 
Sphaerotilus has 3 entries for rRNA. If the blasted sequence matches 
both, the results will mislead the investigator into thinking they have 
13000 hits to E. coli, EVEN if the microbe is Sphaerotilus.
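
To make the inflation concrete, here is a minimal Python sketch that 
counts raw hits per organism next to distinct reads per organism. The 
function name and the (read_id, organism) input shape are assumptions 
for illustration, and the toy data is made up, not real output.

```python
from collections import defaultdict

def count_hits_and_reads(hits):
    """Count raw hits and distinct reads per organism.

    `hits` is an iterable of (read_id, organism) pairs, one per BLAST hit.
    This input shape is an assumption made for the sketch.
    """
    hit_counts = defaultdict(int)
    reads_by_organism = defaultdict(set)
    for read_id, organism in hits:
        hit_counts[organism] += 1
        reads_by_organism[organism].add(read_id)
    read_counts = {org: len(reads) for org, reads in reads_by_organism.items()}
    return dict(hit_counts), read_counts

# Toy data: a single read matching 5 E. coli entries and 1 Sphaerotilus entry.
hits = [("read1", "Escherichia coli")] * 5 + [("read1", "Sphaerotilus")]
hit_counts, read_counts = count_hits_and_reads(hits)
print(hit_counts)   # raw hits favour the heavily deposited organism
print(read_counts)  # per read, both organisms are tied
```

Counting per read removes the bias from the number of database entries, 
but it does not say which organism the read came from.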

*The cure?:*

Is there a way to filter out/remove all such hits? Let's say, for 
example, that a result has a first match (say E. coli) at >99%, a second 
match (say Pseudomonas) at >99%, and a third, fourth and fifth match at 
>99% for three other organisms. This sequence _must_ be discarded 
because it is a conserved sequence.

Basically conserved sequence is the enemy and invalidates the entire 
result.
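
As a starting point, the filter described above might look like the 
Python sketch below. It assumes each hit is already reduced to a 
(read_id, organism, percent_identity) tuple; standard BLAST tabular 
output gives the query ID, subject ID and percent identity, so the 
subject-to-organism lookup is assumed to happen beforehand. The function 
names and the "more than one organism" cutoff are my assumptions, and 
the toy data is made up.

```python
from collections import defaultdict

def find_conserved_reads(hits, min_identity=99.0, max_organisms=1):
    """Return IDs of reads that match too many organisms at high identity.

    A read is flagged when it has hits above `min_identity` percent to
    more than `max_organisms` distinct organisms (both are assumptions).
    """
    organisms_per_read = defaultdict(set)
    for read_id, organism, pident in hits:
        if float(pident) > min_identity:
            organisms_per_read[read_id].add(organism)
    return {
        read_id
        for read_id, organisms in organisms_per_read.items()
        if len(organisms) > max_organisms
    }

def drop_conserved_reads(hits, min_identity=99.0, max_organisms=1):
    """Remove every hit that belongs to a flagged (conserved) read."""
    hits = list(hits)
    conserved = find_conserved_reads(hits, min_identity, max_organisms)
    return [hit for hit in hits if hit[0] not in conserved]

# Toy data: read_A matches three organisms at >99%, read_B only one.
toy_hits = [
    ("read_A", "Escherichia coli", 99.8),
    ("read_A", "Pseudomonas", 99.5),
    ("read_A", "Shigella", 99.4),
    ("read_B", "Sphaerotilus", 99.6),
    ("read_B", "Escherichia coli", 92.0),
]
kept = drop_conserved_reads(toy_hits)
print(kept)  # only the read_B hits remain
```

Grouping by organism rather than by database entry means 13000 E. coli 
deposits count once, so a read is only thrown out when genuinely 
different organisms match it equally well.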

*Another problem:*

If you have a reference sample with 19 non-model microbes, and you run 
that by HiSeq shotgun for metagenomics and then megablast, what do you 
think you get? If E. coli is not in the reference sample, how many 
E. coli hits do you think you get? Yes, tens of thousands. So without 
removing conserved sequences, your data is wrong and you are much better 
served by culturing, running a Biolog metabolic panel, and comparing 
that to the sequence result.

So where do we start? I have some shotgun metagenomics data from the 
reference sample which included the 19 microbes. That was data from a MiSeq.

Scott

Scott Tighe
Senior Core Laboratory Research Staff
Advanced Genome Technologies Core
University of Vermont
Vermont Cancer Center
149 Beaumont ave
Health Science Research Facility 303/305
Burlington Vermont 05405
802-656-2557

On 9/20/2013 9:17 PM, Jing Yu wrote:
...
Hi Scott,
I can do some Perl programming, such as local/remote blasting. Can you 
specify your problem a little more clearly, so that maybe I can write a 
program to do just that?
Regards,
Jing

Gerald,

16S is basically useless for identification to genus. Since I started 
sequencing 16S in 1992, I have come to realize that without sequencing 
the full 1540 bases, it is generally misleading, and even then, it is 
not accurate enough to nail the genus in more than half the cases.

However, what is your feeling on ITS and gyrase? They seem to be far 
more discriminating, but those databases were decommissioned some 
time ago.

The desirable thing would be for Galaxy or NCBI to add a "filter 
conserved genes" option [i.e. any hit with a second choice greater than 
3% distance]. Something such as that.

If you (or others) are aware of such a thing, I'd love to hear about it.

Sincerely
Scott