Re: [galaxy-user] MegaBLAST output

24 Apr 2012

Hi Sarah,

We appreciate all of the information you have provided and have been 
working here since yesterday to investigate the issue in more detail. 
This includes incorporating the additional data both you and Peter have 
been posting.

We don't have anything conclusive to report yet, but it would have been 
considerate to send an update this morning to let you know what we were 
doing. Please accept my apologies for not doing so. We are in fact in
complete agreement that, as the data currently stand, something odd
appears to be going on. GenBank updates would be unrelated, as gi
numbers do not change over time (they can be retired, but that is
not related to this case). The open issue to be addressed is the
mismatch between gi and reported length in the wrapped Megablast
output.

A reply will be sent as soon as the root cause is determined. If there 
is indeed a problem, this would of course be considered a priority to 
correct. Not that we are expecting delays, but if your analysis is very 
urgent, using the BLAST+ BLASTN megablast wrapper that Peter authored, 
in a local or cloud instance, would be the best immediate remedy (this 
version has the standard 12 column output). Sequence length data could 
always be obtained from GenBank and added into these results using other 
Galaxy tools (column join, etc.).
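
As a rough sketch of that last step done outside Galaxy: the function
below assumes standard 12-column tabular rows with the subject ID in
column 2, plus a lookup of subject lengths that has already been
obtained from GenBank. The function name and the lookup are
hypothetical and are not part of any Galaxy tool.

```python
def add_subject_length(rows, lengths, missing="NA"):
    """Insert a subject-length column after column 2 of 12-column rows.

    rows    -- iterable of tab-separated lines (standard 12-column output)
    lengths -- dict mapping subject ID (str) to sequence length (int),
               assumed to have been fetched from GenBank beforehand
    missing -- placeholder written when a subject ID is not in the lookup
    """
    out = []
    for line in rows:
        fields = line.rstrip("\n").split("\t")
        subject_id = fields[1]
        length = lengths.get(subject_id, missing)
        # Keep query and subject IDs, then the length, then the rest.
        out.append("\t".join(fields[:2] + [str(length)] + fields[2:]))
    return out
```

Subject IDs missing from the lookup are marked rather than dropped, so
the row count stays the same as in the input.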

Thank you and Peter both for the help and for your patience!

Best,

Jen
Galaxy team

On 4/24/12 1:50 PM, Sarah Hicks wrote:
...
So, Jen, I'm not sure if we're talking about the same ID change... I
am under the impression that GenBank does not change its GI numbers
for its entries. Plus, it's now looking like all sequence length
output info for each hit through Galaxy's megablast does not match to
the GI number output given by Galaxy megablast, but to the GI number
before it. Because the "-1" rule is so consistent, it makes this seem
less and less like it has to do with NCBI changing its GI numbers to
make room for new entries or something. In other words, there is a
shift, as if a 1 was added to each NCBI GI number in Galaxy before
Galaxy produces the output file.
I need someone to tell me if I can trust the output. Basically I see
it this way. Every hit row from Galaxy megablast actually has
information for two NCBI entries: the one that shares the GI output
and the one before it that shares the sequence length output. Which
one is the hit I should be using?
Because on some occasions, the NCBI entry that shares the GI output
from galaxy is VERY distantly related to the NCBI entry that shares
the subject sequence length output from galaxy, and I don't know which
to pick. Is this problem well understood yet?
On Tue, Apr 24, 2012 at 10:52 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
...
On Mon, Apr 23, 2012 at 11:41 PM, Sarah Hicks <garlicscape@gmail.com> wrote:
...
Peter, you requested an example, here are the first five hits for my
first query sequence (OTU#0)
0       324034994       527     93.23   266     13      5       1       265     22      283     7e-102  379.0
0       56181650        513     93.26   267     10      8       1       265     25      285     7e-102  379.0
0       314913953       582     91.79   268     13      9       1       265     24      285     2e-92   347.0
0       305670062       281     92.52   254     14      5       4       256     32      281     2e-92   347.0
0       310814066       1180    91.73   266     14      7       1       265     24      282     9e-92   345.0
You will notice there are 13 columns, one in addition to the 12 column
titles you explained. This is because there is a column between sseqID
and pident.
I see now - the megablast_wrapper.py calls megablast (from the old legacy
NCBI blast suite) which does indeed produce 12 column tabular output. But
the wrapper script then edits the output:
It appears to split column 2 in two at the underscore, which is intended
to give the match ID and the length. This puzzles me, but I haven't used
the legacy BLAST tabular output for a while. On BLAST+ you can ask
for the query or subject length explicitly as their own columns so we
don't have this problem.
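
To make the suspected step concrete, here is a minimal sketch of that
kind of split. It is not the actual megablast_wrapper.py code. It
assumes the subject column has the form ID_length, and the choice to
split at the last underscore is my assumption; the function name is
made up for illustration.

```python
def split_subject_column(line):
    """Turn a 12-column row into 13 columns by splitting column 2.

    Assumes column 2 looks like "<id>_<length>". Splitting at the last
    underscore is an assumption, not confirmed wrapper behaviour.
    """
    fields = line.rstrip("\n").split("\t")
    subject_id, sep, length = fields[1].rpartition("_")
    if not sep:
        # No underscore found: leave the row untouched.
        return fields
    return fields[:1] + [subject_id, length] + fields[2:]
```

Under that assumption, a subject column of 324034994_527 would give the
two columns 324034994 and 527 seen in the first example row.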
The megablast_wrapper.py also re-formats the floating point score in the
last column; apparently the NCBI style could cause problems with the
Galaxy filter tool.
...
In the metagenomic tutorial the first 4 columns are
explained, and column 3 is described as length of sequence in database
(or length of the subject sequence).
This is the problem column. The length of only one of the subject GI
numbers above matches the subject length in NCBI. This has caused me to
wonder if I can trust the hit info. In all cases that I've checked,
when this happens the correct match is the listed GI value minus 1
(i.e., in NCBI, gi|324034994 is not 527nt long, but 324034993 IS 527nt
long).
That is strange.
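
The minus-one pattern described above could be checked row by row with
something like the sketch below. It assumes the true lengths have been
looked up in GenBank separately and stored in a dict keyed by integer
GI; the function name and the dict are hypothetical. The only length
used in the example is the one reported in this thread (324034993 is
527nt).

```python
def classify_length(gi, reported_length, known_lengths):
    """Say which GI, if any, the reported subject length belongs to.

    known_lengths -- dict mapping GI (int) to true length (int),
                     assumed to come from a separate GenBank lookup
    """
    if known_lengths.get(gi) == reported_length:
        return "matches gi"
    if known_lengths.get(gi - 1) == reported_length:
        return "matches gi-1"
    return "no match"


# Length taken from the example in this thread.
known = {324034993: 527}
print(classify_length(324034994, 527, known))  # matches gi-1
```

A GI that is absent from the lookup falls through to "no match", so the
result is only as complete as the lengths that were fetched.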
Peter
-- 
Jennifer Jackson
http://galaxyproject.org