So, Jen, I'm not sure we're talking about the same ID change. I am under the impression that GenBank does not change its GI numbers for its entries. Plus, it now looks like the sequence length reported for each hit by Galaxy's megablast does not match the GI number reported on the same row, but the GI number before it. Because the "-1" rule is so consistent, this seems less and less likely to have anything to do with NCBI changing its GI numbers to make room for new entries or something. In other words, there is a shift, as if 1 were added to each NCBI GI number in Galaxy before Galaxy produces the output file.

I need someone to tell me if I can trust the output. Basically I see it this way: every hit row from Galaxy megablast actually has information for two NCBI entries, the one that shares the GI output and the one before it that shares the sequence length output. Which one is the hit I should be using? On some occasions, the NCBI entry that shares the GI output from Galaxy is VERY distantly related to the NCBI entry that shares the subject sequence length output from Galaxy, and I don't know which to pick. Is this problem well understood yet?

On Tue, Apr 24, 2012 at 10:52 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Mon, Apr 23, 2012 at 11:41 PM, Sarah Hicks <garlicscape@gmail.com> wrote:
Peter, you requested an example, here are the first five hits for my first query sequence (OTU#0)
0 324034994 527 93.23 266 13 5 1 265 22 283 7e-102 379.0
0 56181650 513 93.26 267 10 8 1 265 25 285 7e-102 379.0
0 314913953 582 91.79 268 13 9 1 265 24 285 2e-92 347.0
0 305670062 281 92.52 254 14 5 4 256 32 281 2e-92 347.0
0 310814066 1180 91.73 266 14 7 1 265 24 282 9e-92 345.0
You will notice there are 13 columns, one in addition to the 12 column titles you explained. This is because there is a column between sseqID and pident.
I see now - the megablast_wrapper.py calls megablast (from the old legacy NCBI blast suite) which does indeed produce 12 column tabular output. But the wrapper script then edits the output:
It appears to split column 2 in two at the underscore, which is intended to give the match ID and the length as separate columns. This puzzles me, but I haven't used the legacy BLAST tabular output for a while. With BLAST+ you can ask for the query or subject length explicitly as their own columns, so we don't have this problem.
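To make the column split concrete, here is a minimal sketch of that kind of post-processing. It is not the actual code from megablast_wrapper.py. It assumes the subject ID in the database looks like "<GI>_<length>", and both the function name split_subject_column and the sample input line are hypothetical (the input is reconstructed from the first hit quoted above).

```python
def split_subject_column(line):
    """Turn a 12-column tabular BLAST line into 13 columns.

    Assumption: column 2 holds "<id>_<length>", so splitting at the
    last underscore yields the subject ID and the subject length.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    subject_id, sep, subject_len = fields[1].rpartition("_")
    if not sep:
        raise ValueError(f"no underscore in subject column: {fields[1]!r}")
    # Query ID, then the two new columns, then the remaining ten.
    return fields[:1] + [subject_id, subject_len] + fields[2:]


# Hypothetical 12-column input line, reconstructed from the first hit above.
raw = "0\t324034994_527\t93.23\t266\t13\t5\t1\t265\t22\t283\t7e-102\t379.0"
row = split_subject_column(raw)
print(len(row), row[1], row[2])
```

If the split works this way, the ID and the length on a row come from the same database entry name, so a mismatch with NCBI would point at how the database was built rather than at the split itself. That is only a guess until the wrapper and the database are checked.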
The megablast_wrapper.py also re-formats the floating point score in the last column; apparently the NCBI style could cause problems with the Galaxy filter tool.
In the metagenomic tutorial the first 4 columns are explained, and column 3 is described as length of sequence in database (or length of the subject sequence).
This is the problem column. Only one of the subject GI numbers above has a length in NCBI that matches the subject length reported here. This has caused me to wonder if I can trust the hit info. In all cases that I've checked, when this happens the correct match is the listed GI value minus 1 (i.e., in NCBI, gi|324034994 is not 527 nt long, but gi|324034993 IS 527 nt long).
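The "-1" check described above could be automated along these lines. This is only a sketch: the function name classify_length_match is hypothetical, and the ncbi_lengths dictionary has to be filled in by hand (or by some lookup of your own) with lengths taken from NCBI. The only value used here is the one reported in this thread.

```python
def classify_length_match(gi, reported_len, ncbi_lengths):
    """Say which GI the reported subject length agrees with.

    ncbi_lengths maps GI number -> sequence length as looked up in NCBI.
    Returns "gi" if the listed GI has that length, "gi-1" if the
    previous GI does, and "unknown" otherwise (including missing lookups).
    """
    if ncbi_lengths.get(gi) == reported_len:
        return "gi"
    if ncbi_lengths.get(gi - 1) == reported_len:
        return "gi-1"
    return "unknown"


# Length known from this thread: gi|324034993 is 527 nt long.
ncbi_lengths = {324034993: 527}

# First hit from the Galaxy output: GI 324034994, reported length 527.
print(classify_length_match(324034994, 527, ncbi_lengths))
```

Running this over every hit row, with the lengths looked up for both the GI and the GI minus 1, would show whether the shift really applies to all rows or only to some.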
That is strange.
Peter