Re: [galaxy-user] [galaxy-bugs] GI errors in the megablast table of results ?
Dear Sandrine,

Thanks for pointing out this issue. The BLAST databases we have on Galaxy are from last year, while those on NCBI website are the latest (Jan 2012). As pointed out on NCBI website (http://www.ncbi.nlm.nih.gov/Sitemap/sequenceIDs.html), it appears that each time any change is made to a sequence/database, GI numbers change as well. This is perhaps why you're observing discrepancies in GI numbers and lengths between megablast outputs on Galaxy and NCBI. I'm currently in the process of downloading the latest BLAST databases from NCBI, and I'll let you know when they're available for use on Galaxy.

Thanks for your patience,
Guru
Galaxy team

On Wed, Nov 9, 2011 at 8:03 AM, Sandrine Hughes <Sandrine.Hughes@ens-lyon.fr> wrote:
Dear all,
I’m not sure where I should send this email, so I apologize if this is the wrong place.
I am having trouble with the Megablast program available in NGS Mapping and I hope that you can help. I think there might be a problem with the output table, notably a shift between the GI numbers and the parameters associated with them.
Here are the details:
I. First, what I have done:
I used the program to identify the species that I have in a mix of sequences, using the following options:
Database: nt 27-Jun-2011
Word size: 16
Identity: 90.0
Cutoff: 0.001
Filter out low complexity regions: Yes
I ran the analyses twice and obtained exactly the same results (I used the online version of Galaxy, not a local one).
II. Second, I analysed the data obtained for one of my sequences (1-202). The following lines are the beginning of the table that I obtained after the megablast, plus two problematic lines:
1-202 312182292 484 99.33 150 1 0 1 150 1 150 2e-75 289.0
1-202 312182201 476 99.33 150 1 0 1 150 1 150 2e-75 289.0
1-202 308228725 928 99.33 150 1 0 1 150 19 168 2e-75 289.0
1-202 308228711 938 99.33 150 1 0 1 150 22 171 2e-75 289.0
1-202 308197083 459 99.33 150 1 0 1 150 10 159 2e-75 289.0
1-202 300392378 920 99.33 150 1 0 1 150 10 159 2e-75 289.0
1-202 300392376 918 99.33 150 1 0 1 150 9 158 2e-75 289.0
1-202 300392375 922 99.33 150 1 0 1 150 11 160 2e-75 289.0
1-202 300392374 931 99.33 150 1 0 1 150 21 170 2e-75 289.0
1-202 300392373 909 99.33 150 1 0 1 150 21 170 2e-75 289.0
1-202 300392371 1172 99.33 150 1 0 1 150 9 158 2e-75 289.0
...
1-202 179366399 151762 98.67 150 2 0 1 150 46880 47029 6e-73 281.0
1-202 58617849 511 98.67 150 2 0 1 150 21 170 6e-73 281.0
III. Third, what I’ve noticed:
My first problem was that, among all the species identified, two were very different from the expected ones (the last two lines). So I checked whether that was plausible for this sequence by independently running a megablast at NCBI with similar options. I was not able to find these two species in the results.
I then checked the hits listed in the table above and found a second problem. In the table, the second column gives the GI of the database hit and the third column gives the length of the database hit. However, when I manually checked the length of the GI in NCBI, the value in the table was incorrect. For GI 312182292, the length should be 580 and not 484. By checking different lines, I noticed that the length given for a GI corresponds to the length of GI-1. As you can see in the above table, some GIs are consecutive (300392376, 300392375,...). When checking the length of 300392376 in NCBI, I should have 920. But when I checked 300392375, I found 918. And this was true for the following lines: 300392374 normally gives 922 and 300392373 gives 931...
My conclusion at that point was that there was a shift of –1 between the GI and the other parameters of the line (indeed, the parameters in the remaining columns agree with the length of GI-1). However, that’s not always true. For some GIs given in the table (for example, the last two lines), the parameters of GI-1 are completely different. So I suppose that there is a problem with the GI sorting during the megablast, but I’m not able to define it clearly.
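The manual check described in III can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the reported rows are copied from the table above, the NCBI lengths are the ones quoted in this message, and the function name `classify` and its labels are hypothetical, not part of Galaxy or BLAST.

```python
from typing import Dict, List, Tuple

# (subject GI, subject length) pairs copied from columns 2 and 3 of the table.
REPORTED: List[Tuple[int, int]] = [
    (312182292, 484),
    (300392376, 918),
    (300392375, 922),
    (300392374, 931),
]

# Lengths looked up by hand at NCBI, as quoted in the message.
NCBI_LENGTH: Dict[int, int] = {
    312182292: 580,
    300392375: 918,
    300392374: 922,
    300392373: 931,
}


def classify(gi: int, reported: int, ncbi: Dict[int, int]) -> str:
    """Label one table row by comparing its length with NCBI's records.

    The labels are hypothetical names chosen for this sketch.
    """
    if ncbi.get(gi) == reported:
        return "match"      # reported length belongs to this GI
    if ncbi.get(gi - 1) == reported:
        return "shifted"    # reported length belongs to GI-1
    if gi in ncbi:
        return "mismatch"   # known length differs, and GI-1 does not explain it
    return "unknown"        # no NCBI length available for GI or GI-1


if __name__ == "__main__":
    for gi, length in REPORTED:
        print(gi, length, classify(gi, length, NCBI_LENGTH))
```

Rows labelled "shifted" are the ones consistent with the GI-1 pattern; rows labelled "mismatch" or "unknown" would still need to be checked by hand.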
IV. Fourth, confirmed with another dataset:
To be sure that the problem was not linked to my data or my process, I asked a colleague to run a megablast on independent data. The conclusions were similar to mine: a shift between the GI given in the table and the associated parameters, which most of the time, but not always, correspond to GI-1.
Can you confirm that there is a problem with the output of the megablast available in Galaxy? If so, do you think you can fix it?
Many thanks for your help,
Best regards,
Sandrine
--
Graduate student, Bioinformatics and Genomics
Makova lab/Galaxy team
Penn State University
505 Wartik lab
University Park PA 16802
guru@psu.edu
Hello all,

Did this issue get resolved? If Sandrine was right about there being an off-by-one error in the GI numbers in the BLAST tabular output, it could be a bug in the 'legacy' blastall command. I say 'legacy' BLAST because that's what Galaxy's NGS 'megablast' tool is using internally (as opposed to NCBI's replacement, BLAST+).

Peter

On Wed, Jan 25, 2012 at 3:14 PM, Guru Ananda <guru@psu.edu> wrote:
Dear Sandrine,
Thanks for pointing out this issue. The BLAST databases we have on Galaxy are from last year, while those on NCBI website are the latest (Jan 2012). As pointed out on NCBI website (http://www.ncbi.nlm.nih.gov/Sitemap/sequenceIDs.html), it appears that each time any change is made to a sequence/database, GI numbers change as well. This is perhaps why you're observing discrepancies in GI numbers and lengths between megablast outputs on Galaxy and NCBI. I'm currently in the process of downloading the latest BLAST databases from NCBI, and I'll let you know when they're available for use on Galaxy.
Thanks for your patience,
Guru
Galaxy team
Participants (2): Guru Ananda, Peter Cock