On Thu, Nov 21, 2013 at 6:11 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Thu, Nov 21, 2013 at 5:59 PM, Dooley, Damion <Damion.Dooley@bccdc.ca> wrote:
I hear you, re. guessing about data - it just sounded like this would be a rare case. Is it happening on particular database searches? Now that I look at it I'm wondering in what situation the IndexError would be triggered. I'm diving into the details here just because I don't want to discover later on there that I'd made some assumptions about the id parsing.
Yes, it is rare - but the fix was triggered by falling over the following example from a BLAST against the NR database, shown in the commit comment:
https://github.com/peterjc/galaxy_blast/commit/5210af6622bf905ecb09ffbf6d7d3...
<Hit> <Hit_num>146</Hit_num> <Hit_id>gi|157832142|pdb|1NKD|A</Hit_id> <Hit_def>Chain A, Atomic Resolution (1.07 Angstroms) Structure Of The Rop Mutant <2aa> >gi|157833740|pdb|1RPO|A Chain A, Restored Heptad Pattern Continuity Does Not Alter The Folding Of A 4- Alpha-Helical Bundle</Hit_def> <Hit_accession>1NKD_A</Hit_accession> <Hit_len>65</Hit_len>
Spliting on just the greater than sign broke on the <2aa> comment. Splitting on space then greater than sign is slightly less fragile.
Ideally this multi-entry field would be presented explicitly in the XML, something I suggested in passing on this related blog post: http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions....
You can see the problem entry like this:
$ blastdbcmd -entry 157832142 -db nr -outfmt "%t" Chain A, Atomic Resolution (1.07 Angstroms) Structure Of The Rop Mutant <2aa> Chain A, Restored Heptad Pattern Continuity Does Not Alter The Folding Of A 4- Alpha-Helical Bundle
To see if there are any more naught entries in the NR database, I am trying this command (no output yet, might take a while though):
$ time blastdbcmd -entry all -db nr -outfmt "%t" | grep ">" ...
tyr Variant Of Myoglobin Chain L, Fv Fragment Of Mouse Monoclonal Antibody D1.3 (BalbC, IGG1, K) Engineered Mutant Pro95l->ser On Variant Chain L Glu81- >asp And Chain H Leu312->val Chain A, Alteration Of Axial Coordination By Protein Engineering In Myoglobin. Bis-Imidazole Ligation In The His64-- val(Slash)val68-->his Double Mutant Chain B, Alteration Of Axial Coordination By Protein Engineering In Myoglobin. Bis-Imidazole Ligation In The His64-- val(Slash)val68-->his Double Mutant Very similar to alpha-NACs, (Nascent polypeptide > [Arabidopsis thaliana] Chain A, Golgi Mannosidase Ii D204a Catalytic Nucleophile Mutant Complex With Methyl(Alpha-D-Mannopyranosyl)-(1->3)-S-[(Alpha-D-Mannopyranosyl)-(1- 6)]-Alpha-D-Mannopyranoside
arg, Met302->leu) In Complex With 1-{5-[2-(1-Methyl-1h- Pyrazolo[4,3-D]pyrimidin-7-Ylamino)-Ethyl]-Thiazol-2-Yl}-3- (3-Trifluoromethyl-Phenyl)-Urea Chain A, Crystal Structure Of Mouse Aurora A (Asn186->gly, Lys240- arg, Met302->leu) In Complex With [7-(2-{2-[3-(3-Chloro- Phenyl)-Ureido]-Thiazol-5-Yl}-Ethylamino)-Pyrazolo[4,3- D]pyrimidin-1-Yl]-Acetic Acid Glutaminase > [Sulfurimonas gotlandica GD1] exonuclease VIII 5 > 3 specific dsDNA exonuclease [uncultured phage MedDCM-OCT-S12-C102] exonuclease VIII 5 > 3 specific dsDNA exonuclease [uncultured organism MedDCM-OCT-S08-C700]
thr] Insulin Mutant (Pt Insulin) Chain A, Crystal Structure Of Human Neutrophil Peptide 2, Hnp-2 (variant Gly16- > D-ala) Chain B, Crystal Structure Of Human Neutrophil Peptide 2, Hnp-2 (variant Gly16- > D-ala) Chain C, Crystal Structure Of Human Neutrophil Peptide 2, Hnp-2 (variant Gly16- > D-ala) Chain D, Crystal Structure Of Human Neutrophil Peptide 2, Hnp-2 (variant Gly16- > D-ala)
With hindsight, I should have asked this to include the protein ID too, but in any case there are about 2650 examples, most using an arrow like --> or -> but quite a few HTML italics tags too - plus things like the <2aa>, <R> and <ESP> as well: $ time blastdbcmd -entry all -db nr -outfmt "%t" | grep ">" Alpha (1->4) glucosyltransferase [Mycobacterium tuberculosis H37Rv] Alpha (1->4) glucosyltransferase [Mycobacterium tuberculosis H37Rv] ssDNA exonuclease, 5' --> 3'-specific [Escherichia coli str. K-12 substr. MG1655] ssDNA exonuclease, 5' --> 3'-specific [Escherichia coli str. K-12 substr. W3110] ... PREDICTED: putative C->U-editing enzyme APOBEC-4-like, partial [Alligator sinensis] real 1842m2.024s user 1835m55.169s sys 15m17.406s (i.e. about 30 hours to dump all the titles from the NR database). The bad news (in terms of splitting the XML description) is there are currently 30 examples with " >" (space greater-than) in the description (many look purely accidental due to past line wrapping I would guess): Chain B, Crystal Structure Of Trypsin Complexed With The Bpti Variant (Tyr35- >gly) Chain A, Structural Characterization Of Heme Ligation In The His64-- putative NADH dehydrogenase Fe-S protein 7 >Feature gb|DQ213771 [Taeniopygia guttata] Chain A, Crystal Structure Of Mouse Aurora A (asn186->gly, Lys240->arg, Met302- >leu) In Complex With 1-{5-[2-(thieno[3,2-d]pyrimidin-4-ylamino)- Ethyl]- Thiazol-2-yl}-3-(3-trifluoromethyl-phenyl)-urea Chain A, Crystal Structure Of Mouse Aurora A (asn186->gly, Lys240->arg, Met302- >leu) In Complex With 1-(3-chloro-phenyl)-3-{5-[2-(thieno[3,2- D]pyrimidin-4-ylamino)- Ethyl]-thiazol-2-yl}-urea [sns-314] Chain A, Crystal Structure Of Mouse Aurora A (Asn186->gly, Lys240- transporter, probably Low affinity (KM > 3mM) ammonia uptake carrier, AmtB [Bifidobacterium asteroides PRL2011] transporter, probably Low affinity (KM > 3mM) ammonia uptake carrier, AmtB [Bifidobacterium asteroides PRL2011] Chain H, Fv Fragment Of Mouse Monoclonal Antibody D1.3 (BalbC, IGG1, K) Engineered Mutant Pro95l->ser On Variant Chain L Glu81- >asp And Chain H Leu312->val Chain A, Solution Structure Of The Monomeric [thr(B27)->pro,Pro(B28)- putative oxidase > apramycin biosynthesisN-methyltransferase [Streptoalloteichus tenebrarius] First of four adjacent putative subtilase family > [Arabidopsis thaliana] Third of four adjacent putative subtilase family > [Arabidopsis thaliana] putative ORF >60AA, partial [Escherichia coli] putative ORF >24AA, partial [Escherichia coli] Chain B, Solution Structure Of The Monomeric [thr(B27)->pro,Pro(B28)-
thr] Insulin Mutant (Pt Insulin) Chain A, Crystal Structure Of Trypsin Complexed With The Bpti Variant (Tyr35- >gly)
Regards, Peter