NCBI Blast+ "BLAST XML to tabular" tool question
I'm doing a 1 step generic reporting tool along the lines of the "BLAST XML to tabular" script by Peter. I was just about to ask about this line, which looked pretty much like a bug: sallseqid = ";".join(name.split(None,1)[0] for name in hit_def.split(" >")) Then I found the patch from Nov 7th 2013: https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/bl... try: sallseqid = ";".join(name.split(None,1)[0] for name in hit_def.split(" >")) except IndexError as e: stop_err("Problem splitting multuple hits?\n%r\n--> %s" % (hit_def, e)) Yay! But what I've seen in recent XML output reports is that the ">" content has been changed to ">" . E.g. https://github.com/biopython/biopython/blob/master/Tests/Blast/mirna.xml <Hit> <Hit_num>66</Hit_num> <Hit_id>gi|195029385|ref|XR_047134.1|</Hit_id> <Hit_def>Drosophila grimshawi miR-7-RA (Dgri\mir-7), ncRNA >gi|195336156|ref|XR_048470.1| Drosophila sechellia miR-7-RA (Dsec\mir-7), ncRNA >gi|195585143|ref|XR_050309.1| Drosophila simulans miR-7-RA (Dsim\mir-7), ncRNA</Hit_def> <Hit_accession>XR_047134</Hit_accession> ... So perhaps a stop_err() could be avoided, if test is for ">" instead? I assume that no variants of python ElementTree.iterparse() will unescape content when returned via the iterator? Damion
On Wed, Nov 20, 2013 at 7:10 PM, Dooley, Damion <Damion.Dooley@bccdc.ca> wrote:
I'm doing a 1 step generic reporting tool along the lines of the "BLAST XML to tabular" script by Peter. I was just about to ask about this line, which looked pretty much like a bug:
sallseqid = ";".join(name.split(None,1)[0] for name in hit_def.split(" >"))
Then I found the patch from Nov 7th 2013:
https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/bl...
try: sallseqid = ";".join(name.split(None,1)[0] for name in hit_def.split(" >")) except IndexError as e: stop_err("Problem splitting multuple hits?\n%r\n--> %s" % (hit_def, e))
Yay! But what I've seen in recent XML output reports is that the ">" content has been changed to ">" . E.g.
https://github.com/biopython/biopython/blob/master/Tests/Blast/mirna.xml
<Hit> <Hit_num>66</Hit_num> <Hit_id>gi|195029385|ref|XR_047134.1|</Hit_id> <Hit_def>Drosophila grimshawi miR-7-RA (Dgri\mir-7), ncRNA >gi|195336156|ref|XR_048470.1| Drosophila sechellia miR-7-RA (Dsec\mir-7), ncRNA >gi|195585143|ref|XR_050309.1| Drosophila simulans miR-7-RA (Dsim\mir-7), ncRNA</Hit_def> <Hit_accession>XR_047134</Hit_accession> ...
So perhaps a stop_err() could be avoided, if test is for ">" instead? I assume that no variants of python ElementTree.iterparse() will unescape content when returned via the iterator?
Damion
On Wed, Nov 20, 2013 at 7:31 PM, Dooley, Damion <Damion.Dooley@bccdc.ca> wrote:
Woops - I realize now findtext() must be unescaping all ">", so Peter was trying to address other non-splitting occurances of " >" as per his patch notes. But perhaps a stop_err() isn't merrited in this case?
So ignore my test for ">" comment.
Regards,
Damion
OK - good. I was worried that there might be some inconsistency between different databases of versions of BLAST about how the > was encoded. As to why I treat this as a fatal error (calling stop_err), the alternative would be to issue a warning to stderr, and guess what the data ought to look like? That just seems like asking for trouble - a big red error should ensure I hear bug reports ;) Zen of Python: Errors should never pass silently. Peter
I hear you, re. guessing about data - it just sounded like this would be a rare case. Is it happening on particular database searches? Now that I look at it I'm wondering in what situation the IndexError would be triggered. I'm diving into the details here just because I don't want to discover later on there that I'd made some assumptions about the id parsing. Thanks, Damion On Wed, Nov 20, 2013 at 7:31 PM, Dooley, Damion <Damion.Dooley@bccdc.ca> wrote:
Woops - I realize now findtext() must be unescaping all ">", so Peter was trying to address other non-splitting occurances of " >" as per his patch notes. But perhaps a stop_err() isn't merrited in this case?
So ignore my test for ">" comment.
Regards,
Damion
OK - good. I was worried that there might be some inconsistency between different databases of versions of BLAST about how the > was encoded. As to why I treat this as a fatal error (calling stop_err), the alternative would be to issue a warning to stderr, and guess what the data ought to look like? That just seems like asking for trouble - a big red error should ensure I hear bug reports ;) Zen of Python: Errors should never pass silently. Peter
On Thu, Nov 21, 2013 at 5:59 PM, Dooley, Damion <Damion.Dooley@bccdc.ca> wrote:
I hear you, re. guessing about data - it just sounded like this would be a rare case. Is it happening on particular database searches? Now that I look at it I'm wondering in what situation the IndexError would be triggered. I'm diving into the details here just because I don't want to discover later on there that I'd made some assumptions about the id parsing.
Yes, it is rare - but the fix was triggered by falling over the following example from a BLAST against the NR database, shown in the commit comment: https://github.com/peterjc/galaxy_blast/commit/5210af6622bf905ecb09ffbf6d7d3... <Hit> <Hit_num>146</Hit_num> <Hit_id>gi|157832142|pdb|1NKD|A</Hit_id> <Hit_def>Chain A, Atomic Resolution (1.07 Angstroms) Structure Of The Rop Mutant <2aa> >gi|157833740|pdb|1RPO|A Chain A, Restored Heptad Pattern Continuity Does Not Alter The Folding Of A 4- Alpha-Helical Bundle</Hit_def> <Hit_accession>1NKD_A</Hit_accession> <Hit_len>65</Hit_len> Spliting on just the greater than sign broke on the <2aa> comment. Splitting on space then greater than sign is slightly less fragile. Ideally this multi-entry field would be presented explicitly in the XML, something I suggested in passing on this related blog post: http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions.... You can see the problem entry like this: $ blastdbcmd -entry 157832142 -db nr -outfmt "%t" Chain A, Atomic Resolution (1.07 Angstroms) Structure Of The Rop Mutant <2aa> Chain A, Restored Heptad Pattern Continuity Does Not Alter The Folding Of A 4- Alpha-Helical Bundle To see if there are any more naught entries in the NR database, I am trying this command (no output yet, might take a while though): $ time blastdbcmd -entry all -db nr -outfmt "%t" | grep ">" ... Regards, Peter
On Thu, Nov 21, 2013 at 6:11 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Thu, Nov 21, 2013 at 5:59 PM, Dooley, Damion <Damion.Dooley@bccdc.ca> wrote:
I hear you, re. guessing about data - it just sounded like this would be a rare case. Is it happening on particular database searches? Now that I look at it I'm wondering in what situation the IndexError would be triggered. I'm diving into the details here just because I don't want to discover later on there that I'd made some assumptions about the id parsing.
Yes, it is rare - but the fix was triggered by falling over the following example from a BLAST against the NR database, shown in the commit comment:
https://github.com/peterjc/galaxy_blast/commit/5210af6622bf905ecb09ffbf6d7d3...
<Hit> <Hit_num>146</Hit_num> <Hit_id>gi|157832142|pdb|1NKD|A</Hit_id> <Hit_def>Chain A, Atomic Resolution (1.07 Angstroms) Structure Of The Rop Mutant <2aa> >gi|157833740|pdb|1RPO|A Chain A, Restored Heptad Pattern Continuity Does Not Alter The Folding Of A 4- Alpha-Helical Bundle</Hit_def> <Hit_accession>1NKD_A</Hit_accession> <Hit_len>65</Hit_len>
Spliting on just the greater than sign broke on the <2aa> comment. Splitting on space then greater than sign is slightly less fragile.
Ideally this multi-entry field would be presented explicitly in the XML, something I suggested in passing on this related blog post: http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions....
You can see the problem entry like this:
$ blastdbcmd -entry 157832142 -db nr -outfmt "%t" Chain A, Atomic Resolution (1.07 Angstroms) Structure Of The Rop Mutant <2aa> Chain A, Restored Heptad Pattern Continuity Does Not Alter The Folding Of A 4- Alpha-Helical Bundle
To see if there are any more naught entries in the NR database, I am trying this command (no output yet, might take a while though):
$ time blastdbcmd -entry all -db nr -outfmt "%t" | grep ">" ...
tyr Variant Of Myoglobin Chain L, Fv Fragment Of Mouse Monoclonal Antibody D1.3 (BalbC, IGG1, K) Engineered Mutant Pro95l->ser On Variant Chain L Glu81- >asp And Chain H Leu312->val Chain A, Alteration Of Axial Coordination By Protein Engineering In Myoglobin. Bis-Imidazole Ligation In The His64-- val(Slash)val68-->his Double Mutant Chain B, Alteration Of Axial Coordination By Protein Engineering In Myoglobin. Bis-Imidazole Ligation In The His64-- val(Slash)val68-->his Double Mutant Very similar to alpha-NACs, (Nascent polypeptide > [Arabidopsis thaliana] Chain A, Golgi Mannosidase Ii D204a Catalytic Nucleophile Mutant Complex With Methyl(Alpha-D-Mannopyranosyl)-(1->3)-S-[(Alpha-D-Mannopyranosyl)-(1- 6)]-Alpha-D-Mannopyranoside
arg, Met302->leu) In Complex With 1-{5-[2-(1-Methyl-1h- Pyrazolo[4,3-D]pyrimidin-7-Ylamino)-Ethyl]-Thiazol-2-Yl}-3- (3-Trifluoromethyl-Phenyl)-Urea Chain A, Crystal Structure Of Mouse Aurora A (Asn186->gly, Lys240- arg, Met302->leu) In Complex With [7-(2-{2-[3-(3-Chloro- Phenyl)-Ureido]-Thiazol-5-Yl}-Ethylamino)-Pyrazolo[4,3- D]pyrimidin-1-Yl]-Acetic Acid Glutaminase > [Sulfurimonas gotlandica GD1] exonuclease VIII 5 > 3 specific dsDNA exonuclease [uncultured phage MedDCM-OCT-S12-C102] exonuclease VIII 5 > 3 specific dsDNA exonuclease [uncultured organism MedDCM-OCT-S08-C700]
thr] Insulin Mutant (Pt Insulin) Chain A, Crystal Structure Of Human Neutrophil Peptide 2, Hnp-2 (variant Gly16- > D-ala) Chain B, Crystal Structure Of Human Neutrophil Peptide 2, Hnp-2 (variant Gly16- > D-ala) Chain C, Crystal Structure Of Human Neutrophil Peptide 2, Hnp-2 (variant Gly16- > D-ala) Chain D, Crystal Structure Of Human Neutrophil Peptide 2, Hnp-2 (variant Gly16- > D-ala)
With hindsight, I should have asked this to include the protein ID too, but in any case there are about 2650 examples, most using an arrow like --> or -> but quite a few HTML italics tags too - plus things like the <2aa>, <R> and <ESP> as well: $ time blastdbcmd -entry all -db nr -outfmt "%t" | grep ">" Alpha (1->4) glucosyltransferase [Mycobacterium tuberculosis H37Rv] Alpha (1->4) glucosyltransferase [Mycobacterium tuberculosis H37Rv] ssDNA exonuclease, 5' --> 3'-specific [Escherichia coli str. K-12 substr. MG1655] ssDNA exonuclease, 5' --> 3'-specific [Escherichia coli str. K-12 substr. W3110] ... PREDICTED: putative C->U-editing enzyme APOBEC-4-like, partial [Alligator sinensis] real 1842m2.024s user 1835m55.169s sys 15m17.406s (i.e. about 30 hours to dump all the titles from the NR database). The bad news (in terms of splitting the XML description) is there are currently 30 examples with " >" (space greater-than) in the description (many look purely accidental due to past line wrapping I would guess): Chain B, Crystal Structure Of Trypsin Complexed With The Bpti Variant (Tyr35- >gly) Chain A, Structural Characterization Of Heme Ligation In The His64-- putative NADH dehydrogenase Fe-S protein 7 >Feature gb|DQ213771 [Taeniopygia guttata] Chain A, Crystal Structure Of Mouse Aurora A (asn186->gly, Lys240->arg, Met302- >leu) In Complex With 1-{5-[2-(thieno[3,2-d]pyrimidin-4-ylamino)- Ethyl]- Thiazol-2-yl}-3-(3-trifluoromethyl-phenyl)-urea Chain A, Crystal Structure Of Mouse Aurora A (asn186->gly, Lys240->arg, Met302- >leu) In Complex With 1-(3-chloro-phenyl)-3-{5-[2-(thieno[3,2- D]pyrimidin-4-ylamino)- Ethyl]-thiazol-2-yl}-urea [sns-314] Chain A, Crystal Structure Of Mouse Aurora A (Asn186->gly, Lys240- transporter, probably Low affinity (KM > 3mM) ammonia uptake carrier, AmtB [Bifidobacterium asteroides PRL2011] transporter, probably Low affinity (KM > 3mM) ammonia uptake carrier, AmtB [Bifidobacterium asteroides PRL2011] Chain H, Fv Fragment Of Mouse Monoclonal Antibody D1.3 (BalbC, IGG1, K) Engineered Mutant Pro95l->ser On Variant Chain L Glu81- >asp And Chain H Leu312->val Chain A, Solution Structure Of The Monomeric [thr(B27)->pro,Pro(B28)- putative oxidase > apramycin biosynthesisN-methyltransferase [Streptoalloteichus tenebrarius] First of four adjacent putative subtilase family > [Arabidopsis thaliana] Third of four adjacent putative subtilase family > [Arabidopsis thaliana] putative ORF >60AA, partial [Escherichia coli] putative ORF >24AA, partial [Escherichia coli] Chain B, Solution Structure Of The Monomeric [thr(B27)->pro,Pro(B28)-
thr] Insulin Mutant (Pt Insulin) Chain A, Crystal Structure Of Trypsin Complexed With The Bpti Variant (Tyr35- >gly)
Regards, Peter
participants (2)
-
Dooley, Damion
-
Peter Cock