On Thu, Nov 21, 2013 at 5:59 PM, Dooley, Damion Damion.Dooley@bccdc.ca wrote:
I hear you, re. guessing about data - it just sounded like this would be a rare case. Is it happening on particular database searches? Now that I look at it I'm wondering in what situation the IndexError would be triggered. I'm diving into the details here just because I don't want to discover later on that I'd made some assumptions about the id parsing.
Yes, it is rare - but the fix was triggered by falling over the following example from a BLAST against the NR database, shown in the commit comment:
https://github.com/peterjc/galaxy_blast/commit/5210af6622bf905ecb09ffbf6d7d3...
<Hit>
  <Hit_num>146</Hit_num>
  <Hit_id>gi|157832142|pdb|1NKD|A</Hit_id>
  <Hit_def>Chain A, Atomic Resolution (1.07 Angstroms) Structure Of The Rop Mutant <2aa> >gi|157833740|pdb|1RPO|A Chain A, Restored Heptad Pattern Continuity Does Not Alter The Folding Of A 4- Alpha-Helical Bundle</Hit_def>
  <Hit_accession>1NKD_A</Hit_accession>
  <Hit_len>65</Hit_len>
Splitting on just the greater-than sign broke on the <2aa> comment. Splitting on a space followed by a greater-than sign is slightly less fragile.
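As a rough illustration of the difference (a minimal sketch only, not the actual code from the commit; the function names here are made up for this example), the two splitting strategies behave like this on the Hit_def text above:

```python
# Hit_def text from the NR example above, with the XML tags removed.
HIT_DEF = (
    "Chain A, Atomic Resolution (1.07 Angstroms) Structure Of The Rop Mutant <2aa> "
    ">gi|157833740|pdb|1RPO|A Chain A, Restored Heptad Pattern Continuity "
    "Does Not Alter The Folding Of A 4- Alpha-Helical Bundle"
)


def split_naive(hit_def):
    """Hypothetical helper: split on every greater-than sign."""
    return hit_def.split(">")


def split_safer(hit_def):
    """Hypothetical helper: split only on a space followed by a greater-than sign."""
    return hit_def.split(" >")


naive = split_naive(HIT_DEF)
safer = split_safer(HIT_DEF)

# The naive split also cuts at the ">" that closes "<2aa>", leaving a
# spurious fragment between the two real entries.
print(len(naive), "pieces from the naive split")
# The safer split yields one piece per merged entry in this example.
print(len(safer), "pieces from the safer split")
for piece in safer:
    print(repr(piece))
```

This is still only a heuristic: a title that itself contains a space followed by a greater-than sign would be split in the wrong place.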
Ideally this multi-entry field would be presented explicitly in the XML, something I suggested in passing on this related blog post: http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions....
You can see the problem entry like this:
$ blastdbcmd -entry 157832142 -db nr -outfmt "%t"
Chain A, Atomic Resolution (1.07 Angstroms) Structure Of The Rop Mutant <2aa> Chain A, Restored Heptad Pattern Continuity Does Not Alter The Folding Of A 4- Alpha-Helical Bundle
To see if there are any more naughty entries in the NR database, I am trying this command (no output yet, though it might take a while):
$ time blastdbcmd -entry all -db nr -outfmt "%t" | grep ">" ...
Regards,
Peter