On Wed, Nov 24, 2010 at 11:02 PM, Bossers, Alex wrote:
Peter,
A nice extra feature that I would welcome is the optional inclusion of the Hit_defline in the output table. In many workflows we would need to BLAST, get the ID from the table, use the ID to look up a human-readable name and insert/use it... which is silly of course, since that data is available in the XML anyway.
I don't know Python or hg changesets, but I modified your Python and XML files to incorporate this (see attachment). By default it's the normal BLAST tabular output, but optionally it can include the defline. The hit_defline needed to be split (I hope I did it in a Pythonic way) to eliminate multiple descriptions separated by >gi (nt and nr) or plain semicolons for SwissProt... maybe there are more separators, but I'm not sure.
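For illustration, the kind of splitting described above could look roughly like the following. This is only a sketch and not the attached code: the function name `first_description` is hypothetical, and it assumes that merged nt/nr entries are joined by " >gi|" and that SwissProt-style entries are joined by semicolons, as described in this message.

```python
import re


def first_description(hit_def):
    """Return only the first description from a merged BLAST hit definition.

    Assumptions (taken from the description in this thread, not verified
    against every database): nt/nr merge entries with ">gi|", and
    SwissProt-style deflines separate entries with semicolons.
    """
    # Keep everything before the first ">gi|" separator, if there is one.
    first = re.split(r"\s*>gi\|", hit_def, maxsplit=1)[0]
    # Then keep everything before the first semicolon, if there is one.
    first = first.split(";", 1)[0]
    return first.strip()


merged = "hypothetical protein [E. coli] >gi|123|ref|XP_1| hypothetical protein [Shigella]"
print(first_description(merged))
```

A description that legitimately contains a semicolon would be truncated by this approach, so the separators may need adjusting per database.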
Have a look and test it, and maybe it will find its way in some form into your suite. Either way, it's very useful to us as it is.
cheers Alex
Hi Alex,

I'm glad to see the BLAST+ wrappers being used already, and to get positive feedback. I had a quick look at your modifications - I think it could be made more beautiful, but it looks like it would work fine.

I understand the aim behind your suggested change, but I have another solution in mind. I was already planning to write another tool for splitting a column in a tabular file - e.g. splitting on the pipe character could be very useful to extract the GI number from a typical NCBI identifier string. Such a tool could also be used on the BLAST output to do what you are asking for (splitting the hit IDs), or to grab a particular word from formatted text (by splitting on spaces). I'm surprised this isn't in Galaxy already to be honest - maybe it is and I haven't found it yet ;)

I'd also like to explain that I deliberately kept the provided XML to tabular functionality simple to start with - all it tried to do is recreate the default tabular output, but even that turned out to be non-trivial. I have several ideas for extension which I will try to outline here.

The BLAST+ suite actually lets you ask for certain other predefined columns in the tabular output. I am wondering about offering a "full" tabular output option in the BLAST+ wrappers - this seems simpler than making the user pick and choose which columns they want. e.g.
for blastp:

The supported format specifiers are:
    qseqid means Query Seq-id
    qgi means Query GI
    qacc means Query accession
    sseqid means Subject Seq-id
    sallseqid means All subject Seq-id(s), separated by a ';'
    sgi means Subject GI
    sallgi means All subject GIs
    sacc means Subject accession
    sallacc means All subject accessions
    qstart means Start of alignment in query
    qend means End of alignment in query
    sstart means Start of alignment in subject
    send means End of alignment in subject
    qseq means Aligned part of query sequence
    sseq means Aligned part of subject sequence
    evalue means Expect value
    bitscore means Bit score
    score means Raw score
    length means Alignment length
    pident means Percentage of identical matches
    nident means Number of identical matches
    mismatch means Number of mismatches
    positive means Number of positive-scoring matches
    gapopen means Number of gap openings
    gaps means Total number of gaps
    ppos means Percentage of positive-scoring matches
    frames means Query and subject frames separated by a '/'
    qframe means Query frame
    sframe means Subject frame

Note that calculating and recording the above will add computation cost and IO load - so keeping the standard (std) set of columns as the default in the Galaxy wrapper makes sense to me.

Potentially the BLAST XML output can be converted into this full tabular output too - I expect so, but it may not be so easy.

Another avenue by which to extend the BLAST+ suite is to teach Galaxy about the BLAST ASN.1 output format, and wrap the new blast_formatter application for turning ASN.1 into another BLAST output format.

Regards,

Peter
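
P.S. To make the column-splitting idea concrete, here is a rough sketch of what such a tool might do. It is not an existing Galaxy tool: the function name `split_column` and its arguments are hypothetical, and it assumes plain tab-separated input with 1-based column numbers.

```python
def split_column(lines, column, sep="|"):
    """Yield tab-separated lines with one column split into several.

    lines  -- iterable of tab-separated text lines
    column -- 1-based index of the column to split (hypothetical convention)
    sep    -- separator to split that column on, e.g. "|" for NCBI IDs
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Replace the chosen column with the pieces it splits into.
        parts = fields[column - 1].split(sep)
        yield "\t".join(fields[:column - 1] + parts + fields[column:])


rows = ["query1\tgi|12345|ref|NP_000001.1\t98.5"]
for row in split_column(rows, 2):
    print(row)
```

Splitting in place means rows whose identifiers contain different numbers of pieces end up with different numbers of columns, so a real tool might instead append a single chosen piece (e.g. the GI number) as a new column.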