On Wed, Nov 24, 2010 at 11:02 PM, Bossers, Alex wrote:
> Peter,
>
> a nice extra feature welcomed by myself would be to allow the
> optional inclusion of the Hit_defline in the output table. In many
> workflows we would need to blast, get the id from the table, use
> id to get human readable name and insert/use it... which is silly
> of course since that data is available in the xml anyway.
>
> I don't know Python or hg changesets, but I modified
> your Python and XML files to incorporate this (see attachment).
> By default it's the normal BLAST tabular output, but optionally it can
> include the defline.
> The hit_defline needed to be split (I hope I did it in a Pythonic
> way) to eliminate multiple descriptions separated by >gi (nt
> and nr) or plain semicolons for Swiss-Prot... maybe there
> are more but I'm not sure...
>
> Have a look and test it, and maybe it will find its way in some
> form into your suite. Anyway, it's very useful to us this way.
>
> cheers
> Alex
Hi Alex,
I'm glad to see the BLAST+ wrappers being used already,
and to get positive feedback.
I had a quick look at your modifications - I think it could
be made more beautiful, but it looks like it would work
fine. I understand the aim behind your suggested change,
but I have another solution in mind.
I was already planning to write another tool for splitting a
column in a tabular file - e.g. splitting on the pipe character
could be very useful to extract the GI number from a typical
NCBI identifier string. Such a tool could also be used on the
BLAST output to do what you are asking for (splitting the hit
IDs), or to grab a particular word from formatted text (by
splitting on spaces). I'm surprised this isn't in Galaxy already
to be honest - maybe it is and I haven't found it yet ;)
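To make the column-splitting idea concrete, here is a minimal Python sketch. The function name split_column and the sample row are made up for illustration; this is not an existing Galaxy tool, just one way such a tool could work.

```python
def split_column(lines, column, sep="|"):
    """Yield each tab-separated row with one column (0-based) split on sep.

    Hypothetical helper for illustration only; not an existing Galaxy tool.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        parts = fields[column].split(sep)
        # Replace the original column with the pieces it was split into.
        yield fields[:column] + parts + fields[column + 1:]


# Made-up example row using a typical NCBI-style identifier in column 2.
example = ["query1\tgi|12345|ref|NP_000001.1|\t98.5\n"]
rows = list(split_column(example, column=1))
gi_number = rows[0][2]  # the field after the leading "gi"
```

Note that an identifier ending in a pipe yields a trailing empty field, so a real tool would probably let the user pick which of the new columns to keep.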
I'd also like to explain that I deliberately kept the provided
XML to tabular functionality simple to start with - all it tries
to do is recreate the default tabular output, but even that
turned out to be non-trivial. I have several ideas for
extensions, which I will try to outline here.
The BLAST+ suite actually lets you ask for certain other
predefined columns in the tabular output. I am wondering
about offering a "full" tabular output option in the BLAST+
wrappers - this seems simpler than making the user pick
and choose which columns they want. e.g. for blastp:
The supported format specifiers are:
qseqid means Query Seq-id
qgi means Query GI
qacc means Query accession
sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ';'
sgi means Subject GI
sallgi means All subject GIs
sacc means Subject accession
sallacc means All subject accessions
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive-scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive-scoring matches
frames means Query and subject frames separated by a '/'
qframe means Query frame
sframe means Subject frame
Note that calculating and recording the above will
add computational cost and I/O load - so keeping the
default std set of columns as the default in the Galaxy
wrapper makes sense to me.
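As a rough sketch of how a wrapper might switch between the standard and a "full" column set, the Python below builds the blastp argument list. The function name build_blastp_args and the choice of which extra columns count as "full" are my assumptions; the column names themselves come from the specifier list above, and the twelve standard columns are the documented defaults.

```python
# The twelve default columns of BLAST+ tabular output (-outfmt 6).
STD_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
               "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Assumption: one possible choice of extras for a "full" table.
EXTRA_COLUMNS = ["sallseqid", "score", "nident", "positive", "gaps",
                 "ppos", "qframe", "sframe", "qseq", "sseq"]


def build_blastp_args(query, db, out, full=False):
    """Return a blastp command as an argument list (hypothetical helper)."""
    columns = STD_COLUMNS + (EXTRA_COLUMNS if full else [])
    # The format number and column names are passed as ONE argument.
    outfmt = "6 " + " ".join(columns)
    return ["blastp", "-query", query, "-db", db, "-out", out,
            "-outfmt", outfmt]


std_args = build_blastp_args("query.fasta", "nr", "hits.tabular")
full_args = build_blastp_args("query.fasta", "nr", "hits.tabular", full=True)
```

The list form is meant for something like subprocess, which keeps the quoted column string intact without any shell escaping.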
Potentially the BLAST XML output can be converted
into this full tabular output too - I expect so but it may
not be so easy.
Another avenue by which to extend the BLAST+ suite
is to teach Galaxy about the BLAST ASN.1 output
format, and wrap the new blast_formatter application
for turning ASN.1 into another BLAST output format.
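A wrapper for that could be quite thin. The sketch below only builds the blast_formatter argument list; the function name is hypothetical, and it assumes the search was run with -outfmt 11 so that an ASN.1 archive is available to reformat.

```python
def build_formatter_args(archive, out, outfmt="6"):
    """Return a blast_formatter command as an argument list (hypothetical).

    Assumes `archive` was written by a BLAST+ search run with -outfmt 11.
    """
    return ["blast_formatter", "-archive", archive,
            "-outfmt", outfmt, "-out", out]


# Example: turn a saved archive into the default tabular output.
formatter_args = build_formatter_args("hits.asn", "hits.tabular")
```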
Regards,
Peter