Including descriptions etc in extended BLAST+ tabular output
Hello all,

FAO: Administrators of local Galaxy instances using the NCBI BLAST+ wrappers.

Over on the galaxy_blast repository I have been updating the NCBI BLAST+ wrappers (including unit tests) to work with the current release, NCBI BLAST+ 2.2.28 (aka BLAST 2.2.28+): https://github.com/peterjc/galaxy_blast

The initial set of changes is now on the Test Tool Shed, http://testtoolshed.g2.bx.psu.edu/view/peterjc/ncbi_blast_plus

This includes a workaround for a known regression in the makeblastdb tool dealing with duplicated identifiers: https://github.com/peterjc/galaxy_blast/commit/349e31c6cec4429c5523fde5975e2...

In terms of end-user features, the big improvement in the BLAST+ 2.2.28 release was the ability to get the BLAST match descriptions in the tabular output, and other fields: http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions....

staxids means Subject Taxonomy ID(s), separated by a ';'
sscinames means Subject Scientific Name(s), separated by a ';'
scomnames means Subject Common Name(s), separated by a ';'
sblastnames means Subject Blast Name(s), separated by a ';' (in alphabetical order)
sskingdoms means Subject Super Kingdom(s), separated by a ';' (in alphabetical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a '<>'
sstrand means Subject Strand
qcovs means Query Coverage Per Subject
qcovhsp means Query Coverage Per HSP

On this branch I am including the new salltitles field as the 25th column in the extended BLAST tabular output offered within the Galaxy interface: https://github.com/peterjc/galaxy_blast/tree/c25

However, I'm not so sure about the taxonomy fields. Since (thus far) they are not available via the XML, I am leaning towards introducing a third tabular mode, e.g.

* Standard 12 columns (can convert from XML)
* Extended 25 columns (can convert from XML)
* Extended also with taxonomy (cannot currently convert from XML)

Alternatively, we could offer a pick-your-own columns route (in all the primary BLAST tools, handled via macros):

* Standard 12 columns (can convert from XML)
* Extended 25 columns (can convert from XML)
* Pick your own columns from the full NCBI list (depending on columns, can convert from XML)

This is inspired by JJ's changes to the BLAST XML to tabular conversion tool for Galaxy-P, https://github.com/jmchilton/galaxy_blast/commit/d79afc03522768323494818a40a...

I would be much keener on the pick-your-own columns option if it were possible for the tool to record arbitrary column names for a tabular file in Galaxy's metadata (I can't find a Trello card, but I'm sure I've asked about this before).

Any thoughts or comments? e.g. Hurry up and just release this branch adding the hit descriptions as column 25 - we want that now ;) [*]

Regards,

Peter

[*] For our local instance, the taxonomy stuff will be useful, but right now I would prioritise the description, which we currently get via the BLAST XML using this tool: https://github.com/peterjc/galaxy_blast/tree/master/tools/blastxml_to_top_de...
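The separators listed in the post above (';' for the taxonomy fields, '<>' for salltitles) suggest how a downstream script might unpack the multi-valued fields. The following is only a minimal Python sketch: the function name `split_fields` and the sample row are invented for illustration and are not part of the wrappers.

```python
# Separators as described in the post: ';' for the taxonomy fields,
# '<>' for salltitles. Any field not listed here is kept as a plain string.
MULTI_VALUE_SEPARATORS = {
    "staxids": ";",
    "sscinames": ";",
    "scomnames": ";",
    "sblastnames": ";",
    "sskingdoms": ";",
    "salltitles": "<>",
}


def split_fields(line, columns):
    """Turn one tab-separated BLAST row into a dict.

    `columns` is the list of field names in the order they were requested.
    Multi-valued fields become lists; everything else stays a string.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) != len(columns):
        raise ValueError(
            "Expected %d columns, found %d" % (len(columns), len(values))
        )
    record = {}
    for name, value in zip(columns, values):
        separator = MULTI_VALUE_SEPARATORS.get(name)
        record[name] = value.split(separator) if separator else value
    return record


# Invented sample row, purely for illustration (not real BLAST output).
columns = ["qseqid", "sseqid", "staxids", "salltitles"]
row = "query1\tsubjectA\t9606;10090\tfirst title<>second title\n"
record = split_fields(row, columns)
```

The column list has to be supplied by the caller because a tabular file does not record its own column names, which is the metadata limitation raised in the post.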
Peter,

It's good news that the description field is available in tabular format. That has been the primary reason for using the XML format. The tabular format allows use of the task parallel feature.

I think you have good default options for output format. But I would think we should offer the option to get all available result information. The multi-select checkboxes could serve that purpose.

Is NCBI good about maintaining the column positions of their tabular output over successive versions?

Thanks,

JJ

On 11/28/13, 11:13 AM, Peter Cock wrote:
Hello all,
FAO: Administrators of local Galaxy instances using the NCBI BLAST+ wrappers.
Over on the galaxy_blast repository I have been updating the NCBI BLAST+ wrappers (including unit tests) to work with the current release, NCBI BLAST+ 2.2.28 (aka BLAST 2.2.28+): https://github.com/peterjc/galaxy_blast
The initial set of changes is now on the Test Tool Shed, http://testtoolshed.g2.bx.psu.edu/view/peterjc/ncbi_blast_plus
This includes a workaround for a known regression in the makeblastdb tool dealing with duplicated identifiers: https://github.com/peterjc/galaxy_blast/commit/349e31c6cec4429c5523fde5975e2...
In terms of end-user features, the big improvement in the BLAST+ 2.2.28 release was the ability to get the BLAST match descriptions in the tabular output, and other fields: http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions....
staxids means Subject Taxonomy ID(s), separated by a ';'
sscinames means Subject Scientific Name(s), separated by a ';'
scomnames means Subject Common Name(s), separated by a ';'
sblastnames means Subject Blast Name(s), separated by a ';' (in alphabetical order)
sskingdoms means Subject Super Kingdom(s), separated by a ';' (in alphabetical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a '<>'
sstrand means Subject Strand
qcovs means Query Coverage Per Subject
qcovhsp means Query Coverage Per HSP
On this branch I am including the new salltitles field as the 25th column in the extended BLAST tabular output offered within the Galaxy interface:
https://github.com/peterjc/galaxy_blast/tree/c25
However, I'm not so sure about the taxonomy fields. Since (thus far) they are not available via the XML, I am leaning to introducing a third tabular mode, e.g.
* Standard 12 columns (can convert from XML)
* Extended 25 columns (can convert from XML)
* Extended also with taxonomy (cannot currently convert from XML)

Alternatively, we could offer a pick-your-own columns route (in all the primary BLAST tools, handled via macros):

* Standard 12 columns (can convert from XML)
* Extended 25 columns (can convert from XML)
* Pick your own columns from the full NCBI list (depending on columns, can convert from XML)
This is inspired by JJ's changes to the BLAST XML to tabular conversion tool for Galaxy-P, https://github.com/jmchilton/galaxy_blast/commit/d79afc03522768323494818a40a...
I would be much keener on the pick-your-own columns option if it were possible for the tool to record arbitrary column names for a tabular file in Galaxy's metadata (I can't find a Trello card, but I'm sure I've asked about this before).
Any thoughts or comments? e.g. Hurry up and just release this branch adding the hit descriptions as column 25 - we want that now ;) [*]
Regards,
Peter
[*] For our local instance, the taxonomy stuff will be useful, but right now I would prioritise the description, which we currently get via the BLAST XML using this tool: https://github.com/peterjc/galaxy_blast/tree/master/tools/blastxml_to_top_de...
-- James E. Johnson, Minnesota Supercomputing Institute, University of Minnesota
On Fri, Nov 29, 2013 at 3:26 PM, Jim Johnson <johns198@umn.edu> wrote:
Peter,
It's good news that the description field is available in tabular format. That has been the primary reason for using the xml format.
Yes.
The tabular format allows use of the task parallel feature.
We already support parallel sub-tasks with BLAST XML merge support - but yes this is simpler with tabular output.
I think you have good default options for output format.
Great. I'll probably merge the c25 branch and push that to the Test Tool Shed next week (adding the descriptions as a new 25th column to the default extended tabular output).
But, I would think we should offer the option to get all available result information. The multi-select checkboxes could serve that purpose.
OK, so adding a pick-your-own columns tabular output would be useful :)
Is NCBI good about maintaining the column position of their tabular over successive versions?
When picking the columns, the order is the order given at the command line itself - i.e. we would control that in the wrapper explicitly.

Regards,

Peter
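To illustrate the point about column order being fixed by the command line: BLAST+ takes the tabular format code followed by the field names in a single `-outfmt` argument, and writes the columns in that order. The sketch below is a rough Python illustration only; the helper names `build_outfmt` and `blastn_command` are hypothetical and are not taken from the Galaxy wrappers.

```python
# The standard 12 columns of BLAST+ tabular output (-outfmt 6), in order.
STANDARD_12 = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def build_outfmt(selected):
    """Build the value for -outfmt from an ordered list of field names.

    The output columns follow the order of `selected`, so whoever builds
    this string controls the column positions.
    """
    if not selected:
        raise ValueError("At least one column must be selected")
    return "6 " + " ".join(selected)


def blastn_command(query, database, selected):
    """Return an argument list suitable for subprocess (hypothetical helper)."""
    return [
        "blastn",
        "-query", query,
        "-db", database,
        "-outfmt", build_outfmt(selected),
    ]


# Example: descriptions placed straight after the two identifiers.
custom = build_outfmt(["qseqid", "sseqid", "salltitles", "evalue"])
command = blastn_command("query.fasta", "my_db", STANDARD_12)
```

Passing the arguments as a list keeps the whole format string as one argument, avoiding shell quoting problems with the embedded spaces.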
Hi,
On Fri, Nov 29, 2013 at 3:26 PM, Jim Johnson <johns198@umn.edu> wrote:
Peter,
It's good news that the description field is available in tabular format. That has been the primary reason for using the xml format.
Yes.
The tabular format allows use of the task parallel feature.
We already support parallel sub-tasks with BLAST XML merge support - but yes this is simpler with tabular output.
I think you have good default options for output format.
Great. I'll probably merge the c25 branch and push that to the Test Tool Shed next week (adding the descriptions as a new 25th column to the default extended tabular output).
+1

The 24 column output will be replaced, right?
But, I would think we should offer the option to get all available result information. The multi-select checkboxes could serve that purpose.
OK, so adding a pick-your-own columns tabular output would be useful :)
+1

Cheers,

Bjoern
Is NCBI good about maintaining the column position of their tabular over successive versions?
When picking the columns, the order is the order given at the command line itself - i.e. we would control that in the wrapper explicitly.
Regards,
Peter
On Sun, Dec 1, 2013 at 7:10 PM, Björn Grüning wrote:
Peter wrote:
Great. I'll probably merge the c25 branch and push that to the Test Tool Shed next week (adding the descriptions as a new 25th column to the default extended tabular output).
+1

The 24 column output will be replaced, right?
Yes, the extended output used to be 22 columns, now 24 columns, soon 25 columns.
But, I would think we should offer the option to get all available result information. The multi-select checkboxes could serve that purpose.
OK, so adding a pick-your-own columns tabular output would be useful :)
+1
Any thoughts on whether this should be the standard 12 columns plus a user selection, or a free selection of any columns (e.g. could omit most of the standard 12)?

Peter
Il 2013-12-02 13:08 Peter Cock ha scritto:
On Sun, Dec 1, 2013 at 7:10 PM, Björn Grüning wrote:
Peter wrote:
But, I would think we should offer the option to get all available result information. The multi-select checkboxes could serve that purpose.
OK, so adding a pick-your-own columns tabular output would be useful :)
+1
Any thoughts on if this should be the standard 12 columns plus a user selection, or a free selection of any columns (e.g. could omit most of the standard 12).
I'd go with free selection.

Best,

Nicola
On 12/2/13, 7:03 AM, Nicola Soranzo wrote:
Il 2013-12-02 13:08 Peter Cock ha scritto:
On Sun, Dec 1, 2013 at 7:10 PM, Björn Grüning wrote:
Peter wrote:
But, I would think we should offer the option to get all available result information. The multi-select checkboxes could serve that purpose.
OK, so adding a pick-your-own columns tabular output would be useful :)
+1
Any thoughts on if this should be the standard 12 columns plus a user selection, or a free selection of any columns (e.g. could omit most of the standard 12).
I'd go with free selection.
Best, Nicola

I concur.

JJ
-- James E. Johnson, Minnesota Supercomputing Institute, University of Minnesota
On Sun, Dec 1, 2013 at 7:10 PM, Björn Grüning <bjoern.gruening@pharmazie.uni-freiburg.de> wrote:
Hi,
On Fri, Nov 29, 2013 at 3:26 PM, Jim Johnson <johns198@umn.edu> wrote:
Peter,
It's good news that the description field is available in tabular format. That has been the primary reason for using the xml format.
Yes.
The tabular format allows use of the task parallel feature.
We already support parallel sub-tasks with BLAST XML merge support - but yes this is simpler with tabular output.
I think you have good default options for output format.
Great. I'll probably merge the c25 branch and push that to the Test Tool Shed next week (adding the descriptions as a new 25th column to the default extended tabular output).
+1
Done, see this and the preceding commits: https://github.com/peterjc/galaxy_blast/commit/eb1e522a864e5274a2d274a49fcb1...

And the Test Tool Shed repository has been updated: http://testtoolshed.g2.bx.psu.edu/view/peterjc/ncbi_blast_plus/8f9023b30384

We'll also be running this update locally for a little more testing, but I'd appreciate some of you also testing this, e.g. via the Test Tool Shed on your own local (development) Galaxy instances.

If there are no problems, that can probably go out to the main Tool Shed by the end of the week.

I'll look at the pick-your-own column output for the next revision of the NCBI BLAST+ wrappers.

Thanks,

Peter
On Mon, Dec 2, 2013 at 3:37 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Sun, Dec 1, 2013 at 7:10 PM, Björn Grüning <bjoern.gruening@pharmazie.uni-freiburg.de> wrote:
Great. I'll probably merge the c25 branch and push that to the Test Tool Shed next week (adding the descriptions as a new 25th column to the default extended tabular output).
+1
Done, see this and the preceding commits:
https://github.com/peterjc/galaxy_blast/commit/eb1e522a864e5274a2d274a49fcb1...
And the Test Tool Shed repository has been updated:
http://testtoolshed.g2.bx.psu.edu/view/peterjc/ncbi_blast_plus/8f9023b30384
We'll also be running this update locally for a little more testing, but I'd appreciate some of you also testing this, e.g. via the Test Tool Shed on your own local (development) Galaxy instances.
If there are no problems, that can probably go out to the main Tool Shed by the end of the week.
I pushed another update to the Test Tool Shed addressing a slight change in the autogenerated IDs used in the BLAST XML output, and an overlooked mention of 24 columns: http://testtoolshed.g2.bx.psu.edu/view/peterjc/ncbi_blast_plus/72170c3f515a

The functional tests are passing on the Test Tool Shed, so before I push this to the main Tool Shed, any other comments?
I'll look at the pick-your-own column output for the next revision of the NCBI BLAST+ wrappers.
I've been working on this using JJ's work on column picking in the BLAST XML to tabular tool - viewable here: https://github.com/jmchilton/galaxy_blast/commit/d79afc03522768323494818a40a...

See this branch updating and extending that work: https://github.com/peterjc/galaxy_blast/tree/blastxml_jj

I'm considering splitting the column picker into three: the standard 12 columns, the further 13 used in our extended 25 column output, and a third category for any other columns offered by BLAST+ (like the taxonomy columns added in BLAST+ 2.2.28).

Here's a screenshot showing the column picker offering the 25 columns, split in two:

[image: Inline image 1]

Note that (following JJ's earlier work) these all include the column numbers as used in the standard 12 column BLAST tabular output, or our extended 25 column output.

Note I would not intend to number the final set of "extra" columns.

Do people think this is helpful, or potentially confusing (since the column numbering will change in a custom selection)?

Peter
On 12/4/13, 10:21 AM, Peter Cock wrote:
On Mon, Dec 2, 2013 at 3:37 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Sun, Dec 1, 2013 at 7:10 PM, Björn Grüning <bjoern.gruening@pharmazie.uni-freiburg.de> wrote:
Great. I'll probably merge the c25 branch and push that to the Test Tool Shed next week (adding the descriptions as a new 25th column to the default extended tabular output).
+1
Done, see this and the preceding commits: https://github.com/peterjc/galaxy_blast/commit/eb1e522a864e5274a2d274a49fcb1...
And the Test Tool Shed repository has been updated: http://testtoolshed.g2.bx.psu.edu/view/peterjc/ncbi_blast_plus/8f9023b30384
We'll also be running this update locally for a little more testing, but I'd appreciate some of you also testing this, e.g. via the Test Tool Shed on your own local (development) Galaxy instances.
If there are no problems, that can probably go out to the main Tool Shed by the end of the week.
I pushed another update to the Test Tool Shed addressing a slight change in the autogenerated IDs used in the BLAST XML output, and an overlooked mention of 24 columns:
http://testtoolshed.g2.bx.psu.edu/view/peterjc/ncbi_blast_plus/72170c3f515a
The functional tests are passing on the Test Tool Shed, so before I push this to the main Tool Shed, any other comments?
I'll look at the pick-your-own column output for the next revision of the NCBI BLAST+ wrappers.
I've been working on this using JJ's work on column picking in the BLAST XML to tabular tool - viewable here: https://github.com/jmchilton/galaxy_blast/commit/d79afc03522768323494818a40a...
See this branch updating and extending that work: https://github.com/peterjc/galaxy_blast/tree/blastxml_jj
I'm considering splitting the column picker into three, the standard 12 columns, the further 13 used in our extended 25 column output, and a third category for any other columns offered by BLAST+ (like the taxonomy columns added in BLAST+ 2.2.28).
Here's a screenshot showing the column picker offering the 25 columns, split in two:
Inline image 1
Note that (following JJ's earlier work) these all include the column numbers as used in the standard 12 column BLAST tabular output, or our extended 25 column output.
Note I would not intend to number the final set of "extra" columns.
Do people think this is helpful, or potentially confusing (since the column numbering will change in a custom selection)?
Peter
Splitting into 3 sections seems like a good idea. That makes it easier to select groups of columns.

I also think labeling the column options with the column number in the standard tabular outputs will be useful to users.

JJ

-- James E. Johnson, Minnesota Supercomputing Institute, University of Minnesota
Il 2013-12-04 20:48 Jim Johnson ha scritto:
On 12/4/13, 10:21 AM, Peter Cock wrote:
On Mon, Dec 2, 2013 at 3:37 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Sun, Dec 1, 2013 at 7:10 PM, Björn Grüning <bjoern.gruening@pharmazie.uni-freiburg.de> wrote:
Great. I'll probably merge the c25 branch and push that to the Test Tool Shed next week (adding the descriptions as a new 25th column to the default extended tabular output).
+1
Done, see this and the preceding commits:
https://github.com/peterjc/galaxy_blast/commit/eb1e522a864e5274a2d274a49fcb1...
And the Test Tool Shed repository has been updated:
http://testtoolshed.g2.bx.psu.edu/view/peterjc/ncbi_blast_plus/8f9023b30384
We'll also be running this update locally for a little more testing, but I'd appreciate some of you also testing this, e.g. via the Test Tool Shed on your own local (development) Galaxy instances.
If there are no problems, that can probably go out to the main Tool Shed by the end of the week.
I pushed another update to the Test Tool Shed addressing a slight change in the autogenerated IDs used in the BLAST XML output, and an overlooked mention of 24 columns:
http://testtoolshed.g2.bx.psu.edu/view/peterjc/ncbi_blast_plus/72170c3f515a
The functional tests are passing on the Test Tool Shed, so before I push this to the main Tool Shed, any other comments?
I'll look at the pick-your-own column output for the next revision of the NCBI BLAST+ wrappers.
I've been working on this using JJ's work on column picking in the BLAST XML to tabular tool - viewable here:
https://github.com/jmchilton/galaxy_blast/commit/d79afc03522768323494818a40a...
See this branch updating and extending that work: https://github.com/peterjc/galaxy_blast/tree/blastxml_jj
I'm considering splitting the column picker into three, the standard 12 columns, the further 13 used in our extended 25 column output, and a third category for any other columns offered by BLAST+ (like the taxonomy columns added in BLAST+ 2.2.28).
Here's a screenshot showing the column picker offering the 25 columns, split in two:
Note that (following JJ's earlier work) these all include the column numbers as used in the standard 12 column BLAST tabular output, or our extended 25 column output.
Note I would not intend to number the final set of "extra" columns.
Do people think this is helpful, or potentially confusing (since the column numbering will change in a custom selection)?
Peter
Splitting into 3 sections seems like a good idea. That makes it easier to select groups of columns.
I also think labeling the column options with the column number in the standard tabular outputs will be useful to users.
I also like the splitting into 3 sections, but I think that the column numbering may be confusing and I would prefer that it is not included.

Nicola
Peter Cock wrote:
I've been working on this using JJ's work on column picking in the BLAST XML to tabular tool - viewable here:
...
I'm considering splitting the column picker into three, the standard 12 columns, the further 13 used in our extended 25 column output, and a third category for any other columns offered by BLAST+ (like the taxonomy columns added in BLAST+ 2.2.28).
Here's a screenshot showing the column picker offering the 25 columns, split in two:
Note that (following JJ's earlier work) these all include the column numbers as used in the standard 12 column BLAST tabular output, or our extended 25 column output.
Note I would not intend to number the final set of "extra" columns.
Do people think this is helpful, or potentially confusing (since the column numbering will change in a custom selection)?
Peter
Jim Johnson wrote:
Splitting into 3 sections seems like a good idea. That makes it easier to select groups of columns.
I also think labeling the column options with the column number in the standard tabular outputs will be useful to users.
Nicola Soranzo wrote:
I also like the splitting into 3 sections, but I think that the column numbering may be confusing and I would prefer it's not included.
Nicola
So we have agreement on splitting the columns into three groups (standard 12, extended 13, and the rest), but not on whether the first two groups should be numbered (as columns 1 to 25).

Next I need to look at the missing columns, and ideally see which can be automatically extracted from the XML for the conversion tool...

Peter
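For anyone wanting to picture the two labelling choices under discussion, here is a minimal Python sketch of the three groups. It is an assumption-laden illustration, not code from the branch: the extended 13 names are my reading of the wrapper's extended output and should be checked against the repository, the "other" group only lists fields named earlier in this thread, and `picker_options` and the `c1 = qseqid` label style are hypothetical.

```python
STANDARD_12 = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

# Assumed contents of the further 13 columns (verify against the wrapper).
EXTENDED_13 = [
    "sallseqid", "score", "nident", "positive", "gaps", "ppos",
    "qframe", "sframe", "qseq", "sseq", "qlen", "slen", "salltitles",
]

# Remaining fields mentioned in this thread; the full NCBI list is longer.
OTHER_COLUMNS = [
    "staxids", "sscinames", "scomnames", "sblastnames", "sskingdoms",
    "stitle", "sstrand", "qcovs", "qcovhsp",
]


def picker_options(numbered=True):
    """Return (group, field, label) tuples for a three-part column picker.

    With numbered=True the first two groups carry their position in the
    25 column output; the final "other" group is never numbered.
    """
    options = []
    position = 0
    for group, names in (("standard", STANDARD_12), ("extended", EXTENDED_13)):
        for name in names:
            position += 1
            label = "c%d = %s" % (position, name) if numbered else name
            options.append((group, name, label))
    for name in OTHER_COLUMNS:
        options.append(("other", name, name))
    return options


numbered_options = picker_options(numbered=True)
plain_options = picker_options(numbered=False)
```

The numbered labels only match the real output when the user keeps the default 12 or 25 column selection, which is the source of the possible confusion raised above.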
On Wed, Dec 4, 2013 at 4:21 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Mon, Dec 2, 2013 at 3:37 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Sun, Dec 1, 2013 at 7:10 PM, Björn Grüning <bjoern.gruening@pharmazie.uni-freiburg.de> wrote:
Great. I'll probably merge the c25 branch and push that to the Test Tool Shed next week (adding the descriptions as a new 25th column to the default extended tabular output).
+1
Done, see this and the preceding commits: https://github.com/peterjc/galaxy_blast/commit/eb1e522a864e5274a2d274a49fcb1...
And the Test Tool Shed repository has been updated: http://testtoolshed.g2.bx.psu.edu/view/peterjc/ncbi_blast_plus/8f9023b30384
We'll also be running this update locally for a little more testing, but I'd appreciate some of you also testing this, e.g. via the Test Tool Shed on your own local (development) Galaxy instances.
If there are no problems, that can probably go out to the main Tool Shed by the end of the week.
I pushed another update to the Test Tool Shed addressing a slight change in the autogenerated IDs used in the BLAST XML output, and an overlooked mention of 24 columns:
http://testtoolshed.g2.bx.psu.edu/view/peterjc/ncbi_blast_plus/72170c3f515a
The functional tests are passing on the Test Tool Shed, so before I push this to the main Tool Shed, any other comments?
Feedback is still welcome, but I have now pushed this to the main Tool Shed as v0.0.22 of the BLAST+ wrappers, which targets BLAST+ 2.2.28: http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus/4c4a0da938ff

(I realise now this ended up skipping v0.0.21, which updated the dependencies to wrap BLAST+ 2.2.27.)

Maybe with the next set of changes, like pick-your-own columns, I will bump the minor version number and call this v0.1.0 to avoid any possible confusion between the wrapper version and the underlying BLAST+ version - good plan? After all, it will cause some slight breakage if trying to re-run old jobs due to the introduction of a conditional block for the output format.

Peter
On Thu, Nov 28, 2013 at 5:13 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Hello all,
FAO: Administrators of local Galaxy instances using the NCBI BLAST+ wrappers.
Over on the galaxy_blast repository I have been updating the NCBI BLAST+ wrappers (including unit tests) to work with the current release, NCBI BLAST+ 2.2.28 (aka BLAST 2.2.28+): https://github.com/peterjc/galaxy_blast
The initial set of changes is now on the Test Tool Shed, http://testtoolshed.g2.bx.psu.edu/view/peterjc/ncbi_blast_plus
This includes a workaround for a known regression in the makeblastdb tool dealing with duplicated identifiers: https://github.com/peterjc/galaxy_blast/commit/349e31c6cec4429c5523fde5975e2...
For the record, this wasn't actually a regression, rather an old "bug" or corner case I'd not noticed before - BLAST+ lets you create databases with duplicate identifiers, which can give very confusing output: http://blastedbio.blogspot.co.uk/2013/12/blast-should-keep-its-blordid.html
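Since BLAST+ will accept such input, a quick check for repeated identifiers before building a database can save some confusion. The following is a minimal Python sketch of that idea; it is not the workaround used in the wrapper commit, and the function name `duplicate_fasta_ids` and the example records are invented for illustration.

```python
from collections import Counter


def duplicate_fasta_ids(lines):
    """Return a sorted list of identifiers that occur more than once.

    The identifier is taken as the first whitespace-delimited word after
    the '>' on each FASTA header line.
    """
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            if fields:
                counts[fields[0]] += 1
    return sorted(name for name, count in counts.items() if count > 1)


# Invented example records, purely for illustration.
example = [
    ">seq1 first copy\n", "ACGT\n",
    ">seq2\n", "GGCC\n",
    ">seq1 second copy\n", "TTAA\n",
]
duplicates = duplicate_fasta_ids(example)
```

The function accepts any iterable of lines, so an open file handle can be passed directly instead of a list.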
In terms of end-user features, the big improvement in the BLAST+ 2.2.28 release was the ability to get the BLAST match descriptions in the tabular output, and other fields: http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions....
...
This update is already on the main Galaxy Tool Shed, and we've been running it locally without any problems.

Regards,

Peter
participants (4)

* Björn Grüning
* Jim Johnson
* Nicola Soranzo
* Peter Cock