Hi all,
As I mentioned on Twitter, at the end of last week I wrapped Blast2GO for Galaxy, using the b2g4pipe program (Blast2GO for pipelines). See http://blast2go.org/
The current code is on bitbucket under my tools branch, https://bitbucket.org/peterjc/galaxy-central/src/tools/
Specifically files tools/ncbi_blast_plus/blast2go.* viewable here: https://bitbucket.org/peterjc/galaxy-central/src/tools/tools/ncbi_blast_plus...
I'm using a Galaxy location file, tool-data/blast2go.loc, to offer one or more Blast2GO configurations (properties files), mapping this to the -prop argument. This way you could have, for example, the Spanish Blast2GO server with its current database (May 2010), and a local Blast2GO database. I want to set up a local database and try this before submitting the wrapper to the Tool Shed.
The input to the tool is a BLAST XML file, specifically from blasting against a protein database like NR (so blastp or blastx, not blastn etc). I want to try some very large BLAST XML files to confirm b2g4pipe copes with the current BLAST+ output files - I gather there were some problems in this area in the past, so having the wrapper script fragment the XML might be a workaround. Currently the only real function of the wrapper script is to rename the output file - b2g4pipe insists on using the *.annot extension.
Right now the only output is a tabular three-column *.annot file, which can be loaded into the Blast2GO GUI. For analysis within Galaxy, I'm wondering about an option to split the first column (which holds the original FASTA query's identifier and any description) in two, i.e. split at the first white space to give the FASTA identifier, with any optional description as a separate column. That would make linking/joining/filtering on the ID much easier.
If anyone has any comments or feedback now, that would be welcome. Yesterday Alex Bossers indicated on Twitter that Gerrit had also been looking at this (CC'd).
Regards,
Peter
On Tue, May 31, 2011 at 11:26 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
If anyone has any comments or feedback now, that would be welcome. Yesterday Alex Bossers indicated on Twitter that Gerrit had also been looking at this (CC'd).
Apologies Alex, my memory was at fault - it was Peter van Heusden (@pvanheus not @a_bossers): http://twitter.com/#!/pvanheus/status/75121905962729472
Peter
On Tue, May 31, 2011 at 11:26 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
I'm using a Galaxy location file, tool-data/blast2go.loc, to offer one or more Blast2GO configurations (properties files), mapping this to the -prop argument. This way you could have, for example, the Spanish Blast2GO server with its current database (May 2010), and a local Blast2GO database. I want to set up a local database and try this before submitting the wrapper to the Tool Shed.
I've done that using the latest data from May 2011, so one year newer than the current default public Blast2GO database provided by the Blast2GO developers (dated May 2010). This required downloading some large data files and importing them into a MySQL database, which took about 24 hours to process in all. The last step is done via their Java tool and took most of the time. For more details, see: http://blast2go.org/localgodb
However, the end result is worth the effort, as running Blast2GO is now much faster. I need to try it on some larger files, but we're talking at least an order of magnitude quicker, maybe two.
The input to the tool is a BLAST XML file, specifically blasting against a protein database like NR (so blastp or blastx, not blastn etc). I want to try some very large BLAST XML files to confirm b2g4pipe copes with the current BLAST+ output files - I gather there were some problems in this area in the past, so having the wrapper script fragment the XML might be a workaround. Currently the only real function of the wrapper script is to rename the output file - b2g4pipe insists on using the *.annot extension.
Now that I have a local Blast2GO database, I'll be able to try out b2g4pipe on some bigger XML files (without having to wait ages).
Right now the only output is a tabular three column *.annot file, which can be loaded into the Blast2GO GUI. For analysis within Galaxy, I'm wondering about an option to split the first column (which holds the original FASTA query's identifier and any description) in two. i.e. Split at the first white space to give the FASTA identifier, and any optional description as a separate column. That would make linking/joining/filtering on the ID much easier.
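As an illustration of the column split proposed above, here is a minimal Python sketch. It is not taken from the wrapper; the function name split_first_column is hypothetical, and it only assumes what is described above: tab-separated lines whose first field is the FASTA identifier optionally followed by white space and a description.

```python
def split_first_column(line):
    """Split the first tab-separated field at the first run of white space.

    Returns the line with one extra column: identifier, description
    (empty if there was none), then the remaining original columns.
    """
    fields = line.rstrip("\n").split("\t")
    # split(None, 1) splits once, at the first run of white space
    parts = fields[0].split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return "\t".join([identifier, description] + fields[1:])
```

An identifier with no description yields an empty second column, so every row keeps the same number of columns, which is what joining or filtering on the ID would need.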
If anyone has any comments or feedback now, that would be welcome.
I've submitted the initial version to the Galaxy Tool Shed now, but would still welcome feedback and would even consider non-backwards compatible changes in the short term.
Peter
On Thu, Jun 2, 2011 at 10:53 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Now that I have a local Blast2GO database, I'll be able to try out b2g4pipe on some bigger XML files (without having to wait ages).
It turns out that Blast2GO haven't yet fixed a Java heap space problem in their parser when reading large NCBI BLAST+ XML files. Their suggested workaround, as of Oct 2009, is to split the files or reformat them to look like the output of older versions of BLAST. I've updated the wrapper to do this, and tested it on a 700MB BLAST XML file. The update is now on the Tool Shed.
Further work (after discussion with Gerrit) to support the optional Blast2GO project file (*.dat) and an optional InterProScan XML input file will ideally need extra file formats defined in Galaxy, hence this thread: http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-June/005620.html
Peter
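For anyone curious what the splitting workaround mentioned above might look like, here is a rough Python sketch. It is not the wrapper's actual code. The function name split_blast_xml and the per_chunk parameter are hypothetical, and it assumes the <Iteration>, </Iteration> and </BlastOutput_iterations> tags each sit on their own line, as in typical NCBI BLAST+ XML output.

```python
FOOTER = "</BlastOutput_iterations>\n</BlastOutput>\n"


def split_blast_xml(lines, per_chunk):
    """Yield XML strings, each holding at most per_chunk <Iteration> blocks.

    lines is any iterable of text lines (e.g. an open file handle), so
    only the header and the current chunk are held in memory.
    """
    header = []  # everything before the first <Iteration>
    body = []    # lines of the iterations collected for the current chunk
    count = 0
    in_header = True
    for line in lines:
        stripped = line.strip()
        if in_header:
            if stripped != "<Iteration>":
                header.append(line)
                continue
            in_header = False
        if stripped == "</BlastOutput_iterations>":
            break  # reached the original footer; we write our own
        body.append(line)
        if stripped == "</Iteration>":
            count += 1
            if count == per_chunk:
                yield "".join(header + body) + FOOTER
                body = []
                count = 0
    if body:
        # final, possibly smaller, chunk
        yield "".join(header + body) + FOOTER
```

Each chunk repeats the original header, so it should be a complete BLAST XML document in its own right and could be passed to b2g4pipe separately, with the resulting *.annot files concatenated afterwards.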