NCBI BLAST+ wrappers in Galaxy?
On Thu, Sep 9, 2010 at 11:29 AM, Peter <peter@maubp.freeserve.co.uk> wrote:
Hi all,
Something I expect to find useful in several analysis pipelines is a Galaxy wrapper for the NCBI BLAST+ tools (or even the old NCBI "legacy" BLAST tools if such a wrapper exists).
I've been looking over the tools in galaxy-dist and galaxy-central and the only NCBI BLAST wrapper I can see is for MEGABLAST, under tools/metag_tools.
Are there more general NCBI BLAST+ wrappers that I have missed? Or is anyone already working on this?
Thanks,
Peter
Hi all,
I met Björn Grüning (CC'd) from the University of Freiburg Pharmaceutical Bioinformatics at a workshop last week, and they had a few simple BLAST+ wrappers set up. If I recall correctly, all their databases were nucleotide databases.
For configuration, Björn re-used the existing blastdb.loc file that comes with Galaxy for the NGS megablast_wrapper tool. However, that (currently) only holds nucleotide BLAST databases, and we would need separate lists for nucleotide, protein, and RPS-BLAST protein domain databases.
I would suggest either:
(a) Add new loc files specific to proteins and rpsblast
or:
(b) Extend the blastdb.loc format to include a fourth column giving the database type (which can default to nucleotide).
What would the Galaxy team prefer?
Thanks,
Peter
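To make option (b) concrete, here is a minimal sketch of a loc-file reader in which a missing fourth column defaults to nucleotide. It assumes the existing blastdb.loc holds three tab-separated columns (taken here to be an identifier, a display name and a path); the function name, the dictionary keys and the sample paths are hypothetical and not part of Galaxy.

```python
def parse_blastdb_loc(lines, default_type="nucleotide"):
    """Parse loc-file lines into dicts, tolerating an optional 4th column.

    Assumption: columns are tab-separated; lines starting with '#' are comments.
    """
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines rather than guessing
        # Old three-column lines fall back to the default database type.
        db_type = default_type
        if len(fields) > 3 and fields[3].strip():
            db_type = fields[3].strip()
        entries.append({
            "id": fields[0],
            "name": fields[1],
            "path": fields[2],
            "type": db_type,
        })
    return entries


# Hypothetical sample content: one old-style line, one with a type column.
sample_lines = [
    "# id<TAB>name<TAB>path<TAB>type (optional)\n",
    "nt\tNCBI nt\t/data/blast/nt\n",
    "nr\tNCBI nr\t/data/blast/nr\tprotein\n",
]
loc_entries = parse_blastdb_loc(sample_lines)
```

With this approach existing three-column files keep working unchanged, which is the backward-compatibility concern that option (a) addresses by using separate files instead.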
Hi Peter,
I think a separate loc file for proteins makes sense, easier to maintain backward compatibility that way. Only speaking for myself though.
On Sep 20, 2010, at 5:55 AM, Peter wrote:
Hi all,
I met Björn Grüning (CC'd) from the University of Freiburg Pharmaceutical Bioinformatics at a workshop last week, and they had a few simple BLAST+ wrappers set up. If I recall correctly, all their databases were nucleotide databases.
For configuration, Björn re-used the existing blastdb.loc file that comes with Galaxy for the NGS megablast_wrapper tool. However, that (currently) only holds nucleotide BLAST databases - and we would need to have separate lists for nucleotide, protein, and RPS-BLAST protein domain databases.
I would suggest either:
(a) Add new loc files specific to proteins and rpsblast
or:
(b) Extend the blastdb.loc format to include a fourth column giving the database type (which can default to nucleotide).
What would the Galaxy team prefer?
Thanks,
Peter
_______________________________________________ galaxy-dev mailing list galaxy-dev@lists.bx.psu.edu http://lists.bx.psu.edu/listinfo/galaxy-dev
-- jt
James Taylor
Assistant Professor
Department of Biology
Department of Mathematics & Computer Science
Emory University
On Mon, Sep 20, 2010 at 6:00 PM, James Taylor <james@jamestaylor.org> wrote:
Hi Peter, I think a separate loc file for proteins makes sense, easier to maintain backward compatibility that way. Only speaking for myself though.
Hi James,
That works for me. So we keep blastdb.loc as a list of nucleotide-only databases, and introduce new files, perhaps named blastp_db.loc and rpsblast_db.loc (or maybe blastdb_p.loc and blastdb_rps.loc - I don't mind), for protein and RPS-BLAST databases respectively.
Assuming I produce something generally useful for contribution to the project, would the best route be:
(a) a patch
(b) an hg branch from http://bitbucket.org/galaxy/galaxy-central
(c) an hg branch from http://bitbucket.org/galaxy/galaxy-dist
Peter
A patch is probably best, or a fork of central. We don't integrate anything directly into dist. For tools, there is also the community site: usegalaxy.org/community
Thanks!
On Sep 21, 2010, at 5:21 AM, Peter wrote:
Hi James,
That works for me. So we keep blastdb.loc as a list of nucleotide only databases, and introduce new files perhaps named blastp_db.loc and rpsblast_db.loc (or maybe blastdb_p.loc and blastdb_rps.loc - I don't mind) for protein and RPS-BLAST databases respectively.
Assuming I produce something generally useful for contribution to the project, would the best route be:
(a) a patch
(b) an hg branch from http://bitbucket.org/galaxy/galaxy-central
(c) an hg branch from http://bitbucket.org/galaxy/galaxy-dist
Peter
On Tue, Sep 21, 2010 at 12:45 PM, James Taylor <james@jamestaylor.org> wrote:
A patch is probably best, or a fork of central. We don't integrate anything directly into dist. For tools, there is also the community site: usegalaxy.org/community
I have a query about the existing Megablast wrapper, Python code here: http://bitbucket.org/galaxy/galaxy-central/src/tip/tools/metag_tools/megabla...
Looking at the above, it is clearly trying to call the command line tool 'megablast', which is part of the NCBI 'legacy' BLAST suite. This is replaced by the command line tool 'blastn' in the new NCBI BLAST+ suite (the default 'task' parameter is megablast).
Currently the wiki instructions appear to be wrong, quoting:
Megablast installation
Megablast is a part of the BLAST+ suite of tools. To download it, go to the Megablast page and go to the download link. Select the BLAST+ file appropriate to your platform, noting that Galaxy uses version 2.2.22 currently. There is some information about installation in the BLAST+ user manual, available from the download page.
Quoted from http://bitbucket.org/galaxy/galaxy-central/wiki/NGSLocalSetup
Have I misunderstood? Perhaps the script expects to be able to call 'megablast' via legacy_blast.pl, but most likely the documentation is out of sync and the Galaxy servers have both BLAST and BLAST+ installed.
I think it would make sense to update megablast_wrapper.py to call the BLAST+ command line tool blastn instead of the legacy BLAST tool megablast... would that change be welcome?
Peter
On Tue, Sep 21, 2010 at 2:13 PM, Peter <peter@maubp.freeserve.co.uk> wrote:
I have a query about the existing Megablast wrapper, Python code here: http://bitbucket.org/galaxy/galaxy-central/src/tip/tools/metag_tools/megabla...
Looking at the above, it is clearly trying to call the command line tool 'megablast' which is part of the NCBI 'legacy' BLAST suite. This is replaced by the command line tool 'blastn' in the new NCBI BLAST+ suite (the default 'task' parameter is megablast).
Currently the wiki instructions appear to be wrong, quoting:
Megablast installation
Megablast is a part of the BLAST+ suite of tools. To download it, go to the Megablast page and go to the download link. Select the BLAST+ file appropriate to your platform, noting that Galaxy uses version 2.2.22 currently. There is some information about installation in the BLAST+ user manual, available from the download page.
Quoted from http://bitbucket.org/galaxy/galaxy-central/wiki/NGSLocalSetup
Have I misunderstood? Perhaps the script expects to be able to call 'megablast' via legacy_blast.pl, but most likely the documentation is out of sync and the Galaxy servers have both BLAST and BLAST+ installed.
I think it would make sense to update megablast_wrapper.py to call the BLAST+ command line tool blastn instead of the legacy BLAST tool megablast... would that change be welcome?
Here is a fork of galaxy-central which updates megablast_wrapper.py to actually use BLAST+, which seems to work for me:
http://bitbucket.org/peterjc/galaxy-central/changeset/ff54cf59749d
Follow-up change to update both the XML and py files to use the new BLAST+ arguments for the filter (yes/no instead of T/F) and update the list of columns in the documentation:
https://bitbucket.org/peterjc/galaxy-central/changeset/71e6e7db6bea
https://bitbucket.org/peterjc/galaxy-central/changeset/2efff78a82de
Note that 'legacy' BLAST 2.2.22 and BLAST+ 2.2.24 both output 12 columns in tabular mode, so I think the old XML wrapper documentation about 12 columns is wrong or was at least out of date.
These updates are on my branch 'megablast'.
Could someone review these changes for possible inclusion in Galaxy? Would you prefer me to prepare a single patch file?
Regards,
Peter
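As an illustration of the switch described above, the following sketch builds a BLAST+ blastn command line for the megablast task and translates the legacy T/F filter value into the yes/no form. It is not the code from the linked changesets; the function names are hypothetical, and mapping the legacy filter onto the blastn -dust option is an assumption about how a wrapper might do it.

```python
def legacy_filter_to_blastplus(flag):
    """Translate a legacy BLAST filter value (T/F) to the BLAST+ form (yes/no)."""
    mapping = {"T": "yes", "F": "no"}
    try:
        return mapping[flag.strip().upper()]
    except KeyError:
        raise ValueError("Expected 'T' or 'F', got %r" % flag)


def blastn_megablast_command(query, database, output, legacy_filter="T"):
    """Return an argument list for blastn running the megablast task.

    Assumption: tabular output (-outfmt 6) and DUST filtering (-dust)
    are what the wrapper wants; adjust to the real tool definition.
    """
    return [
        "blastn",
        "-task", "megablast",
        "-query", query,
        "-db", database,
        "-out", output,
        "-outfmt", "6",  # default 12 column tabular output
        "-dust", legacy_filter_to_blastplus(legacy_filter),
    ]


# Hypothetical file and database names, for illustration only.
example_cmd = blastn_megablast_command("reads.fasta", "nt", "hits.tabular", "F")
```

Returning a list rather than a single string means the result can be passed directly to subprocess without shell quoting issues.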
Peter, this is great and we will look at it. The main thing I want to think about is does this affect reproducibility in any way. We may want to keep the old tool, and have another tool for the NCBI version (I'd love to see a complete set of wrappers for NCBI blast+, which we could include with our cloud images right away).
Thanks!
-- jt
James Taylor
Assistant Professor
Department of Biology
Department of Mathematics & Computer Science
Emory University
On Sep 21, 2010, at 11:12 AM, Peter wrote:
Here is a fork of galaxy-central which updates megablast_wrapper.py to actually use BLAST+, which seems to work for me:
http://bitbucket.org/peterjc/galaxy-central/changeset/ff54cf59749d
Follow up change to update both the XML and py files to use the new BLAST+ arguments for the filter (yes/no instead of T/F) and update the list of columns in the documentation:
https://bitbucket.org/peterjc/galaxy-central/changeset/71e6e7db6bea https://bitbucket.org/peterjc/galaxy-central/changeset/2efff78a82de
Note that 'legacy' BLAST 2.2.22 and BLAST+ 2.2.24 both output 12 columns in tabular mode, so I think the old XML wrapper documentation about 12 columns is wrong or was at least out of date.
These updates are on my branch 'megablast'.
Could someone review these changes for possible inclusion in Galaxy? Would you prefer me to prepare a single patch file?
Regards,
Peter
On Tue, Sep 21, 2010 at 4:14 PM, James Taylor <james@jamestaylor.org> wrote:
Peter, this is great and we will look at it. The main thing I want to think about is does this affect reproducibility in any way. We may want to keep the old tool, and have another tool for the NCBI version
Sure. By old tool I'm assuming you mean the NCBI legacy BLAST?
(I'd love to see a complete set of wrappers for NCBI blast+, which we could include with our cloud images right away). Thanks!
I'm working on it - we want to use it on our local server too. Peter
On Tue, Sep 21, 2010 at 4:35 PM, Peter <peter@maubp.freeserve.co.uk> wrote:
On Tue, Sep 21, 2010 at 4:14 PM, James Taylor <james@jamestaylor.org> wrote:
(I'd love to see a complete set of wrappers for NCBI blast+, which we could include with our cloud images right away). Thanks!
I'm working on it - we want to use it on our local server too.
I have a blastplus branch here, currently just minimal wrappers for blastn and blastp: http://bitbucket.org/peterjc/galaxy-central
Early feedback is welcome - I'd like this to follow Galaxy conventions and be taken up as part of the default installation. Before I do lots of work defining most of the parameters, I started a thread asking how to share definitions between wrapper XML files: http://lists.bx.psu.edu/pipermail/galaxy-dev/2010-September/003371.html
I've also written and wrapped a very simple script to split a FASTA file into records with and without BLAST hits - something I plan to use in some simple workflows later on.
I plan to add BLAST ASN1 as a new format, and wrap the blast_formatter application added in BLAST 2.2.24+ to turn this into blastxml, plain text, tabular etc.
There are other things I'd like to add like blastxml to tabular conversion. In this case I'd like to use the Biopython BLAST XML parser - is adding Biopython as a Galaxy dependency going to be a problem? You already have numpy which is the main dependency of Biopython (there are others - all optional).
Regards,
Peter
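The FASTA splitting idea mentioned above could look roughly like the sketch below. This is not Peter's script; it assumes the BLAST results are in the default tabular format with the query identifier in the first column, and that a FASTA record's identifier is the first word after '>'. All names are hypothetical.

```python
def read_query_ids(tabular_lines):
    """Collect query identifiers (first column) from BLAST tabular output."""
    ids = set()
    for line in tabular_lines:
        if line.strip() and not line.startswith("#"):
            ids.add(line.split("\t", 1)[0].strip())
    return ids


def split_fasta_by_hits(fasta_lines, hit_ids):
    """Split FASTA lines into (records with hits, records without hits)."""
    with_hits, without_hits = [], []
    target = None  # lines before the first '>' header are ignored
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            record_id = words[0] if words else ""
            target = with_hits if record_id in hit_ids else without_hits
        if target is not None:
            target.append(line)
    return with_hits, without_hits


# Tiny made-up example: only query "a" has a hit.
fasta = [">a first read\n", "ACGT\n", ">b\n", "GGCC\n", "TT\n"]
tabular = ["a\tsubj1\t100.00\t4\t0\t0\t1\t4\t1\t4\t0.001\t8.0\n"]
matched, unmatched = split_fasta_by_hits(fasta, read_query_ids(tabular))
```

Working on lists of lines keeps the sketch independent of Biopython; a real tool would stream from file handles instead.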
Hi Peter,
Nice work. We are working on some general tool wrappers as well (blat, mummer, etc.), which have the same difficulty: a lot to maintain if something changes. So I am also following your thread on the shared XML parts with great interest.
Regarding the BLAST XML to table conversion: it's already in your distribution for megablast at metag_tools/megablast_xml_parser.xml. There is also a basic wrapper for megablast.
Keep up the good work!
Alex
On Tue, Sep 28, 2010 at 3:43 PM, Bossers, Alex <Alex.Bossers@wur.nl> wrote:
Hi Peter, Nice work. We are working on some general tool wrappers as well (blat, mummer, etc) which have the same difficulty of lots to maintain if something changes...So I also follow your thread on the shared XML parts with great interest.
Nice to know this use case (BLAST) isn't a special case.
Regarding the blast xml to table. It's already in your distribution for megablast at metag_tools/megablast_xml_parser.xml; there is also a basic wrapper for megablast.
I'd seen the megablast wrapper (currently for legacy NCBI BLAST, not BLAST+, as discussed earlier in the thread). The metag_tools/megablast_xml_parser.xml script is close to what I had in mind, but not quite the same: I wanted to reproduce the default 12 column tabular output from the BLAST+ tools from the XML output. My thinking was we'd have lots of tools designed to work with the default tabular output from BLAST+, so an option to go to XML if needed for some steps and recover the tabular output later would be nice. [And similarly for the ASN.1 output.] I guess here (and in general) there is scope for supporting all the tab fields which the NCBI command line tools support... I see Galaxy has some clever metadata for tracking different columns in interval data types - we'd need to do something similar for different BLAST columns. That is going to be more work of course, and not something I need immediately. Peter
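A rough sketch of recovering the 12 default tabular columns from BLAST XML is shown below, using only the standard library rather than Biopython. It assumes the usual BLAST XML element names (Iteration_query-def, Hit_id, Hsp_qseq and so on), takes the first word of the query definition as the query identifier, and counts mismatches and gap openings from the aligned sequences. The number formatting is not guaranteed to match what the BLAST+ tools print, and the sample XML is made up.

```python
import xml.etree.ElementTree as ET


def count_gap_openings(aligned_seq):
    """Count runs of '-' characters in one aligned sequence."""
    openings, in_gap = 0, False
    for char in aligned_seq:
        if char == "-" and not in_gap:
            openings += 1
        in_gap = (char == "-")
    return openings


def blastxml_to_rows(xml_text):
    """Return one list of 12 string fields per HSP found in the XML."""
    rows = []
    root = ET.fromstring(xml_text)
    for iteration in root.iter("Iteration"):
        words = iteration.findtext("Iteration_query-def", default="").split()
        qseqid = words[0] if words else ""
        for hit in iteration.iter("Hit"):
            sseqid = hit.findtext("Hit_id", default="")
            for hsp in hit.iter("Hsp"):
                qseq = hsp.findtext("Hsp_qseq", default="").upper()
                hseq = hsp.findtext("Hsp_hseq", default="").upper()
                length = int(hsp.findtext("Hsp_align-len"))
                identity = int(hsp.findtext("Hsp_identity"))
                # Mismatches: aligned positions without a gap that differ.
                mismatch = sum(1 for q, h in zip(qseq, hseq)
                               if q != "-" and h != "-" and q != h)
                gapopen = count_gap_openings(qseq) + count_gap_openings(hseq)
                rows.append([
                    qseqid, sseqid,
                    "%.2f" % (100.0 * identity / length),
                    str(length), str(mismatch), str(gapopen),
                    hsp.findtext("Hsp_query-from"), hsp.findtext("Hsp_query-to"),
                    hsp.findtext("Hsp_hit-from"), hsp.findtext("Hsp_hit-to"),
                    hsp.findtext("Hsp_evalue"), hsp.findtext("Hsp_bit-score"),
                ])
    return rows


SAMPLE_XML = """<BlastOutput><BlastOutput_iterations><Iteration>
<Iteration_query-def>q1 test query</Iteration_query-def>
<Iteration_hits><Hit><Hit_id>s1</Hit_id><Hit_hsps><Hsp>
<Hsp_bit-score>18.3</Hsp_bit-score><Hsp_evalue>0.01</Hsp_evalue>
<Hsp_query-from>1</Hsp_query-from><Hsp_query-to>9</Hsp_query-to>
<Hsp_hit-from>1</Hsp_hit-from><Hsp_hit-to>10</Hsp_hit-to>
<Hsp_identity>8</Hsp_identity><Hsp_align-len>10</Hsp_align-len>
<Hsp_qseq>ACGT-ACGTA</Hsp_qseq><Hsp_hseq>ACGTTACGAA</Hsp_hseq>
</Hsp></Hit_hsps></Hit></Iteration_hits></Iteration>
</BlastOutput_iterations></BlastOutput>"""

sample_rows = blastxml_to_rows(SAMPLE_XML)
```

Each row can then be written out with "\t".join(row). Supporting the other tabular fields the NCBI tools offer would mean extending the list built for each HSP.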
On Tue, Sep 28, 2010 at 3:07 PM, Peter <peter@maubp.freeserve.co.uk> wrote:
I have a blastplus branch here, currently just minimal wrappers for blastn and blastp: http://bitbucket.org/peterjc/galaxy-central
I've extended that to cover the five main BLAST flavours by adding blastx, tblastn and tblastx. The order and descriptions should match that used on the NCBI BLAST webserver. Still to do: RPS-BLAST and PSI-BLAST, and of course most of the optional parameters.
Creating (small) databases from FASTA files seems potentially useful (i.e. wrapping the BLAST+ tool makeblastdb which replaces formatdb), but as these are made up of several files I need to know more about how Galaxy works before tackling that.
Alternatively, the BLAST+ feature of FASTA file versus FASTA file could be handy (using -subject instead of -db). I think I can see how to handle this... but it would complicate supporting other parameters.
Peter
On Tue, Sep 28, 2010 at 4:52 PM, Peter <peter@maubp.freeserve.co.uk> wrote:
Creating (small) databases from FASTA files seems potentially useful (i.e. wrapping the BLAST+ tool makeblastdb which replaces formatdb) but as these are made up of several files I need to know more about how Galaxy works before tackling that.
Alternatively, the BLAST+ feature of FASTA file versus FASTA file could be handy (using -subject instead of -db). I think I can see how to handle this... but it would complicate supporting other parameters.
I've done the latter now: http://bitbucket.org/peterjc/galaxy-central/changeset/17b2cb598b5e
Peter
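The choice between a pre-built database and a subject FASTA file can be sketched as a small helper that emits either -db or -subject, never both. This is only an illustration of the idea, not the code in the changeset above; the function and parameter names are hypothetical.

```python
def blast_target_args(database=None, subject_fasta=None):
    """Return the BLAST+ arguments selecting what to search against.

    Exactly one of the two inputs must be given, since a search runs
    against either a database (-db) or a FASTA file (-subject).
    """
    if (database is None) == (subject_fasta is None):
        raise ValueError("Give exactly one of database or subject_fasta")
    if database is not None:
        return ["-db", database]
    return ["-subject", subject_fasta]


# Hypothetical inputs: a named database versus a second FASTA file.
db_args = blast_target_args(database="nt")
subject_args = blast_target_args(subject_fasta="other.fasta")
```

A wrapper could append the returned list to the rest of the command line, keeping the remaining parameter handling identical in both cases.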
Just FYI,
In related news, Bill Pearson's latest release of the FASTA suite includes support for BLAST+ like tabular output files. I guess wrappers for these in Galaxy would also be nice to have.
Peter
---------- Forwarded message ----------
From: William Pearson <wrp@virginia.edu>
Date: Fri, Oct 1, 2010 at 6:26 PM
Subject: fasta-36.2.7 available
To: fasta_list@virginia.edu
The latest version of the FASTA36 package, fasta-36.2.7, is available from: http://faculty.virginia.edu/wrpearson/fasta/fasta36/fasta-36.2.7.tar.gz
A Mac OSX universal binary is also available.
This version of FASTA36 fixes a bug in the fastx36(_t) sub-alignment code that could cause the library sequence to be modified. In addition, it fixes some problems with sub-alignments that occurred with very short query sequences. There have also been some minor output format changes to reduce presentation of redundant information when multiple sub-alignments are shown. The "-L" long library descriptions option has been re-enabled, and changing the -E threshold no longer disables multiple sub-alignments.
The major new feature in this version is the introduction of two BLAST+ compatible tabular output formats: -m 8 (BLAST+ tabular output, equivalent to BLAST+ -outfmt=6) and -m 8C (BLAST+ -outfmt=7). In addition, the FASTA36 programs can be compiled to read the sequence database only once, and then compare multiple query sequences to the library held in memory (see comments in doc/readme.v36 and make/Makefile36m.common). Holding the library in memory allows the program to scale very efficiently on large multi-core computers (>40X speedup on 48 cores).
This version of the program has not been thoroughly tested under MPI.
As always, let me know about problems.
Bill Pearson
On Tue, Oct 5, 2010 at 2:47 PM, Peter <peter@maubp.freeserve.co.uk> wrote:
Just FYI,
In related news, Bill Pearson's latest release of the FASTA suite includes support for BLAST+ like tabular output files. I guess wrappers for these in Galaxy would also be nice to have.
Peter
I've started looking at that now, http://bitbucket.org/peterjc/galaxy-central/src/pearson_fasta Peter
On Tue, Sep 21, 2010 at 4:14 PM, James Taylor <james@jamestaylor.org> wrote:
Peter, this is great and we will look at it. The main thing I want to think about is does this affect reproducibility in any way. We may want to keep the old tool, and have another tool for the NCBI version (I'd love to see a complete set of wrappers for NCBI blast+, which we could include with our cloud images right away). Thanks!
-- jt
Hi James et al.
Could you or someone from the Galaxy team take a look at my wrappers for blastn, blastp, blastx, tblastn and tblastx and the BLAST XML to tabular converter for possible inclusion in galaxy-central?
http://bitbucket.org/peterjc/galaxy-central/src/blastplus
The BLAST+ suite is so big and has so many options that this is by no means a "complete set of wrappers" but it covers the immediate core functionality that I expect to need personally.
Thanks,
Peter
On Mon, Oct 11, 2010 at 4:06 PM, Peter <peter@maubp.freeserve.co.uk> wrote:
Hi James et al.
Could you or someone from the Galaxy team take a look at my wrappers for blastn, blastp, blastx, tblastn and tblastx and the BLAST XML to tabular converter for possible inclusion in galaxy-central?
http://bitbucket.org/peterjc/galaxy-central/src/blastplus
The BLAST+ suite is so big and has so many options that this is by no means a "complete set of wrappers" but it covers the immediate core functionality that I expect to need personally.
Thanks,
Peter
P.S. One thing this is lacking is unit tests. I worry that these could be specific to the version of BLAST+ and the version of any database installed. Ultimately the reference platform here is the "official" public Galaxy server, right? Using the -subject feature we can BLAST one file against another which avoids the database version issue. Peter
participants (3)
- Bossers, Alex
- James Taylor
- Peter