Full-text data library searching
Hi all;

This weekend I worked on full-text search of data libraries. Kimberly had mentioned it earlier on the list, and issue 418 is tracking the enhancement suggestion:

https://bitbucket.org/galaxy/galaxy-central/issue/418/extend-search-capabili...

I decided to tackle the hardest issue first -- full-text searching of data library items -- with the idea of putting a framework in place that could then be extended and finalized.

The approach separates indexing and search functionality from Galaxy itself; two configurable URLs are called:

fulltext_index_url = http://localhost:8090/index
fulltext_find_url = http://localhost:8090/find

The first gets passed a CSV file of identifiers and files to be indexed, and the second retrieves the IDs based on a search term. A small server uses Lucene on the backend to do all the full-text indexing and lookup:

https://github.com/chapmanb/kwd-doc-find

This is meant to be easy to set up and run, but a default Galaxy-only installation could also implement the index and search itself to provide much simpler functionality that searches against filenames or descriptions of library items. For a pure Galaxy default, Whoosh looks promising:

https://bitbucket.org/mchaput/whoosh/wiki/Home

On the Galaxy side, there are two patches. The first is a script that prepares a file for indexing and submits it to the index URL. This would be run from a cronjob to keep the indexes fresh:

https://bitbucket.org/chapmanb/galaxy-central/changeset/b47d1bfa52da

The second uses the search box in the top-level Data Library grid to do full-text searching of library items. It reuses all the display and permissions machinery, making adjustments to handle displaying a set of collected search result files:

https://bitbucket.org/chapmanb/galaxy-central/changeset/c038fd24cf48

This is working well here and scaled nicely to ~1000 items in our current data library.
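
As a rough illustration of the two-URL split described above, here is a minimal Python sketch of what a client could look like. It is not taken from the patches or from kwd-doc-find: the two-column CSV layout, the `term` query parameter, the one-identifier-per-line response, and all function names are assumptions made for the example.

```python
import csv
import io
import urllib.parse
import urllib.request

# URLs as given in the configuration above.
FULLTEXT_INDEX_URL = "http://localhost:8090/index"
FULLTEXT_FIND_URL = "http://localhost:8090/find"


def build_index_csv(items):
    """Return CSV text with one (identifier, file path) row per library item.

    The two-column layout is an assumption; the real script may differ.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for item_id, path in items:
        writer.writerow([item_id, path])
    return buf.getvalue()


def submit_for_indexing(items, index_url=FULLTEXT_INDEX_URL):
    """POST the CSV to the index URL, as a cron-driven script might."""
    data = build_index_csv(items).encode("utf-8")
    req = urllib.request.Request(
        index_url, data=data, headers={"Content-Type": "text/csv"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def build_find_url(term, find_url=FULLTEXT_FIND_URL):
    """Build the lookup URL; the 'term' parameter name is hypothetical."""
    return find_url + "?" + urllib.parse.urlencode({"term": term})


def find_ids(term, find_url=FULLTEXT_FIND_URL):
    """Return matching identifiers, assuming one identifier per response line."""
    with urllib.request.urlopen(build_find_url(term, find_url)) as resp:
        body = resp.read().decode("utf-8")
    return [line.strip() for line in body.splitlines() if line.strip()]
```

Because Galaxy only ever talks to the two URLs, the Lucene server could in principle be swapped for something simpler (a Whoosh-backed service, say) without touching the Galaxy side, as long as it accepts and returns the same shapes.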

I have several ideas for enhancements after this initial version, but thought I would first discuss with the Galaxy team to see if this is of interest and takes a reasonable approach. If so, the easiest working strategy would be for me to submit patches to the bug report that y'all could check and approve so I could stay in sync with galaxy-central as much as possible. The two above should apply cleanly now (with a couple of stray nglims configuration lines in the first; sorry) and we could build off of that.

Happy to hear any thoughts or feedback.

Thanks,
Brad
participants (1)
-
Brad Chapman