On Thu, Nov 18, 2010 at 5:46 PM, Peter <peter@maubp.freeserve.co.uk> wrote:
Hi all,
Something I want to do in several of my workflows is to filter a FASTA file (or potentially sequence files in other formats) using a list of desired identifiers (e.g. a column from a tabular file).
Right now I can achieve this with three steps in Galaxy. Suppose I have:
Dataset #1, FASTA file
Dataset #2, Tabular file with identifiers of interest (e.g. BLAST hits, or filtered output from a sequence analysis tool)
Then:
1. Create tabular Dataset #3 using FASTA-to-tabular on Dataset #1, subject to the enhancement proposed here: http://lists.bx.psu.edu/pipermail/galaxy-dev/2010-November/003717.html
2. Create tabular Dataset #4 using join on Datasets #2 and #3 using the matched identifier columns. This does the filtering.
3. Create FASTA Dataset #5 using tabular-to-FASTA on Dataset #4.
This works (at least for reasonably sized datasets), but requires three steps and the creation of at least two temporary files.
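
For readers who want to see the shape of that workflow outside Galaxy, the three steps can be sketched in plain Python. This is only an illustration, not the code of the Galaxy tools themselves: the function names are hypothetical, the identifier is assumed to be the first whitespace-delimited word of each FASTA header, and the join is simplified to a membership test (so repeated identifiers in the table do not duplicate records).

```python
def fasta_to_tabular(fasta_lines):
    """Step 1: turn FASTA records into (identifier, description, sequence) rows."""
    rows = []
    ident, descr, seq = None, "", []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if ident is not None:
                rows.append((ident, descr, "".join(seq)))
            # Assumption: the identifier is the first word of the header line.
            header = line[1:].split(None, 1)
            ident = header[0] if header else ""
            descr = header[1] if len(header) > 1 else ""
            seq = []
        elif ident is not None:
            seq.append(line.strip())
    if ident is not None:
        rows.append((ident, descr, "".join(seq)))
    return rows


def join_on_id(id_rows, fasta_rows, id_column=0):
    """Step 2: keep only the FASTA rows whose identifier appears in id_rows."""
    wanted = {row[id_column] for row in id_rows}
    return [row for row in fasta_rows if row[0] in wanted]


def tabular_to_fasta(rows):
    """Step 3: write the surviving rows back out as FASTA text."""
    out = []
    for ident, descr, seq in rows:
        out.append(">" + ident + (" " + descr if descr else ""))
        out.append(seq)
    return ("\n".join(out) + "\n") if out else ""
```

Each function corresponds to one intermediate dataset, which is exactly the overhead the proposal below aims to remove.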
I'd like to introduce another tool under "FASTA manipulation" to do it in one step (rather than three). Am I going against the apparent Galaxy ideal that complex manipulations should be done with tabular files? Would such a FASTA filter tool be of interest to add directly to Galaxy (e.g. under the "FASTA manipulation" section), or would it be better placed on the community Tool Shed?
Here is my current implementation for discussion/consideration: http://bitbucket.org/peterjc/galaxy-central/changeset/730b89c4da26
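
The changeset linked above is the actual implementation. Purely to illustrate the idea of a one-step filter, here is a minimal sketch in plain Python that does not use the Galaxy libraries. The function names and the `column` parameter are hypothetical, and it assumes a tab-separated identifier file and that the identifier is the first word of each FASTA header.

```python
def read_ids(tabular_lines, column=0):
    """Collect identifiers from one column of a tab-separated file."""
    ids = set()
    for line in tabular_lines:
        # Assumption: blank lines and lines starting with '#' carry no identifiers.
        if not line.strip() or line.startswith("#"):
            continue
        ids.add(line.rstrip("\n").split("\t")[column].strip())
    return ids


def filter_fasta_by_id(fasta_lines, wanted_ids):
    """Yield the lines of every FASTA record whose identifier is in wanted_ids."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumption: the identifier is the first word of the header line.
            fields = line[1:].split(None, 1)
            keep = bool(fields) and fields[0] in wanted_ids
        if keep:
            yield line
```

Because the FASTA file is streamed line by line and only the identifier set is held in memory, no intermediate tabular datasets are needed.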
I made some further updates on the branch, mainly adding unit tests: http://bitbucket.org/peterjc/galaxy-central/src/filter_fasta

In addition to this FASTA filter by ID script (in Python using the Galaxy libraries) I've also got a FASTQ filter by ID script (also in Python using the Galaxy libraries) and an SFF filter by ID script (in Python using Biopython). Would any of these be of interest for the main Galaxy distribution, or should I bundle them up as a tool suite for the Tool Shed instead?

Thanks,

Peter