Re: [galaxy-dev] FASTA filtering by ID

6 Dec 2010

On Thu, Nov 18, 2010 at 5:46 PM, Peter <peter@maubp.freeserve.co.uk> wrote:
...
Hi all,
Something I want to do in several of my workflows is to filter a
FASTA file (or potentially sequence files in other formats) using a
list of desired identifiers (e.g. a column from a tabular file).
Right now I can achieve this with three steps in Galaxy.
Suppose I have:
Dataset #1, FASTA file
Dataset #2, Tabular file with identifiers of interest (e.g. BLAST hits,
or filtered output from a sequence analysis tool)
Then:
1. Create tabular Dataset #3 using FASTA-to-tabular on Dataset #1,
   subject to the enhancement proposed here:
   http://lists.bx.psu.edu/pipermail/galaxy-dev/2010-November/003717.html
2. Create tabular Dataset #4 by joining Datasets #2 and #3 on the
   matched identifier columns. This does the filtering.
3. Create FASTA Dataset #5 using tabular-to-FASTA on Dataset #4.
This works (at least for reasonably sized datasets), but requires
three steps and the creation of at least two temporary files.
I'd like to introduce another tool under "FASTA manipulation"
to do it in one step (rather than three). Am I going against
the apparent Galaxy ideal that complex manipulations should
be done with tabular files? Would such a FASTA filter tool be
of interest to add directly to Galaxy (e.g. under the "FASTA
manipulation" section), or better off on the community tool shed?
Here is my current implementation for discussion/consideration:
http://bitbucket.org/peterjc/galaxy-central/changeset/730b89c4da26
I made some further updates on the branch, mainly adding unit tests:
http://bitbucket.org/peterjc/galaxy-central/src/filter_fasta
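
For anyone who wants the gist of the one-step filter without reading the
changeset, here is a rough sketch in plain Python. It is not the code on
the branch (which uses the Galaxy libraries). The function names
read_ids and filter_fasta are made up for illustration, and it assumes
the identifier is the first word after the ">" on each header line.

```python
def read_ids(lines, column=0):
    """Collect identifiers from one column of tab-separated lines.

    Blank lines and lines starting with "#" are skipped (an assumption
    about how the tabular file is laid out).
    """
    ids = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if column < len(fields) and fields[column].strip():
            ids.add(fields[column].strip())
    return ids


def filter_fasta(lines, wanted):
    """Yield the lines of FASTA records whose identifier is in wanted.

    The identifier is taken to be the first whitespace-delimited word
    after the ">" on the header line. Records are streamed, so no
    intermediate tabular file is needed.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split(None, 1)
            keep = bool(words) and words[0] in wanted
        if keep:
            yield line
```

Both functions accept any iterable of lines, so open file handles can be
passed in directly and the output written with writelines(). Output order
follows the FASTA file, not the identifier list.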

In addition to this FASTA filter-by-ID script (in Python, using the Galaxy
libraries), I've also got a FASTQ filter-by-ID script (also in Python, using
the Galaxy libraries) and an SFF filter-by-ID script (in Python, using
Biopython).
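
To give a flavour of the FASTQ case, here is a rough plain-Python sketch
of filtering by ID. It is not the actual script (which uses the Galaxy
libraries). The name filter_fastq is made up for illustration, and it
assumes strict four-line records with no line wrapping and no special
handling of paired-read suffixes such as /1 and /2.

```python
from itertools import islice


def filter_fastq(lines, wanted):
    """Yield the lines of four-line FASTQ records whose ID is in wanted.

    Records are read four lines at a time rather than by looking for
    "@" at the start of a line, because a quality string may itself
    begin with "@".
    """
    lines = iter(lines)
    while True:
        record = list(islice(lines, 4))
        if not record:
            break  # end of input
        if len(record) < 4 or not record[0].startswith("@"):
            raise ValueError("expected a four-line FASTQ record")
        # Identifier: first whitespace-delimited word after the "@".
        words = record[0][1:].split(None, 1)
        if words and words[0] in wanted:
            for line in record:
                yield line
```

As with FASTA, the records are streamed, so memory use depends on the
size of the identifier set rather than on the size of the read file.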

Would any of these be of interest for the main Galaxy distribution,
or should I bundle them up as a tool suite for the Tool Shed instead?

Thanks,

Peter