[galaxy-dev] FASTA filtering by ID

18 Nov 2010

Hi all,

Something I want to do in several of my workflows is to filter a
FASTA file (or potentially sequence files in other formats) using a
list of desired identifiers (e.g. a column from a tabular file).

Right now I can achieve this with three steps in Galaxy.
Suppose I have:

Dataset #1, FASTA file

Dataset #2, Tabular file with identifiers of interest (e.g. BLAST hits,
or filtered output from a sequence analysis tool)

Then:

Create tabular Dataset #3 using FASTA-to-tabular on Dataset #1,
subject to the enhancement proposed here:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2010-November/003717.html

Create tabular Dataset #4 using join on Datasets #2 and #3 using the
matched identifier columns. This does the filtering.

Create FASTA Dataset #5 using tabular-to-FASTA on Dataset #4.

This works (at least for reasonably sized datasets), but requires
three steps and the creation of at least two temporary files.
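
As a rough illustration of the three steps above, here is a minimal
Python sketch. It is not Galaxy's own code: the function names and the
example data are hypothetical, it assumes the identifier is the first
word of each FASTA title line, and the join is simplified to "keep the
rows whose identifier is wanted".

```python
def fasta_to_tabular(fasta_text):
    """Step 1: turn FASTA text into (identifier, sequence) rows."""
    rows, ident, chunks = [], None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if ident is not None:
                rows.append((ident, "".join(chunks)))
            # Assumption: the identifier is the first word of the title line.
            words = line[1:].split()
            ident, chunks = (words[0] if words else ""), []
        elif line and ident is not None:
            chunks.append(line)
    if ident is not None:
        rows.append((ident, "".join(chunks)))
    return rows


def join_on_identifier(hit_rows, sequence_rows, hit_column=0):
    """Step 2: keep only sequence rows whose identifier appears in hit_rows."""
    wanted = {row[hit_column] for row in hit_rows}
    return [row for row in sequence_rows if row[0] in wanted]


def tabular_to_fasta(rows):
    """Step 3: write (identifier, sequence) rows back out as FASTA text."""
    return "".join(">%s\n%s\n" % (ident, seq) for ident, seq in rows)


# Tiny made-up example data, standing in for Datasets #1 and #2.
dataset1 = ">seqA some description\nACGT\nAC\n>seqB\nGGG\n>seqC\nTTT\n"
dataset2 = [("seqB", "hit1"), ("seqA", "hit2")]

dataset3 = fasta_to_tabular(dataset1)               # temporary table
dataset4 = join_on_identifier(dataset2, dataset3)   # temporary table
dataset5 = tabular_to_fasta(dataset4)               # filtered FASTA
```

Note that in this sketch the whole file is held in memory twice (as
Datasets #3 and #4), and the description text and original line
wrapping are lost in the round trip.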

I'd like to introduce another tool under "FASTA manipulation"
to do it in one step (rather than three). Am I going against
the apparent Galaxy ideal that complex manipulations should
be done with tabular files? Would such a FASTA filter tool be
of interest to add directly to Galaxy (e.g. under the "FASTA
manipulation" section), or better off on the community tool shed?
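
To make the one-step idea concrete, a single-pass filter could look
roughly like the Python sketch below. This is only an illustration
under stated assumptions, not the changeset linked below: the function
names are hypothetical, identifiers are taken to be the first word of
each title line, and lines starting with "#" in the tabular file are
treated as comments.

```python
import io


def read_ids(tabular_handle, column=0):
    """Collect identifiers from one column of a tab-separated file."""
    ids = set()
    for line in tabular_handle:
        # Assumption: blank lines and "#" comment lines carry no identifiers.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if column < len(fields):
            ids.add(fields[column].strip())
    return ids


def filter_fasta_by_id(fasta_handle, wanted_ids, out_handle):
    """Copy records whose identifier is in wanted_ids; return how many."""
    keep = False
    count = 0
    for line in fasta_handle:
        if line.startswith(">"):
            # Assumption: the identifier is the first word of the title line.
            words = line[1:].split()
            keep = bool(words) and words[0] in wanted_ids
            if keep:
                count += 1
        if keep:
            # Lines are copied verbatim, so descriptions and wrapping survive.
            out_handle.write(line)
    return count


# Made-up example data standing in for the tabular and FASTA datasets.
hits = io.StringIO("seqB\t1e-20\nseqA\t3e-5\n")
fasta = io.StringIO(">seqA some description\nACGT\nAC\n>seqB\nGGG\n>seqC\nTTT\n")
filtered = io.StringIO()
kept = filter_fasta_by_id(fasta, read_ids(hits), filtered)
```

Only the set of identifiers is held in memory; the FASTA file is
streamed once and no intermediate tabular datasets are created.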

Here is my current implementation for discussion/consideration:
http://bitbucket.org/peterjc/galaxy-central/changeset/730b89c4da26

Thanks,

Peter