As part of the BLAST+ wrappers I wrote a short Python script to divide a FASTA file into those records with or without an ID found in a column of a tabular file:
While working on the wrapper I ran into a problem: in general a tabular file might contain sequence IDs in any column, but when I tried using a data_column parameter it did not work if the tabular file had no rows in it. I think this is actually quite a common situation (e.g. sequences with no BLAST hits, or any tabular file after a strict filter). For BLAST tabular output only columns 1 and 2 contain sequence identifiers (query and subject), so I used a select parameter instead: http://bitbucket.org/galaxy/galaxy-central/changeset/aa7f4bdc2eab
The functionality to filter a FASTA file using IDs in a tabular file is actually very general. As an example workflow, I want to be able to take a FASTA file of proteins and get a FASTA file of those proteins with predicted transmembrane helices:
(1) Upload FASTA file (already in Galaxy)
(2) Run TMHMM to get tabular file (see other thread for wrapper)
(3) Filter tabular file to get just positive results (already in Galaxy)
(4) Filter FASTA file using those IDs found in new tabular file (above script)
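
To make step (4) concrete, here is a rough sketch of the general idea. It is not the actual script from the changeset; the function names read_ids and split_fasta and the zero-based column argument are just my assumptions for illustration. It takes the ID from the first word of each FASTA title line, and an empty tabular file simply yields an empty ID set, so every record ends up in the "without" output:

```python
def read_ids(tabular_handle, column=0):
    """Collect IDs from one column (zero-based, an assumption) of a tab-separated file."""
    ids = set()
    for line in tabular_handle:
        # Skip blank lines and comment lines
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if column < len(fields):
            ids.add(fields[column].strip())
    return ids  # empty set if the file has no rows


def split_fasta(fasta_handle, ids, pos_handle, neg_handle):
    """Write each FASTA record to pos_handle if its ID is in ids, else to neg_handle.

    Returns the number of records written to each output as (positive, negative).
    """
    out = None
    pos_count = neg_count = 0
    for line in fasta_handle:
        if line.startswith(">"):
            # Record ID is taken as the first word after the ">"
            parts = line[1:].split(None, 1)
            rec_id = parts[0] if parts else ""
            if rec_id in ids:
                out = pos_handle
                pos_count += 1
            else:
                out = neg_handle
                neg_count += 1
        # Lines before the first ">" belong to no record and are skipped
        if out is not None:
            out.write(line)
    return pos_count, neg_count
```

With ordinary files this would be called along the lines of read_ids(open("hits.tabular"), column=0) followed by split_fasta(...) with two output handles (the filename is made up). Reading all the IDs into a set first keeps the FASTA pass to a single scan, at the cost of holding the IDs in memory.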
I therefore think it might make more sense for this script to live under tools/fasta_tools - does that seem reasonable? I would also need to generalise the help text etc. But how should the data_column problem (no column choices when the tabular file is empty) be addressed?