Hi all,
As part of the BLAST+ wrappers I wrote a short Python script to divide
a FASTA file into those records with or without an ID found in a given
column of a tabular file:
http://bitbucket.org/galaxy/galaxy-central/src/tip/tools/ncbi_blast_plus/...
http://bitbucket.org/galaxy/galaxy-central/src/tip/tools/ncbi_blast_plus/...
While working on the wrapper I ran into a problem: in general a
tabular file might contain sequence IDs in any column. When I tried
using a data_column parameter, it did not work if the tabular file had
no rows in it. I think this is actually quite a common situation (e.g.
sequences with no BLAST hits, or any tabular file after a strict
filter). For BLAST tabular output only columns 1 and 2 contain
sequence identifiers (query and subject), so I used a select parameter
instead:
http://bitbucket.org/galaxy/galaxy-central/changeset/aa7f4bdc2eab
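To illustrate why an empty tabular file need not be a problem for the
script itself, here is a minimal sketch of collecting IDs from one
column. The function name read_ids and the 1-based column argument are
hypothetical, chosen for illustration, and this is not the code in the
changeset above. An input with no rows simply gives an empty set:

```python
def read_ids(tabular_lines, column):
    """Collect the IDs found in one column of tab-separated lines.

    `column` is 1-based (an assumption made for this sketch).
    Blank lines and lines starting with '#' are skipped.
    """
    ids = set()
    for line in tabular_lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if column > len(fields):
            raise ValueError("Only %d columns in line: %r" % (len(fields), line))
        ids.add(fields[column - 1].strip())
    return ids
```

With no rows there is nothing to iterate over, so every FASTA record
would end up in the "without a matching ID" output.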
The functionality to filter a FASTA file using IDs in a tabular file
is actually very general. As an example workflow, I want to be able
to take a FASTA file of proteins and get a FASTA file of those
proteins with predicted transmembrane helices.
(1) Upload FASTA file (already in Galaxy)
(2) Run TMHMM to get tabular file (see other thread for wrapper)
(3) Filter tabular file to get just positive results (already in Galaxy)
(4) Filter FASTA file using those IDs found in new tabular file (above script)
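Step (4) could be sketched roughly as follows. This is only an
illustration and not the script linked above: the name split_fasta is
hypothetical, and I am assuming the record ID is the first
whitespace-delimited word after the ">" on each header line.

```python
def split_fasta(fasta_lines, wanted_ids):
    """Split FASTA lines into (matched, unmatched) lists of lines.

    A record is matched if its ID (the first word after '>') is in
    wanted_ids. Any lines before the first header go to unmatched.
    """
    matched, unmatched = [], []
    target = unmatched
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            record_id = parts[0] if parts else ""
            # Choose the output for this record and its sequence lines
            target = matched if record_id in wanted_ids else unmatched
        target.append(line)
    return matched, unmatched
```

Here wanted_ids would be the set of IDs taken from the filtered TMHMM
table; if that set is empty, all the records go to the second list.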
I therefore think it might make more sense for this script to live
under tools/fasta_tools - does that seem reasonable? I would also need
to generalise the help text etc - but how should the data_column
problem be addressed?
Thanks,
Peter