Re: [galaxy-dev] Extracting regions of sequences given table of co-ords

11 Aug 2011


      On Thu, Aug 4, 2011 at 4:17 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
...
On Wed, Aug 3, 2011 at 4:56 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
...
Hi all,
I'm about to start working on a general tool for extracting sequences
(from FASTA etc) based on a tabular file describing desired regions
(this could be a BED or GFF3 file), giving a new sequence file with
one entry for each region in the tabular file. Essentially I want as
input columns describing:
* FASTA sequence ID
* Start
* End
* Strand (optional)
* Region ID for this sequence in the output file (optional)
In the case of a GFF3 input, those would be columns 1, 4, 5, 7 (with
the ID being hidden inside column 9). But this isn't really driven by
GFF3, where you'd probably not want to treat each line separately
anyway (compound features).
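As a rough sketch of the idea only (not an existing Galaxy tool), the column-driven extraction described above might look like the following in plain Python. The function names, the default column indices, and the fallback "id_start_end" naming of output records are all assumptions made for illustration. Coordinates are taken to be one-based and inclusive as in GFF3; a BED file (zero-based, half-open) would need its start adjusted first. The reverse complement step only makes sense for nucleotide sequences.

```python
# Hypothetical sketch: extract regions of FASTA sequences listed in a table.
# All names here are illustrative, not part of Galaxy.

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_fasta(lines):
    """Return a dict mapping sequence ID (first word of the header) to sequence."""
    chunks = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


def extract_regions(seqs, rows, id_col=0, start_col=1, end_col=2,
                    strand_col=None, name_col=None):
    """Yield (name, subsequence) for each tab-separated row.

    Column arguments are zero-based indices into the row.
    Start and end are assumed one-based and inclusive (GFF3 style).
    """
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        seq_id = fields[id_col]
        start = int(fields[start_col])
        end = int(fields[end_col])
        region = seqs[seq_id][start - 1:end]
        # Reverse complement minus-strand regions (nucleotides only).
        if strand_col is not None and fields[strand_col] == "-":
            region = region.translate(_COMPLEMENT)[::-1]
        if name_col is not None:
            name = fields[name_col]
        else:
            name = "%s_%i_%i" % (seq_id, start, end)
        yield name, region
```

For a GFF3 file, columns 1, 4, 5 and 7 correspond to id_col=0, start_col=3, end_col=4 and strand_col=6 here. Pulling an ID out of column 9 would need extra attribute parsing, which this sketch leaves out.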
I'm thinking of very general uses like extracting search
matches (BLAST, HMMER, motifs, etc.), but more specifically things like
cleaving proteins to remove a predicted signal peptide. This could
even be used for trimming of sequencing reads (for small datasets,
e.g. capillary reads).
Is there anything like this in Galaxy already? If so I can't find it ;)
I've also skimmed the tool shed.
It turns out there sort of is: extract/extract_genomic_dna.xml
However, as written it isn't clear to me how to make this work
with general tabular files, and how it would behave with proteins
(which in principle can be processed the same way).
It also appears to insist on having both start and end present
(I want to be able to have these default to the start/end of the
referenced parent sequence).
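To illustrate the defaulting behaviour wanted here, a minimal sketch in Python follows. The helper name and the choice to treat an empty field or "." as "missing" are assumptions for the example, not anything extract_genomic_dna.xml does. Coordinates are taken to be one-based and inclusive.

```python
# Hypothetical helper: fill in a missing start or end from the parent sequence.

def resolve_coords(start_field, end_field, seq_length):
    """Return (start, end), one-based inclusive.

    A field that is empty or "." is treated as missing (an assumption):
    a missing start defaults to 1, a missing end to the sequence length.
    """
    missing = ("", ".")
    start = 1 if start_field.strip() in missing else int(start_field)
    end = seq_length if end_field.strip() in missing else int(end_field)
    if not 1 <= start <= end <= seq_length:
        raise ValueError("Region %i..%i is outside a sequence of length %i"
                         % (start, end, seq_length))
    return start, end
```

For the signal peptide case, a row giving only a start (say 23) and no end would then yield everything from residue 23 to the end of the protein.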
Peter
Any thoughts from the Galaxy team? Are you open to making
extract/extract_genomic_dna.xml more general - for instance
renaming and rewording to indicate it works on protein FASTA
files as well as nucleotide FASTA files and reference genomes
(assuming it does work)?

Peter