Re: [galaxy-user] Extract sequences from [gtf file] + [genome FASTA file]

28 Jan 2011

      -------- Original Message --------
Subject: Re: [galaxy-user] Extract sequences from [gtf file] + [genome 
FASTA file]
Date: Thu, 27 Jan 2011 17:23:11 -0700
From: Brian Foley PhD <btf@lanl.gov>
To: Jennifer Jackson <jen@bx.psu.edu>

Dear Jen,

     I am not much of a Galaxy user yet, but I am a long-time user of GenBank
and other databases and sequence analysis tools (phylogenetics software, etc.).

    A common task I would like to do is to obtain a FASTA-format file (ideally
aligned, but I can do the alignment later very easily) of the regions of
sequences hit in a BLAST search on GenBank.

     It is easy to ask GenBank to give me all (or the selected few) sequences
hit in the BLAST search, but not so easy to get each sequence "clipped" to
the matched region.  For example, if I search with the D-loop region of a
mammal mitochondrial genome, I would like to get that region clipped out of
all the hundreds of complete mitochondrial genomes.  Or if I search with a
mammalian endogenous retrovirus, get the retroviruses clipped from the
complete chromosome entries.

     Ideally, I would add one more criterion: I would like to be able to get
some number of bases (let's say 100) flanking the matched region, so that I
could capture the integration sites of endogenous retroviruses, for example,
or get the intron flanks of a gene if I were searching with a mammalian gene
exon.
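
The clipping-with-flanks idea above can be sketched in a few lines of
Python. This is only an illustration, not something Galaxy or BLAST
provides under this name: the function clip_with_flank and its parameters
are hypothetical, and the coordinates are assumed to be 1-based and
inclusive, as they appear in BLAST reports.

```python
def clip_with_flank(sequence, start, end, flank=100):
    """Return the matched region plus up to `flank` bases on each side.

    `start` and `end` are assumed to be 1-based, inclusive coordinates.
    The flank is truncated at the ends of the sequence.
    """
    if start > end:
        # Minus-strand hits list coordinates in reverse order; this sketch
        # only normalizes them and does NOT reverse-complement the result.
        start, end = end, start
    left = max(start - 1 - flank, 0)          # convert to 0-based slice start
    right = min(end + flank, len(sequence))   # slice end is exclusive
    return sequence[left:right]


# Example: a 5-base match at positions 11-15 with 3 flanking bases.
genome = "A" * 10 + "CCCCC" + "G" * 10
region = clip_with_flank(genome, 11, 15, flank=3)
```
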
     The final thing would be to deal with the fact that GenBank BLAST match
results often get fragmented.   For example the LTRs of retroviruses
(endogenous or not) create a problem.  And any large in/dels or highly
variable regions often split one contiguous homologous string into two
individual matches split at the in/del or variable site.
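
One possible way to handle the fragmentation is to merge matches on the
same subject sequence when they overlap or sit within some chosen distance
of each other, and then clip the merged span. The sketch below assumes the
hits have already been reduced to (start, end) pairs for a single subject
sequence; the name merge_hits and the max_gap threshold are hypothetical,
and a gap-based rule is an assumption that may or may not suit cases such
as paired LTRs.

```python
def merge_hits(hits, max_gap=500):
    """Merge (start, end) hits that overlap or lie within max_gap bases.

    All hits are assumed to be on the same subject sequence.
    Returns a list of merged (start, end) tuples sorted by start.
    """
    # Normalize each hit so that start <= end, then sort by start position.
    ordered = sorted((min(s, e), max(s, e)) for s, e in hits)
    merged = []
    for start, end in ordered:
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous span: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(span) for span in merged]


# Two fragments split by a 50-base variable site, plus one distant hit.
spans = merge_hits([(250, 400), (100, 200), (5000, 5100)], max_gap=100)
```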

     This looks somewhat similar to the task you describe below, so I am
wondering if it is something I can do in Galaxy (or with Galaxy plus a few
other tools).

     GenBank/BLAST will almost give me what I want.  The trouble I find is
that either I can get the result as a multiple sequence alignment but with
useless sequence names (just the gi number for identifier) and not in FASTA
format, or I can get full sequence entries but not the matched region
clipped out.  I have asked NCBI/GenBank if they would serve up the results
in FASTA format, but they are not responsive on that.

Brian Foley PhD
HIV Databases
btf@lanl.gov
http://www.hiv.lanl.gov

On 1/27/11 1:36 PM, "Jennifer Jackson" <jen@bx.psu.edu> wrote:
...
Hello Karen,
The following general workflow should help you to pull sequences from
any source.
1) cut out the sequence IDs from the query (in this case, a GTF & BED
file) and sort them.
Text Manipulation -> Cut columns from a table
Filter and Sort -> Sort
2) convert the target FASTA file to tabular format
Convert Formats ->  FASTA-to-Tabular converter
3) join the two datasets based on the sequence ID
Join, Subtract and Group -> Join two Queries
4) convert to FASTA
Convert Formats -> Tabular-to-FASTA
5) when starting with a GTF file, there will most likely be duplicates.
To remove, use:
NGS: QC and manipulation -> Collapse sequences
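
For readers who want to see the logic of the steps above outside Galaxy,
here is a rough Python sketch. It is not how the Galaxy tools are
implemented. It assumes the sequence IDs are in column 1 of the GTF file
and match the first word of each FASTA header, and it removes duplicate IDs
before the join rather than collapsing identical sequences afterwards. The
function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (id, sequence) pairs; the id is the first word of the header."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            words = line[1:].split()
            seq_id, chunks = (words[0] if words else ""), []
        elif line and seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def extract_by_id(gtf_lines, fasta_lines):
    """Return FASTA text for the sequences whose IDs appear in the GTF."""
    # Steps 1 and 5: cut column 1, drop duplicate IDs, and sort.
    ids = sorted({
        line.split("\t")[0].strip()
        for line in gtf_lines
        if line.strip() and not line.startswith("#")
    })
    # Step 2: FASTA to a table keyed by sequence ID.
    table = dict(read_fasta(fasta_lines))
    # Steps 3 and 4: join on the ID and write the matches back out as FASTA.
    return "".join(f">{i}\n{table[i]}\n" for i in ids if i in table)


gtf = [
    'chr2\tsrc\texon\t1\t5\t.\t+\t.\tgene_id "a";',
    'chr1\tsrc\texon\t1\t5\t.\t+\t.\tgene_id "b";',
    'chr2\tsrc\texon\t7\t9\t.\t+\t.\tgene_id "a";',
]
fasta = [">chr1 example", "ACGT", "AC", ">chr2", "GGG", ">chr3", "TTT"]
result = extract_by_id(gtf, fasta)
```
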
Once you create the actual workflow that performs the job, be sure to
save it so that you can just re-use it whenever you need to perform the
same task. To do this, from the history pane (rightmost), use Options ->
Extract workflow and follow the instructions on the form to customize it.
Hopefully this helps,
Jen
Galaxy team