-------- Original Message --------
Subject: Re: [galaxy-user] Extract sequences from [gtf file] + [genome FASTA file]
Date: Thu, 27 Jan 2011 17:23:11 -0700
From: Brian Foley PhD <btf@lanl.gov>
To: Jennifer Jackson <jen@bx.psu.edu>

Dear Jen,

I am not much of a Galaxy user yet, but I am a long-time user of GenBank and other databases and sequence analysis tools (phylogenetics software, etc.).

A common task I would like to do is to obtain a FASTA format file (ideally aligned, but I can do the alignment later very easily) of the regions of sequences hit in a BLAST search on GenBank. It is easy to ask GenBank to give me all (or the selected few) sequences hit in the BLAST search, but not so easy to get each sequence "clipped" to the matched region. For example, if I search with the D-loop region of a mammal mitochondrial genome, I would like to get that region clipped out of all the hundreds of complete mitochondrial genomes. Or if I search with a mammalian endogenous retrovirus, I would like to get the retroviruses clipped from the complete chromosome entries.

Ideally, I would add one more criterion: I would like to be able to get some number of bases (let's say 100) flanking the matched region. That way I could capture the integration sites of endogenous retroviruses, for example, or get the intron flanks of a gene if I was searching with a mammalian gene exon.

The final thing would be to deal with the fact that GenBank BLAST match results often get fragmented. For example, the LTRs of retroviruses (endogenous or not) create a problem. Any large in/dels or highly variable regions also often split one contiguous homologous string into two individual matches, split at the in/del or variable site.

This looks somewhat similar to the task you describe below, so I am wondering if it is something I can do in Galaxy (or with Galaxy plus a few other tools). GenBank/BLAST will almost give me what I want. The trouble I find is that either I can get the result as a multiple sequence alignment, but with useless sequence names (just the gi number as identifier) and not in FASTA format, or I can get full sequence entries, but without the matched region clipped out. I have asked NCBI/GenBank if they would serve up the results in FASTA format, but they have not been responsive on that.

Brian Foley PhD
HIV Databases
btf@lanl.gov
http://www.hiv.lanl.gov

On 1/27/11 1:36 PM, "Jennifer Jackson" <jen@bx.psu.edu> wrote:
Hello Karen,
The following general workflow should help you to pull sequences from any source.
1) Cut out the sequence IDs from the query (in this case, a GTF & BED file) and sort them.
   Text Manipulation -> Cut columns from a table
   Filter and Sort -> Sort
2) Convert the target FASTA file to tabular format.
   Convert Formats -> FASTA-to-Tabular converter
3) Join the two datasets based on the sequence ID.
   Join, Subtract and Group -> Join two Queries
4) Convert back to FASTA.
   Convert Formats -> Tabular-to-FASTA
5) When starting with a GTF file, there will most likely be duplicates. To remove them, use:
   NGS: QC and manipulation -> Collapse sequences
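For anyone who wants to see what these steps do to the data, here is a rough sketch of the same logic in plain Python, outside of Galaxy. It is only an illustration, not how the Galaxy tools are implemented. All function names are made up for this example, it assumes the sequence ID sits in the first tab-separated column of the query file and matches the first word of each FASTA header, and the duplicate removal is a simplification of what Collapse sequences does.

```python
def cut_ids(table_text, column=0):
    """Step 1: cut one column out of a tab-separated table and sort it.
    Assumption: the sequence ID is in the first column (index 0)."""
    ids = []
    for line in table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        ids.append(line.split("\t")[column])
    return sorted(ids)  # duplicates are kept, as with a plain sort


def fasta_to_tabular(fasta_text):
    """Step 2: turn FASTA records into (id, sequence) rows."""
    rows, name, chunks = [], None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                rows.append((name, "".join(chunks)))
            fields = line[1:].split()
            # Assumption: the first word of the header is the ID.
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        rows.append((name, "".join(chunks)))
    return rows


def join_on_id(ids, rows):
    """Step 3: keep one row per query ID that has a matching sequence."""
    lookup = dict(rows)
    return [(i, lookup[i]) for i in ids if i in lookup]


def tabular_to_fasta(rows):
    """Step 4: write (id, sequence) rows back out as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in rows)


def collapse(rows):
    """Step 5: drop repeated rows, keeping the first occurrence.
    A simplified stand-in for the Collapse sequences tool."""
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def extract_sequences(query_text, fasta_text):
    """Run the steps in order; duplicates are collapsed before writing."""
    ids = cut_ids(query_text)
    rows = fasta_to_tabular(fasta_text)
    return tabular_to_fasta(collapse(join_on_id(ids, rows)))
```

Calling extract_sequences() with the text of a GTF file and the text of a FASTA file returns FASTA text containing only the sequences whose IDs appear in the GTF file, each listed once.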
Once you create the actual workflow that performs the job, be sure to save it so that you can re-use it whenever you need to perform the same task. To do this, from the history pane (the rightmost pane) use Options -> Extract workflow and follow the instructions on the form to customize it.
Hopefully this helps,
Jen
Galaxy team