I was thinking of something different. Here is an example of a three-exon transcript, in GTF format:

contig00035 Cufflinks transcript 3 22 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1";
contig00035 Cufflinks exon 3 10 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "1";
contig00035 Cufflinks exon 13 18 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "2";
contig00035 Cufflinks exon 20 22 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "3";

and the genome sequence that the transcript comes from is:
contig00035 GTAGCGTCTCCGACGCGGATATGACCGCACGCTGATGCTCCCAGGGATGAGAGGCGTGCG
I want the sequence for this transcript: I want to extract from the genome sequence the subsequences for positions 3-10, 13-18, and 20-22, and then concatenate the three subsequences to create the transcript sequence. In this case, it would be AGCGTCTC + ACGCGG + TAT, meaning the transcript sequence would be AGCGTCTCACGCGGTAT.

Is it possible to do this in Galaxy?

Karen :)

On Thu, 27 Jan 2011, Jennifer Jackson wrote:
Hello Karen,
The following general workflow should help you to pull sequences from any source.
1) Cut out the sequence IDs from the query (in this case, a GTF or BED file) and sort them.
   Text Manipulation -> Cut columns from a table
   Filter and Sort -> Sort
2) Convert the target FASTA file to tabular format.
   Convert Formats -> FASTA-to-Tabular converter
3) Join the two datasets based on the sequence ID.
   Join, Subtract and Group -> Join two Queries
4) Convert to FASTA.
   Convert Formats -> Tabular-to-FASTA
5) When starting with a GTF file, there will most likely be duplicates. To remove them, use:
   NGS: QC and manipulation -> Collapse sequences
Once you create the actual workflow that performs the job, be sure to save it so that you can re-use it whenever you need to perform the same task. To do this, from the history pane (rightmost) use Options -> Extract workflow and follow the instructions on the form to customize it.
Hopefully this helps,
Jen
Galaxy team
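For readers following the thread, here is a rough Python sketch of what the five-step workflow above appears to do, as I read it: collect the sequence IDs from column 1 of the GTF, then look them up in the FASTA file by ID. This is not Galaxy tool code; the function names `read_fasta` and `contigs_named_in_gtf` are made up for illustration, and the sketch assumes that the "sequence ID" is the first GTF column.

```python
def read_fasta(lines):
    """Parse FASTA text lines into {sequence_id: sequence} (like FASTA-to-Tabular)."""
    chunks = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # ID is the first word after '>'
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {n: "".join(parts) for n, parts in chunks.items()}


def contigs_named_in_gtf(gtf_lines, fasta_lines):
    """Return (id, sequence) pairs for every contig mentioned in the GTF."""
    # Steps 1 and 5: cut column 1, sort, and drop duplicate IDs.
    ids = sorted({
        line.split()[0]
        for line in gtf_lines
        if line.strip() and not line.startswith("#")
    })
    # Steps 2-4: join the IDs against the tabular form of the FASTA file.
    genome = read_fasta(fasta_lines)
    return [(i, genome[i]) for i in ids if i in genome]


gtf_lines = [
    'contig00035 Cufflinks exon 3 10 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "1";',
    'contig00035 Cufflinks exon 13 18 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "2";',
]
fasta_lines = [
    ">contig00035",
    "GTAGCGTCTCCGACGCGGATATGACCGCACGCTGATGCTCCCAGGGATGAGAGGCGTGCG",
]
for seq_id, seq in contigs_named_in_gtf(gtf_lines, fasta_lines):
    print(">" + seq_id)
    print(seq)
```

Under these assumptions the result is the whole contig, returned once per ID, rather than the concatenated exon intervals.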
On 1/26/11 12:05 PM, Karen Tang wrote:
Hi Galaxy people,
I have transcripts predicted by Cufflinks that are in a gtf file. How can I extract the sequences corresponding to those transcripts, using Galaxy?
[Cufflinks transcript predictions in gtf file] + [Genome sequence in FASTA file] ---> [FASTA file of transcript sequences]
My genome is a custom genome (not at UCSC).
---------
I'll also need to do the same thing, except my predicted transcripts are in a Scripture bed file.
Thanks for your help!
Karen Tang :)
Plant Biology
University of Minnesota
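The [GTF] + [FASTA] ---> [transcript FASTA] step asked about in this thread can be sketched outside Galaxy in a few lines of Python. This is only an illustration under stated assumptions, not a Galaxy tool: the function names `parse_attributes` and `transcript_sequences` are hypothetical, GTF coordinates are taken as 1-based and inclusive (which matches the 3-10, 13-18, 20-22 example at the top of the thread), the first eight GTF columns are assumed to contain no internal whitespace, and minus-strand transcripts are reverse-complemented after concatenation.

```python
from collections import defaultdict

# Translation table for complementing a DNA string.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_attributes(field):
    """Turn 'gene_id "X"; transcript_id "Y";' into {'gene_id': 'X', ...}."""
    attrs = {}
    for part in field.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def transcript_sequences(gtf_lines, genome):
    """Concatenate the exon subsequences of each transcript.

    gtf_lines: iterable of GTF records (tab- or space-delimited)
    genome:    dict mapping contig name -> sequence string
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split(None, 8)  # 8 fixed columns + attribute string
        if len(cols) < 9 or cols[2] != "exon":
            continue  # skip 'transcript' rows and malformed lines
        contig, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        tid = parse_attributes(cols[8])["transcript_id"]
        exons[tid].append((contig, start, end, strand))

    result = {}
    for tid, parts in exons.items():
        parts.sort(key=lambda e: e[1])  # ascending genomic start
        # 1-based inclusive [start, end] -> Python slice [start-1:end]
        seq = "".join(genome[c][s - 1:e] for c, s, e, _ in parts)
        if parts[0][3] == "-":
            seq = seq.translate(COMPLEMENT)[::-1]  # reverse complement
        result[tid] = seq
    return result


gtf = [
    'contig00035 Cufflinks transcript 3 22 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1";',
    'contig00035 Cufflinks exon 3 10 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "1";',
    'contig00035 Cufflinks exon 13 18 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "2";',
    'contig00035 Cufflinks exon 20 22 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "3";',
]
genome = {"contig00035": "GTAGCGTCTCCGACGCGGATATGACCGCACGCTGATGCTCCCAGGGATGAGAGGCGTGCG"}

for tid, seq in transcript_sequences(gtf, genome).items():
    print(">" + tid)
    print(seq)  # AGCGTCTCACGCGGTAT for the example above
```

A BED file would need different coordinate handling (BED starts are 0-based and ends are exclusive), so the slice arithmetic above should not be reused for it unchanged.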
_______________________________________________ galaxy-user mailing list galaxy-user@lists.bx.psu.edu http://lists.bx.psu.edu/listinfo/galaxy-user