Extract sequences from [gtf file] + [genome FASTA file]
Hi Galaxy people,
I have transcripts predicted by Cufflinks that are in a gtf file. How can I extract the sequences corresponding to those transcripts, using Galaxy?
[Cufflinks transcript predictions in gtf file] + [Genome sequence in FASTA file] ---> [FASTA file of transcript sequences]
My genome is a custom genome (not at UCSC).
---------
I'll also need to do the same thing, except my predicted transcripts are in a Scripture bed file.
Thanks for your help!
Karen Tang :)
Plant Biology
University of Minnesota
Hello Karen,
The following general workflow should help you pull sequences from any source.
1) Cut out the sequence IDs from the query (in this case, a GTF or BED file) and sort them.
   Text Manipulation -> Cut columns from a table
   Filter and Sort -> Sort
2) Convert the target FASTA file to tabular format.
   Convert Formats -> FASTA-to-Tabular converter
3) Join the two datasets based on the sequence ID.
   Join, Subtract and Group -> Join two Queries
4) Convert to FASTA.
   Convert Formats -> Tabular-to-FASTA
5) When starting with a GTF file, there will most likely be duplicates. To remove them, use:
   NGS: QC and manipulation -> Collapse sequences
Once you create the actual workflow that performs the job, be sure to save it so that you can re-use it whenever you need to perform the same task. To do this, from the history pane (rightmost) use Options -> Extract workflow and follow the instructions on the form to customize.
Hopefully this helps,
Jen
Galaxy team
_______________________________________________ galaxy-user mailing list galaxy-user@lists.bx.psu.edu http://lists.bx.psu.edu/listinfo/galaxy-user
-- Jennifer Jackson http://usegalaxy.org http://galaxyproject.org
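The join-based workflow in Jen's reply can be sketched outside Galaxy as a lookup on sequence ID. This is a minimal Python sketch, not Galaxy's implementation: the function names (`cut_ids`, `fasta_to_tabular`, `join_to_fasta`) are hypothetical, the query file is assumed to be tab-delimited with the sequence ID in its first column, and de-duplicating the IDs stands in for the "Collapse sequences" step. Note that it returns whole FASTA records for each matching ID, not exon subsequences.

```python
def cut_ids(table_text, column=0):
    """Steps 1 and 5: cut one column of IDs, drop duplicates, and sort."""
    ids = set()
    for line in table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        ids.add(line.split("\t")[column])
    return sorted(ids)


def fasta_to_tabular(fasta_text):
    """Step 2: map each FASTA ID (first word of the header) to its sequence."""
    records = {}
    seq_id = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            seq_id = line[1:].split()[0]
            records[seq_id] = []
        elif seq_id is not None:
            records[seq_id].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def join_to_fasta(ids, records):
    """Steps 3 and 4: keep IDs present in both inputs and write FASTA text."""
    return "".join(f">{i}\n{records[i]}\n" for i in ids if i in records)
```

A call such as `join_to_fasta(cut_ids(gtf_text), fasta_to_tabular(genome_fasta_text))` would chain the steps, where both arguments are the file contents read as strings.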
I was thinking of something different. Here is an example of a three-exon transcript, in gtf format:
contig00035 Cufflinks transcript 3 22 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1";
contig00035 Cufflinks exon 3 10 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "1";
contig00035 Cufflinks exon 13 18 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "2";
contig00035 Cufflinks exon 20 22 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "3";
and the genome sequence that the transcript comes from is:
contig00035 GTAGCGTCTCCGACGCGGATATGACCGCACGCTGATGCTCCCAGGGATGAGAGGCGTGCG
I want the sequence for this transcript: I want to extract from the genome sequence the subsequences for positions 3-10, 13-18, and 20-22, and then concatenate the three subsequences to create the transcript sequence.
In this case, it would be AGCGTCTC + ACGCGG + TAT, meaning the transcript sequence would be AGCGTCTCACGCGGTAT.
Is it possible to do this in Galaxy?
Karen :)
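Karen's example (extract positions 3-10, 13-18, and 20-22, then concatenate) can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not what any Galaxy tool does internally: `transcript_sequences` and `reverse_complement` are hypothetical names, the GTF is assumed to be tab-delimited with 1-based inclusive coordinates, and the minus-strand handling (reverse-complementing the concatenated exons) is an addition that the thread itself does not discuss.

```python
import re
from collections import defaultdict


def reverse_complement(seq):
    """Reverse-complement a DNA string; characters other than ACGT pass through."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def transcript_sequences(gtf_lines, genome):
    """Concatenate the exon subsequences of each transcript_id.

    genome maps a sequence name to its sequence string.
    Non-exon records (such as Cufflinks "transcript" lines) are skipped.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        exons[match.group(1)].append(
            (fields[0], int(fields[3]), int(fields[4]), fields[6])
        )

    result = {}
    for transcript_id, parts in exons.items():
        parts.sort(key=lambda part: part[1])  # genomic order by start
        # GTF is 1-based inclusive, so start-1 is the 0-based slice start.
        seq = "".join(genome[name][start - 1:end] for name, start, end, _ in parts)
        if parts[0][3] == "-":  # assumption: minus strand needs reverse complement
            seq = reverse_complement(seq)
        result[transcript_id] = seq
    return result


# The example from the message, rebuilt with tab delimiters.
genome = {
    "contig00035": "GTAGCGTCTCCGACGCGGATATGACCGCACGCTGATGCTCCCAGGGATGAGAGGCGTGCG"
}
attrs = 'gene_id "CUFF.23955"; transcript_id "CUFF.23955.1";'
gtf_lines = [
    "\t".join(["contig00035", "Cufflinks", "transcript", "3", "22", "1000", "+", ".", attrs]),
    "\t".join(["contig00035", "Cufflinks", "exon", "3", "10", "1000", "+", ".", attrs + ' exon_number "1";']),
    "\t".join(["contig00035", "Cufflinks", "exon", "13", "18", "1000", "+", ".", attrs + ' exon_number "2";']),
    "\t".join(["contig00035", "Cufflinks", "exon", "20", "22", "1000", "+", ".", attrs + ' exon_number "3";']),
]
print(transcript_sequences(gtf_lines, genome))
# prints {'CUFF.23955.1': 'AGCGTCTCACGCGGTAT'}
```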
Hi Karen,
I just implemented this functionality in Galaxy's 'Extract Genomic DNA' tool. This functionality will be available on our main server in the next couple of weeks and is available now via our development repository ( bitbucket.org/galaxy/galaxy-central/ ).
One note: GTF files produced by Cuff* are unusual in that, for each assembled transcript, they include a "transcript" element in addition to exons. This element is problematic because it spans the entire transcript. Hence, in order to get the sequence data for transcripts in a Cuff* GTF file, you'll want to select for only exons (use Galaxy's 'Extract Features' tool) and then use the resultant dataset as input to Extract Genomic DNA.
Let us know if you have any questions.
Thanks,
J.
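The exon-only selection that Jeremy describes can be illustrated with a few lines of Python. This is a sketch of the idea only; `select_features` is a hypothetical helper, not the code behind Galaxy's 'Extract Features' tool, and it assumes a tab-delimited GTF with the feature type in the third column.

```python
def select_features(gtf_lines, feature="exon"):
    """Keep only GTF records whose third column equals the requested feature."""
    kept = []
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) >= 3 and fields[2] == feature:
            kept.append(line)
    return kept
```

Applied to Karen's four-record example, this drops the "transcript" record spanning 3-22 (which would pull in the two intronic gaps) and keeps the three exon records.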
participants (3)
- Jennifer Jackson
- Jeremy Goecks
- Karen Tang