Hi,

I have run some RNA-Seq analysis and I am trying to get the Ensembl gene annotations to show in the cuffcompare files. I have done the following:

1. Ran cufflinks analysis on the .bam files.
2. Got the .gtf file for hg19 from Ensembl. Based on the email below, I replaced the chromosome names 1, 2, 3, etc. with chr1, chr2, chr3, etc. Then I tried to run the processed .gtf file against itself through cuffcompare as recommended below, and I am getting an error.
3. If I try to run cuffcompare on two of my cufflinks data files and use the processed .gtf file as is, I get the same error.

Any input on what I am doing wrong is appreciated. I will be happy to share the history if needed.

Thanks and Regards,
Aarti

On 1/24/2011 9:24 PM, Rory Kirchner wrote:
For the ensembl annotation, you can download the gtf file from ensembl for your organism here:
http://uswest.ensembl.org/info/data/ftp/index.html
To use this, you need to fix it because the chromosome names do not match the ones in the bowtie indexes (this depends on your organism; they do not match for mouse and rat at least). If you are on a Mac or on a Unix machine, do this from the terminal (assuming your downloaded gtf file is named ensembl.gtf):
awk -F "\t" '{OFS="\t"; $1 = "chr"$1; print}' ensembl.gtf | awk -F"\t" '{OFS="\t"; if($1=="chrMT") $1="chrM"; print}' > ensembl_cleaned.gtf
This changes the Ensembl chromosome names from 1, 2, 3, 4, X, MT to chr1, chr2, chr3, chr4, chrX, chrM to match the bowtie index ids.
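One thing to watch for: if the downloaded file starts with comment lines (lines beginning with #), the command above will put "chr" in front of those too. A minimal single-pass sketch that leaves such lines alone is below. The function name add_chr_prefix and the three-line sample input are made up for illustration, and it assumes the only special case is MT, as in the command above.

```shell
#!/bin/sh
# add_chr_prefix is a hypothetical helper name used only for this sketch.
# It prints a GTF with "chr" prepended to column 1 and MT renamed to chrM.
add_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/  { print; next }   # pass comment lines through unchanged
         { $1 = ($1 == "MT") ? "chrM" : "chr" $1; print }' "$1"
}

# Made-up three-line input, just to show the effect.
tmpdir=$(mktemp -d)
printf '#!genome-build example\n1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1";\nMT\tsrc\texon\t5\t50\t.\t+\t.\tgene_id "g2";\n' > "$tmpdir/ensembl.gtf"
add_chr_prefix "$tmpdir/ensembl.gtf" > "$tmpdir/ensembl_cleaned.gtf"
cut -f1 "$tmpdir/ensembl_cleaned.gtf"
```

On a real file you would call it as add_chr_prefix ensembl.gtf > ensembl_cleaned.gtf.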
This file is unsorted, so it won't work with SAM files but it will work with the BAM files that tophat outputs. If you need to work with SAM files for some reason, this might work:
sort -k 1,1 -k 4,4n infile > outfile
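A slightly more defensive variant of that sort, as a sketch only: it splits fields on tabs explicitly, keeps any comment lines (starting with #) at the top, and fixes the locale so the ordering is the same on every machine. The name sort_gtf and the sample records are made up for illustration. Note that chromosome names are sorted as text, so chr10 comes before chr2.

```shell
#!/bin/sh
# sort_gtf is a hypothetical helper name used only for this sketch.
sort_gtf() {
    tab=$(printf '\t')
    grep '^#' "$1"                        # comment lines first, in original order
    grep -v '^#' "$1" | LC_ALL=C sort -t "$tab" -k1,1 -k4,4n
}

# Made-up records, deliberately out of order.
tmpdir=$(mktemp -d)
printf 'chr2\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "g3";\nchr1\tsrc\texon\t500\t600\t.\t+\t.\tgene_id "g2";\nchr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1";\n' > "$tmpdir/in.gtf"
sort_gtf "$tmpdir/in.gtf" > "$tmpdir/sorted.gtf"
cut -f1,4 "$tmpdir/sorted.gtf"
```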
Sorted or unsorted, if you run the reformatted gtf file through cuffcompare against itself (use it as both the reference gtf and the 'test' gtf), the GTF file that cuffcompare outputs will have all of the CDS, TSS and related information in it when you use it as the reference for cuffdiff.
-rory
On Jan 24, 2011, at 10:02 AM, Jeremy Goecks wrote:
Hi Matteo and Vasu,
There are different ways to refer to genes. Names that start with NM_ are RefSeq mRNA 'accession numbers,' and they are a valid way to refer to genes.
Matteo, what you may want is the canonical gene name (e.g. Xkr4). If so, you'll want to use a gene annotation/reference file from UCSC; when you are getting the file, you'll want to select the table with the word 'canonical' in it. E.g. for hg19/UCSC genes, there is a table called knownCanonical that provides the canonical gene names.
Thanks, J.
On Jan 24, 2011, at 9:38 AM, vasu punj wrote:
This is a known issue with the GTF file from Ensembl.
--- On Mon, 1/24/11, Matteo Bovolenta <bvlmtt@unife.it> wrote:
From: Matteo Bovolenta <bvlmtt@unife.it>
Subject: [galaxy-user] Gene Name in Cufflink/compare/diff
To: galaxy-user@bx.psu.edu
Date: Monday, January 24, 2011, 5:05 AM
Hi all,
When I run an RNA-Seq analysis using tophat, cufflinks, cuffcompare and cuffdiff, aligning my data to the RefSeq genes, I obtain tables from cufflinks/cuffcompare/cuffdiff that do not include the gene name, only the NM_ accession. Does anyone know how I can obtain all the tables with the gene name?
Thank you all very much for the support,
Best Regards,
Matteo
--
Matteo Bovolenta, PhD
Dipartimento di Medicina Sperimentale e Diagnostica
Sezione di Genetica Medica
Università di Ferrara
Via Fossato di Mortara, 74
44100 Ferrara
tel +39 0532 974449 (office)
tel +39 0532 974502 (lab)
fax +39 0532 236157
email bvlmtt@unife.it
http://www.unife.it/medicina/geneticamedica
http://www.bio-nmd.eu registered in ORPHANET http://www.orpha.net
_______________________________________________
galaxy-user mailing list
galaxy-user@lists.bx.psu.edu
http://lists.bx.psu.edu/listinfo/galaxy-user