Thanks Jeremy, I've changed the names from protein_id to p_id using text edit and
I'm going to try it again as soon as it's loaded back into the Galaxy site. I've
been shy of using the UCSC site because I'm not a bioinformatics person and I'm
learning on the fly, so I was not confident I would download the right GTF file for hg19 in
one simple (with emphasis on the word simple!) step. If you know exactly how to do it,
that'd be great. I'll let everyone know if the p_id thing fixes the ensemble.gtf
file for good.
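For anyone who would rather script the rename than do it by hand, here is a minimal Python sketch of the same edit. It assumes a standard tab-separated, 9-column GTF file; the function names (rename_protein_id, rename_file) are made up for illustration. Whether a plain rename is enough to satisfy Cuffdiff is exactly what I'm about to test, so treat this as a convenience, not a confirmed fix.

```python
import re

# Match "protein_id" only where it is an attribute key: at the start of the
# attribute column or right after a semicolon. Attribute values are untouched.
_KEY = re.compile(r'(^|;\s*)protein_id(\s+")')


def rename_protein_id(line: str) -> str:
    """Return a GTF line with the protein_id attribute key renamed to p_id."""
    if line.startswith("#"):
        return line  # header/comment line
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return line  # not a standard 9-column GTF record; leave unchanged
    fields[8] = _KEY.sub(r"\g<1>p_id\g<2>", fields[8])
    return "\t".join(fields) + ("\n" if line.endswith("\n") else "")


def rename_file(src: str, dst: str) -> None:
    """Copy src to dst, renaming protein_id to p_id on every record."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(rename_protein_id(line))
```

Calling rename_file("ensemble.gtf", "ensemble_p_id.gtf") would write a renamed copy and leave the original file alone (the output name is just an example).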
Dr David A. Matthews
Senior Lecturer in Virology
Department of Cellular and Molecular Medicine,
School of Medical Sciences
University of Bristol
Tel. +44 117 3312058
On 7 Dec 2010, at 15:58, Jeremy Goecks wrote:
> chr11 protein_coding CDS 129060 129388 . - 0 gene_id "ENSG00000230724"; transcript_id "ENST00000382784"; exon_number "1"; gene_name "AC069287.3"; transcript_name "AC069287.3-201"; protein_id "ENSP00000372234"
> Is the p_id problem in Cufflinks because the ensemble.gtf file uses the word
> protein_id and not p_id?
Yes, this is likely a problem. From the Cuffdiff documentation:
Cuffdiff takes a GTF file of transcripts as input, along with two or more SAM files
containing the fragment alignments for two or more samples. It produces a number of output
files that contain test results for changes in expression at the level of transcripts,
primary transcripts, and genes. It also tracks changes in the relative abundance of
transcripts sharing a common transcription start site, and in the relative abundances of
the primary transcripts of each gene. Tracking the former allows one to see changes in
splicing, and the latter lets one see changes in relative promoter use within a gene. If
you have more than one replicate for a sample, supply the SAM files for the sample as a
single comma-separated list. It is not necessary to have the same number of replicates for
each sample. Cuffdiff requires that transcripts in the input GTF be annotated with certain
attributes in order to look for changes in primary transcript expression, splicing, coding
output, and promoter use. These attributes are:
tss_id The ID of this transcript's inferred start site. Determines which primary
transcript this processed transcript is believed to come from.
p_id The ID of the coding sequence this transcript contains. This attribute is
attached to Cuffcompare output by Cuffcompare only when it is run with a reference
annotation that includes CDS records. Further, differential CDS analysis is only performed
when all isoforms of a gene have p_id attributes, because neither Cufflinks nor
Cuffcompare attempts to assign an open reading frame to transcripts.
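The "all isoforms of a gene have p_id attributes" condition above can be checked before running Cuffdiff. The following is a rough Python sketch, not anything taken from Cufflinks itself: the function name genes_missing_p_id is hypothetical, the input is assumed to be tab-separated 9-column GTF lines, and a transcript is counted as having a p_id if any one of its records carries the attribute. How Cuffdiff itself evaluates the condition is not confirmed here.

```python
import re
from collections import defaultdict

# Parses attribute pairs of the form: key "value"
_ATTR = re.compile(r'(\w+)\s+"([^"]*)"')


def genes_missing_p_id(lines):
    """Return the set of gene_ids with at least one transcript lacking p_id."""
    transcripts = defaultdict(set)  # gene_id -> every transcript_id seen
    with_p_id = defaultdict(set)    # gene_id -> transcript_ids carrying a p_id
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # skip anything that is not a 9-column record
        attrs = dict(_ATTR.findall(fields[8]))
        gene, tx = attrs.get("gene_id"), attrs.get("transcript_id")
        if gene is None or tx is None:
            continue
        transcripts[gene].add(tx)
        if "p_id" in attrs:
            with_p_id[gene].add(tx)
    # A gene is flagged if any of its transcripts never showed a p_id.
    return {g for g, txs in transcripts.items() if txs - with_p_id[g]}
```

Passing an open file handle, for example genes_missing_p_id(open("annotation.gtf")) with a placeholder file name, returns the genes that would presumably be skipped by the differential CDS analysis; an empty set means every transcript seen has a p_id.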
Does addressing this issue prompt Cuffdiff to output data to the differential coding
Also, you might try using gene annotation files from UCSC rather than Ensembl. Although
Ensembl is mentioned in the documentation, the problems that you and others have
encountered suggest that Cufflinks may have been developed using UCSC GTFs rather than
Ensembl GTFs and hence UCSC GTFs may work better.