Just a thought, I notice that in the ensemble.gtf file the protein IDs are listed as follows:

chr11 protein_coding CDS 129060 129388 . - 0 gene_id "ENSG00000230724"; transcript_id "ENST00000382784"; exon_number "1"; gene_name "AC069287.3"; transcript_name "AC069287.3-201"; protein_id "ENSP00000372234"

Is the p_id problem in Cufflinks because the ensemble.gtf file uses the word protein_id and not p_id?

Cheers
David

__________________________________
Dr David A. Matthews
Senior Lecturer in Virology
Room E49
Department of Cellular and Molecular Medicine,
School of Medical Sciences
University Walk,
University of Bristol
Bristol. BS8 1TD
U.K.
Tel. +44 117 3312058
D.A.Matthews@bristol.ac.uk
Yes, this is likely a problem. From the Cuffdiff documentation:

--
Cuffdiff takes a GTF file of transcripts as input, along with two or more SAM files containing the fragment alignments for two or more samples. It produces a number of output files that contain test results for changes in expression at the level of transcripts, primary transcripts, and genes. It also tracks changes in the relative abundance of transcripts sharing a common transcription start site, and in the relative abundances of the primary transcripts of each gene. Tracking the former allows one to see changes in splicing, and the latter lets one see changes in relative promoter use within a gene.

If you have more than one replicate for a sample, supply the SAM files for the sample as a single comma-separated list. It is not necessary to have the same number of replicates for each sample.

Cuffdiff requires that transcripts in the input GTF be annotated with certain attributes in order to look for changes in primary transcript expression, splicing, coding output, and promoter use. These attributes are:

Attribute   Description
tss_id      The ID of this transcript's inferred start site. Determines which primary transcript this processed transcript is believed to come from.
p_id        The ID of the coding sequence this transcript contains. This attribute is attached to Cuffcompare output by Cuffcompare only when it is run with a reference annotation that includes CDS records. Further, differential CDS analysis is only performed when all isoforms of a gene have p_id attributes, because neither Cufflinks nor Cuffcompare attempts to assign an open reading frame to transcripts.
--

Does addressing this issue prompt Cuffdiff to output data to the differential coding file?

Also, you might try using gene annotation files from UCSC rather than Ensembl. Although Ensembl is mentioned in the documentation, the problems that you and others have encountered suggest that Cufflinks may have been developed using UCSC GTFs rather than Ensembl GTFs, and hence UCSC GTFs may work better.

J.
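For anyone who wants to check which attributes a GTF record actually carries, here is a minimal Python sketch. It is an illustration only: the function name parse_gtf_attributes and the regular expression are my own assumptions, not part of Cufflinks or Galaxy, and it only handles quoted attribute values like the ones in the record above.

```python
import re

# Matches one `key "value"` pair in the ninth (attribute) column of a GTF line.
# Assumption: every value is double-quoted, as in the record quoted in this thread.
ATTRIBUTE_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_attributes(attribute_column):
    """Return the key/value pairs of a GTF attribute column as a dict."""
    return dict(ATTRIBUTE_RE.findall(attribute_column))


# The CDS record from the first message; a real GTF separates fields with tabs.
record = "\t".join([
    "chr11", "protein_coding", "CDS", "129060", "129388", ".", "-", "0",
    'gene_id "ENSG00000230724"; transcript_id "ENST00000382784"; '
    'exon_number "1"; gene_name "AC069287.3"; '
    'transcript_name "AC069287.3-201"; protein_id "ENSP00000372234"',
])

attributes = parse_gtf_attributes(record.split("\t")[8])

# The two attributes the Cuffdiff documentation quoted above asks for.
missing = [key for key in ("tss_id", "p_id") if key not in attributes]

print("attributes present:", sorted(attributes))
print("attributes Cuffdiff looks for but are absent:", missing)
```

On this record the sketch reports protein_id as present and both tss_id and p_id as absent, which matches what was observed by eye.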
Thanks Jeremy,

I've changed the names from protein_id to p_id using text edit and I'm going to try it again as soon as it's loaded back into the Galaxy site. I've been shy of using the UCSC site because I'm not a bioinformatics person and I'm learning on the fly, so I was not confident I would download the right GTF file for hg19 in one simple (with emphasis on the word simple!) step. If you know exactly how to do it, that'd be great. I'll let everyone know if the p_id thing fixes the ensemble.gtf file for good.

Cheers
David
Let's first back up to the original issue because I didn't address it correctly earlier.

Issue: Cuffdiff isn't producing all the output, and the problem is at least partially due to the GTF file being provided to it.

Solution: The GTF that you want to provide to Cuffdiff is not the reference GTF but the GTF generated by Cuffcompare; in particular, you want the GTF file of combined transcripts. So, at least with Cuffdiff, the problem is not where you got your reference GTF. The combined transcripts produced by Cuffcompare will have tss_id and p_id attributes included, so you won't need to worry about adding them.

Now, to the question of getting a reference gene annotation from UCSC. This is straightforward:

(a) in Galaxy tools, go to Get Data --> UCSC Main
(b) select clade/genome/assembly
(c) for group, use 'Genes and Gene Prediction Tracks'
(d) for track, use your favorite annotation; both RefSeq and Ensembl are good
(e) for table, choose the default (refGene for RefSeq, ensGene for Ensembl)
(f) region: genome
(g) output format: GTF
(h) make sure 'Send to Galaxy' is checked

That's it.

Hope this helps,
J.
Hi again,

Thanks for your input on this. I normally use the combined GTF from Cuffcompare in my analysis as you suggest, but there is never a p_id associated with the combined GTF. So I have fed into Cuffcompare the following:

1. the Ensembl GTF (fixed the chr)
2. the Ensembl GTF (fixed the chr and changed protein_id to p_id)
3. a UCSC GTF based on RefSeq, following your instructions
4. a UCSC GTF based on UCSC genes

In each case, after running t0 and t8 in Cuffcompare to generate a combined GTF, I used that one to run Cuffdiff. Every time, the CDS files are empty and no p_ids can be found anywhere (in any of the GTF files). This is driving me mad! I do not use a reference GTF for Cufflinks (because that would restrict the analysis to known genes only). Is that where I am going wrong? On the other hand, all the other things are filled out (e.g. splicing diff) and they seem, on the face of it, accurate.

Any ideas? Do you want access to my files?

Cheers
David
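A quick way to confirm the "no p_ids anywhere" observation, and to test the rule from the documentation that every isoform of a gene needs a p_id, might look like the Python sketch below. It rests on several assumptions: the function and helper names are hypothetical, the three example records are invented for illustration (they only imitate the XLOC_/TCONS_ naming that Cuffcompare uses in its combined GTF), and attribute values are assumed to be double-quoted.

```python
import re
from collections import defaultdict

# Matches one `key "value"` pair in the GTF attribute column (quoted values only).
ATTRIBUTE_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def genes_without_complete_p_id(gtf_lines):
    """Return sorted gene_ids for which at least one transcript has no p_id."""
    has_p_id = defaultdict(dict)  # gene_id -> {transcript_id: saw a p_id?}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTRIBUTE_RE.findall(fields[8]))
        gene, transcript = attrs.get("gene_id"), attrs.get("transcript_id")
        if gene is None or transcript is None:
            continue
        seen = has_p_id[gene].get(transcript, False)
        has_p_id[gene][transcript] = seen or "p_id" in attrs
    return sorted(g for g, flags in has_p_id.items() if not all(flags.values()))


def make_record(gene_id, transcript_id, extra=""):
    """Build one invented, tab-separated exon record for the demonstration."""
    attrs = f'gene_id "{gene_id}"; transcript_id "{transcript_id}"; {extra}'
    return "\t".join(["chr1", "Cufflinks", "exon", "100", "200", ".", "+", ".", attrs])


# Invented records: the second isoform of XLOC_000001 has no p_id.
example = [
    make_record("XLOC_000001", "TCONS_00000001", 'tss_id "TSS1"; p_id "P1";'),
    make_record("XLOC_000001", "TCONS_00000002", 'tss_id "TSS1";'),
    make_record("XLOC_000002", "TCONS_00000003", 'tss_id "TSS2"; p_id "P2";'),
]

print(genes_without_complete_p_id(example))
```

To run it on a real file, pass an open file handle instead of the example list, e.g. `with open("combined.gtf") as handle:` (the file name is a placeholder). If every gene comes back in the result, the GTF carries no usable p_id attributes at all, which is consistent with the empty CDS output described above.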
After writing you back yesterday, I reran some Cufflinks and Cuffcompare analyses to see if I could generate a GTF file with both tss_id and p_id. Like you, I've had no luck.

I expect this is a bug in Cuffcompare and something that Adam can best address. Have you heard back from him? Looking on seqanswers, there's an open but old thread asking for a fix; I've bumped it in hopes of getting a response.

http://seqanswers.com/forums/showthread.php?p=30942

Best,
J.
Hi,

Thanks for trying, at least it's not something I'm missing, which is reassuring! No response from Adam yet, I'll give him another prod...

Cheers
David
David,

Turns out the problem is that cuffcompare requires the -s option to be present (as well as the necessary sequence files) in order for p_id attributes to be generated in the combined transcripts file. This is definitely doable in Galaxy, but it's going to require a bit of time to add the necessary code to make it happen. I'd guess we'll have it done in about a week.

If you want help setting things up locally, let me know and we'll make it happen.

Best,
J.
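For a local run, the invocation could be assembled roughly as in the Python sketch below. This is a hedged illustration rather than a tested recipe: -r (reference annotation), -s (genome sequence) and -o (output prefix) are cuffcompare options as I understand the Cufflinks manual, but every file name here is a placeholder, the helper function is hypothetical, and whether -s expects a directory of per-chromosome FASTA files or a single multi-FASTA file should be checked against the manual for your Cufflinks version.

```python
import shlex


def build_cuffcompare_command(reference_gtf, genome_sequence, out_prefix, transcript_gtfs):
    """Return the argument list for a cuffcompare run that includes -s.

    The command is only built here, not executed, so the sketch runs
    without Cufflinks installed.
    """
    return [
        "cuffcompare",
        "-r", reference_gtf,    # reference annotation containing CDS records
        "-s", genome_sequence,  # genome sequence; needed for p_id per this thread
        "-o", out_prefix,       # prefix for the output files
        *transcript_gtfs,       # Cufflinks transcript GTFs, one per sample
    ]


# Placeholder file names; t0 and t8 are the two samples mentioned in the thread.
command = build_cuffcompare_command(
    reference_gtf="reference.gtf",
    genome_sequence="hg19.fa",
    out_prefix="cuffcmp",
    transcript_gtfs=["t0_transcripts.gtf", "t8_transcripts.gtf"],
)

# Print a copy-and-paste friendly version of the command line.
print(" ".join(shlex.quote(argument) for argument in command))
```

To actually run it, the list can be handed to `subprocess.run(command, check=True)`. The combined transcripts GTF written under the chosen prefix is then the file to give to Cuffdiff, and it is worth checking it for p_id attributes before starting the longer Cuffdiff run.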
Jeremy,

I think we've looked at the same threads in seqanswers! Hope this does fix the problem; pity it's not explained in the Cufflinks manual. Look forward to the update.

Cheers
David
Can Cuffdiff accept BAM as well as SAM format?
_______________________________________________ galaxy-user mailing list galaxy-user@lists.bx.psu.edu http://lists.bx.psu.edu/listinfo/galaxy-user
participants (3)
- Ann Loraine
- David Matthews
- Jeremy Goecks