Hi again,

Thanks for your input on this. I normally use the combined gtf from cuffcompare in my analysis, as you suggest, but there is never a p_id associated with the combined gtf. So I have fed the following into cuffcompare:

1. the Ensembl gtf (fixed the chr)
2. the Ensembl gtf (fixed the chr and changed protein_id to p_id)
3. a UCSC gtf based on RefSeq, following your instructions
4. a UCSC gtf based on UCSC genes

In each case, after running t0 and t8 in cuffcompare to generate a combined gtf, I used that one to run cuffdiff. Every time, the CDS files are empty and no p_ids can be found anywhere (in any of the gtf files). This is driving me mad!!

I do not use a reference gtf for cufflinks (because that would restrict the analysis to known genes only). Is that where I am going wrong? On the other hand, all the other outputs are filled out (e.g. splicing diff) and they seem, on the face of it, accurate.

Any ideas? Do you want access to my files?

Cheers
David

On 7 Dec 2010, at 18:20, Jeremy Goecks wrote:
Thanks Jeremy,

I've changed the names from protein_id to p_id using a text editor and I'm going to try it again as soon as it's loaded back into the Galaxy site. I've been shy of using the UCSC site because I'm not a bioinformatics person and I'm learning on the fly, so I was not confident I would download the right gtf file for hg19 in one simple (with emphasis on the word simple!) step. If you know exactly how to do it, that'd be great. I'll let everyone know if the p_id thing fixes the ensemble.gtf file for good.
Let's first back up to the original issue because I didn't address it correctly earlier.
Issue: Cuffdiff isn't producing all the output, and the problem is at least partially due to the GTF file being provided to it.

Solution: The GTF that you want to provide to Cuffdiff is not the reference GTF but the GTF generated by Cuffcompare; in particular, you want the GTF file of combined transcripts. So, at least with Cuffdiff, the problem is not where you got your reference GTF. The combined transcripts produced by Cuffcompare will have tss_id and p_id attributes included, so you won't need to worry about adding them.
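As a quick sanity check, the presence of these attributes in any GTF can be counted with a few lines of Python. This is only a sketch: the function name count_gtf_attributes and the two sample records are made up for illustration, and it assumes standard tab-separated GTF with key "value"; pairs in the ninth column.

```python
import re
from collections import Counter

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def count_gtf_attributes(lines, wanted=("tss_id", "p_id")):
    """Return (number of records, {attribute: records carrying it})."""
    counts = Counter()
    total = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        total += 1
        keys = {key for key, _ in ATTR_RE.findall(fields[8])}
        for name in wanted:
            if name in keys:
                counts[name] += 1
    return total, {name: counts[name] for name in wanted}


# Hypothetical records, only to show the call; real input would be the
# lines of the combined transcripts file downloaded from the history.
sample = [
    'chr1\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "XLOC_000001"; '
    'transcript_id "TCONS_00000001"; exon_number "1"; tss_id "TSS1"; p_id "P1";',
    'chr1\tCufflinks\texon\t300\t400\t.\t+\t.\tgene_id "XLOC_000002"; '
    'transcript_id "TCONS_00000002"; exon_number "1"; tss_id "TSS2";',
]
total, found = count_gtf_attributes(sample)
print(total, found)
```

For a real file, pass an open file handle instead of the sample list. A p_id count of zero would mean the file carries no p_id attributes at all.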
Now, to the question of getting a reference gene annotation from UCSC. This is straightforward:
(a) in Galaxy tools, go to Get Data --> UCSC Main
(b) select clade/genome/assembly
(c) for group, use 'Genes and Gene Prediction Tracks'
(d) for track, use your favorite annotation; both RefSeq and Ensembl are good
(e) for table, choose the default (refGene for RefSeq, ensGene for Ensembl)
(f) region: genome
(g) output format: GTF
(h) make sure 'Send to Galaxy' is checked
That's it.
Hope this helps, J.