Thanks Jeremy, I've changed the names from protein_id to p_id
using text edit and I'm going to try it again as soon as it's loaded back into the
galaxy site. I've been shy of using the UCSC site because I'm not a bioinformatics
person and I'm learning on the fly so I was not confident I would download the right
gtf file for hg19 in one simple (with emphasis on the word simple!) step - if you know
exactly how to do it that'd be great. I'll let everyone know if the p_id thing
fixes the ensemble.gtf file for good.
Let's first back up to the original issue because I didn't address it correctly
earlier.
Issue: Cuffdiff isn't producing all the output, and the problem is at least partially
due to the GTF file being provided to it.
Solution: The GTF that you want to provide to Cuffdiff is not the reference GTF but the
GTF generated by Cuffcompare; in particular, you want the GTF file of combined
transcripts. So, at least with Cuffdiff, the problem is not where you got your reference
GTF. The combined transcripts produced by Cuffcompare will have tss_id and p_id attributes
included, so you won't need to worry about adding them.
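If you want to confirm this before running Cuffdiff, here is a minimal Python sketch (not part of Galaxy or Cufflinks) that counts how many records in a GTF file lack each attribute. The function names are my own, and it assumes the usual GTF layout of nine tab-separated columns with key "value"; pairs in the last one.

```python
import re

# Matches one `key "value";` pair in the ninth GTF column.
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_attributes(field):
    """Return the attributes of one GTF record as a dict."""
    return dict(ATTR_RE.findall(field))


def count_missing(lines, required=("tss_id", "p_id")):
    """Return (records seen, {attribute: records lacking it})."""
    missing = {name: 0 for name in required}
    total = 0
    for line in lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a well-formed GTF record
        total += 1
        attrs = parse_attributes(columns[8])
        for name in required:
            if name not in attrs:
                missing[name] += 1
    return total, missing


# Example use; "combined.gtf" is a placeholder for your downloaded file:
# with open("combined.gtf") as handle:
#     print(count_missing(handle))
```

A nonzero count only tells you which attribute to look into; it does not by itself say what Cuffdiff will do with the file.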
Now, to the question of getting a reference gene annotation from UCSC. This is
straightforward:
(a) in Galaxy tools, go to Get Data --> UCSC Main
(b) select clade/genome/assembly
(c) for group, use 'Genes and Gene Prediction Tracks'
(d) for track, use your favorite annotation; both RefSeq and Ensembl are good
(e) for table, choose the default (refGene for RefSeq, ensGene for Ensembl)
(f) region: genome
(g) output format: GTF
(h) make sure 'Send to Galaxy' is checked
That's it.
Hope this helps,
J.