On Mar 31, 2011, at 12:30 PM, <ssassi(a)CCIB.MGH.HARVARD.EDU> wrote:
> Hi Jeremy,
> I used your exercise to perform an RNA-seq analysis. First I encountered a problem where the gene IDs were missing from the results. Jen from the Galaxy team suggested this:
>
> "Yes, the team has taken a look and there are a few things going on.
>
> The first is that when running the Cuffcompare program, a reference annotation file in GTF format should be used in order to obtain the same results as in Jeremy's exercise. This seemed to be missing from your runs, which produced badly formatted output and, in turn, poor results when Cuffdiff was run.
>
> The second has to do with the reference GTF file itself. For the best results, the GTF file must have the "gene_id" attribute defined in the 9th column of the file and the chromosome names must be in the same format as the genome native to Galaxy. Depending on the source of the reference GTF, one of these may need to be adjusted. Chromosome names can be adjusted using Galaxy's "Text Manipulation" tools. The gene_id attribute would need to be adjusted prior to loading into Galaxy.
>
> For mm9, the "Get Data -> UCSC Main table browser" tool can help you obtain all of the raw data necessary to create a complete GTF file with a gene_id identifier. Extract data from the track "RefSeq Genes" and output the primary data table "refGene" twice: first in GTF format, then again as the complete table in tabular format (not BED). Then, using your own tools, replace the GTF file's gene_id value (which by default is the same as transcript_id) with the gene name from the complete table (name2 value, column 12). Upload the edited GTF file and the tools will function as intended.
>
> The team is aware of the issues associated with GTF source files and is discussing solutions. Any changes to native data content will be reported to the mailing list in a News Brief or other communications.
>
> Our apologies for the inconvenience! Thanks for using Galaxy and please let us know if we can help again,
>
> Best,
>
> Jen
> Galaxy team"
>
>
> I followed the directions (or at least I think I did) and things seemed to work better, but there is one more issue. For example, in the file:
> Galaxy287-[Cuffdiff_on_data_197,_data_197,_and_data_274__isoform_FPKM_tracking].tabular.txt
> The column gene_short_name does not have any names in it. nearest_ref_id does have the gene ID info so I can still interpret the data, but I was wondering if there remains another problem that I'm not aware of with the GTF file.
Slim,
Please send questions to the galaxy-user mailing list (cc'd) rather than to individual Galaxy team members; there are many people on the list who may be able to address your question, and discussions are archived for future use as well. Without seeing your analysis, I'd suggest trying two things:
(1) Provide a gene annotation reference file to Cufflinks as well as to Cuffcompare and Cuffdiff; in other words, you'll want to do guided assembly.
(2) Try using an Ensembl GTF, which has the gene name in the attributes.
I think (2) is more likely to generate the results you want, but there are many known problems in using Ensembl GTFs with Cufflinks/compare/diff.
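For reference, the gene_id swap Jen describes in the quoted message could be sketched roughly as below. This is only an illustration, not part of Galaxy or Cufflinks: the function names are made up, and it assumes the refGene table was saved with its header line (so the name and name2 columns can be looked up by name rather than by position) and that the GTF attributes use the usual transcript_id "..."; and gene_id "..."; form.

```python
import re

# Patterns for the two attributes in column 9 of a GTF line (assumed format).
TRANSCRIPT_RE = re.compile(r'transcript_id "([^"]+)"')
GENE_RE = re.compile(r'gene_id "[^"]*"')


def load_transcript_to_gene(table_lines):
    """Map transcript accession (name) to gene symbol (name2).

    Assumes the first non-empty line is a tab-separated header such as
    '#bin<TAB>name<TAB>...<TAB>name2<TAB>...'.
    """
    rows = (ln.rstrip("\r\n").split("\t") for ln in table_lines if ln.strip())
    header = [h.lstrip("#") for h in next(rows)]
    name_col = header.index("name")
    name2_col = header.index("name2")
    needed = max(name_col, name2_col)
    return {r[name_col]: r[name2_col] for r in rows if len(r) > needed}


def swap_gene_id(gtf_line, tx_to_gene):
    """Return the GTF line with gene_id replaced by the gene symbol.

    Lines that are comments, malformed, or whose transcript is not in the
    table are returned unchanged (minus the trailing newline).
    """
    line = gtf_line.rstrip("\r\n")
    fields = line.split("\t")
    if line.startswith("#") or len(fields) < 9:
        return line
    match = TRANSCRIPT_RE.search(fields[8])
    if match and match.group(1) in tx_to_gene:
        gene = tx_to_gene[match.group(1)]
        # A function replacement avoids any backslash handling in the symbol.
        fields[8] = GENE_RE.sub(lambda m: 'gene_id "%s"' % gene, fields[8], count=1)
    return "\t".join(fields)


def rewrite_gtf(gtf_path, table_path, out_path):
    """Read a GTF and a refGene table, write a GTF with swapped gene_id."""
    with open(table_path) as table:
        tx_to_gene = load_transcript_to_gene(table)
    with open(gtf_path) as gtf, open(out_path, "w") as out:
        for gtf_line in gtf:
            out.write(swap_gene_id(gtf_line, tx_to_gene) + "\n")
```

You would call something like rewrite_gtf("refGene.gtf", "refGene.txt", "refGene.gene_id.gtf") (file names here are placeholders) and upload the output to Galaxy. Note that this only changes gene_id, as in Jen's instructions; it does not add any other attribute.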
Good luck,
J.