From: David Matthews <D.A.Matthews(a)bristol.ac.uk>
Date: 7 December 2010 15:00:01 GMT
Cc: Jennifer Jackson <jen(a)bx.psu.edu>
Subject: Re: cuffdiff using the combined gtf files from cuffcompare
Sorry about not replying via the user group mail list - lack of concentration on my part.
I also emailed Adam Roberts about this problem of multiple identical gene names and his
reply is below:
> Cufflinks creates XLOCs based on sets of overlapping transcripts. Sometimes, a gene
will have transcripts that don't share any exons, and will therefore be bundled into
separate XLOCs. I'm assuming that is what is happening here. If you only care about
the total for the whole gene, you can just take the sum.
So it seems the underlying problem is probably deciding when two transcripts really
belong to the same gene (!), which is not always a straightforward question. However, to
get a simple summed expression value I used the "group" command: I take the
isoforms.fpkm file, group on gene name and sum the FPKM values. That gives me a
straightforward list of expressed genes with a single number attached.
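
A minimal sketch of that group-and-sum step in Python, for anyone doing it outside Galaxy. It assumes a tab-separated tracking file with a header row; the column names "gene_short_name" and "FPKM" are assumptions and may differ in your file (cuffdiff output can have one FPKM column per sample), and the rows below are made-up data.

```python
import csv
import io
from collections import defaultdict


def sum_fpkm_by_gene(handle, gene_col="gene_short_name", fpkm_col="FPKM"):
    """Sum per-isoform FPKM values into one total per gene name.

    gene_col and fpkm_col are assumed column names; adjust to your file.
    """
    totals = defaultdict(float)
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        totals[row[gene_col]] += float(row[fpkm_col])
    return dict(totals)


# Made-up rows standing in for an isoforms tracking file.
example = io.StringIO(
    "tracking_id\tgene_short_name\tFPKM\n"
    "TCONS_1\tGENE_A\t1.5\n"
    "TCONS_2\tGENE_A\t2.5\n"
    "TCONS_3\tGENE_B\t4.0\n"
)
totals = sum_fpkm_by_gene(example)
for gene in sorted(totals):
    print(gene, totals[gene])
```

Note that this sums across every XLOC that carries the same gene name, which is the "just take the sum" approach Adam describes above.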
However, I now have two more problems (!!). I have a time course of samples T0, T8 and
T24, so I compare T0 with T8 and T8 with T24. However, the expression values for T8 are
not the same in the two comparisons - the T8 values from the T0/T8 run are slightly higher
than the T8 values from the T8/T24 run. I am using the latest ensembl.gtf (I fixed the chr
issue) and I made my own female_hg19 file (the data is from HeLa cells). Again, I've
emailed Adam about this to see what he thinks. I ran the analysis twice to (hopefully)
make sure I had not set it up wrong. So I have installed the latest cufflinks here on my
Mac and I'm running it again, but with all three samples at the same time - it will take a
day or two to complete.
My second and hopefully final problem is that the p_id and cds files from cuffdiff are
empty - I can't figure this one out. I've tried looking on SEQanswers but to no
avail, although others have seen the same thing. The ensembl.gtf file is mentioned in the
cufflinks manual so I assume it's the best one to use, but maybe I did something unintended
when I added the "chr" to the chromosome numbers.
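
One way to rule out the "chr" edit as a cause, sketched in Python. It assumes the Ensembl file uses bare chromosome names (1, 2, ..., X, Y, MT) in the first tab-separated column and that the reference uses UCSC-style names (chr1, ..., chrM); the function name is hypothetical. The point is that only column 1 changes and the attribute column is left untouched.

```python
def add_chr_prefix(gtf_line):
    """Prefix the sequence name (column 1) of a GTF line with 'chr'.

    Assumes Ensembl-style names; 'MT' is mapped to 'M' so the result
    matches the UCSC-style name 'chrM'. Comments and blank lines pass
    through unchanged, as do lines that already start with 'chr'.
    """
    if gtf_line.startswith("#") or not gtf_line.strip():
        return gtf_line
    seqname, sep, rest = gtf_line.partition("\t")
    if seqname.startswith("chr"):
        return gtf_line
    if seqname == "MT":
        seqname = "M"
    return "chr" + seqname + sep + rest


# Made-up GTF lines for illustration only.
line_1 = '1\tsource\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";'
line_mt = 'MT\tsource\texon\t10\t20\t.\t+\t.\tgene_id "G2"; transcript_id "T2";'
fixed_1 = add_chr_prefix(line_1)
fixed_mt = add_chr_prefix(line_mt)
print(fixed_1)
print(fixed_mt)
```

Comparing the ninth column before and after such a conversion would show whether the attributes were altered by accident.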
I added you as a user so you can see for yourself; any suggestions gratefully received.
P.S. Just love the site - excellent stuff - bored everyone to tears here about how great
Dr David A. Matthews
Senior Lecturer in Virology
Department of Cellular and Molecular Medicine,
School of Medical Sciences
University of Bristol
Tel. +44 117 3312058
On 7 Dec 2010, at 14:37, Jennifer Jackson wrote:
> Hello David,
> You are correct about the tools, so the problem is most likely with the original GTF
file. If gene_id is not assigned there correctly, then the data will not be sorted by
> Although GTF format is consistent (mostly!) between sources, the actual content can
vary. One example is from UCSC - the GTF output from the Table Browser will have the
transcript name assigned to both the gene_id and the transcript_id tags in the attributes
field (field 9). Extracting the gene name from the track and swapping it into the
GTF file's gene_id attribute tag would be a necessary pre-processing step before using
any downstream tool whose functionality depends on that attribute.
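
The gene_id swap described above can be sketched in Python as follows. This is an illustration under assumptions, not a Galaxy tool: the function name, the transcript-to-gene lookup table, and the identifiers in the example are all hypothetical, and the lookup table would have to be built separately from the track's gene name column.

```python
import re


def swap_gene_id(gtf_line, transcript_to_gene):
    """Replace the gene_id value with a gene name looked up by transcript_id.

    Lines that are comments, do not have 9 tab-separated fields, or whose
    transcript is missing from the lookup table are returned unchanged.
    """
    line = gtf_line.rstrip("\n")
    fields = line.split("\t")
    if line.startswith("#") or len(fields) != 9:
        return line
    match = re.search(r'transcript_id "([^"]+)"', fields[8])
    if match and match.group(1) in transcript_to_gene:
        gene = transcript_to_gene[match.group(1)]
        # A function replacement avoids any escape handling of the gene name.
        fields[8] = re.sub(
            r'gene_id "[^"]*"',
            lambda m: 'gene_id "%s"' % gene,
            fields[8],
            count=1,
        )
    return "\t".join(fields)


# Made-up line where gene_id and transcript_id both hold the transcript name.
line = 'chr1\tsource\texon\t100\t200\t.\t+\t.\tgene_id "TX_0001"; transcript_id "TX_0001";'
mapping = {"TX_0001": "GENE_A"}
fixed = swap_gene_id(line, mapping)
print(fixed)
```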
> The good news is that you should be able to use Galaxy's Text Manipulation tools
to do whatever file processing you need to do, from whatever input source you are using,
once you have the data content loaded into your history. Create->save->use a
workflow so that you only have to work out tedious file conversions step-by-step one
> If you need more help, please let us know and share your history: Options -> Share
or Publish -> Share with a user "jen(a)bx.psu.edu".
> Galaxy team
> ps. It is best to send data questions to the galaxy-user mail list to help the community
learn from each other. I am going to forward this answer there now, since this question
has come up a few times recently after the addition of the new tools.
> On 12/1/10 7:00 AM, David Matthews wrote:
>> Dear Jennifer,
>> I hope you can help. After running cuffdiff on my data with the combined
>> gtf files from cuffcompare I get the usual list of files back. However,
>> in the genes tracking file and the genes fpkm files many genes are
>> listed more than once. My understanding was that cuffdiff was supposed
>> to amalgamate these into one total for each gene id. Am I doing
>> something wrong?