David,
Sorry I haven't been responsive lately. I've definitely been swamped: writing a paper, finishing class projects, and planning a course I'm TAing next semester, all of which have to be done before I go on vacation on Monday. Obviously, this means that I might not get to all of your issues for a couple of weeks, but I will try my best.
The Cuffdiff thing definitely looks like it may be a bug. Is Galaxy using 0.9.3? If not, then it is probably the bug in 0.9.2 that we fixed in 0.9.3. Concerning the p_id issue, I honestly don't know much about these details of cuffcompare and cuffdiff, as my work has primarily been in the abundance estimation code. I suggest you email Cole about that, and let me know if he doesn't respond within a few days.
-Adam
On Wed, Dec 8, 2010 at 7:43 AM, David Matthews
<D.A.Matthews@bristol.ac.uk> wrote:
Hi again Adam,
Just been running a long email chat with the team at Galaxy and they too cannot get p_ids attached and their analysis shows no data in the cds files from cuffdiff - any news on that? Some people are suggesting the -s option must be run on cuffcompare - does this make sense to you?
Cheers
David
On 3 Dec 2010, at 16:34, Adam Roberts wrote:
No need to apologize. We really want Cufflinks to be useful and not just a method that we write papers about.
This problem with cuffdiff you mention definitely should not be happening, but before I go looking for a bug, I want to be sure that it really is happening. Are you sure you are looking at the matching sample? Also, can you give me a few lines from the output that show the different results?
Thanks.
-Adam
On Fri, Dec 3, 2010 at 3:32 AM, David Matthews
<D.A.Matthews@bristol.ac.uk> wrote:
Thanks Adam, that makes more sense to me now. A second question that has popped up is to do with analysing a 3 point time course. In Galaxy you can only get cuffdiff to look at two time points at once. But if I look at the output there is an oddity: the fpkms for a gene at the middle time point are not always the same in both analyses. For example, if I look at a gene called RAD50 at times 0h, 8h and 24h and compare 0 and 8, then compare 8 and 24, the fpkm for this gene at 8 hours should be the same in both analyses, but this is not always the case. Any thoughts on this?
P.S. I know it sounds like I'm moaning but I am really impressed with the whole suite of programs and especially now it's on Galaxy. Previously I was doing it all on my iMac and it struggled to do anything else sometimes! What this suite of programs offers is far superior to anything else anyone else is offering (especially the people at Sanger here in the UK who don't seem to be very interested in helping biologists like me get to grips with this kind of stuff!).
On 3 Dec 2010, at 04:07, Adam Roberts wrote:
Also, cuffdiff has code in it that automatically sums these. We have not implemented such code in cufflinks at this point, but we may do so in the future.
The FAIL means that the likelihood maximization algorithm did not converge on an isoform deconvolution. You can try setting -f 0.1, which will filter out isoforms with less than 10% of the expression for the gene and may help the calculation to converge.
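As a rough illustration of the filtering idea described above (not Cufflinks' own code), the sketch below drops isoforms whose share of a gene's total expression falls below a threshold. The function name, the isoform names, and the FPKM values are all hypothetical, and the real tool applies its filter during estimation and may define the threshold differently.

```python
def filter_minor_isoforms(isoform_fpkm, min_fraction=0.1):
    """Keep isoforms carrying at least min_fraction of the gene's total FPKM.

    Illustrative only: follows the description in the message above,
    not the actual Cufflinks implementation.
    """
    total = sum(isoform_fpkm.values())
    if total == 0:
        # Nothing is expressed, so there is nothing to filter against.
        return dict(isoform_fpkm)
    return {
        name: fpkm
        for name, fpkm in isoform_fpkm.items()
        if fpkm / total >= min_fraction
    }


# Hypothetical isoform abundances for a single gene.
example = {"iso_A": 70.0, "iso_B": 25.0, "iso_C": 5.0}
kept = filter_minor_isoforms(example, min_fraction=0.1)
print(sorted(kept))
```

With fewer low-abundance isoforms left to fit, the estimation has fewer parameters, which is why the filter may help convergence.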
-Adam
On Thu, Dec 2, 2010 at 8:05 PM, Adam Roberts
<adarob@gmail.com> wrote:
Cufflinks creates XLOCs based on sets of overlapping transcripts. Sometimes, a gene will have transcripts that don't share any exons, and will therefore be bundled into separate XLOCs. I'm assuming that is what is happening here. If you only care about the total for the whole gene, you can just take the sum.
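A minimal sketch of that sum, using three of the ABR rows quoted further down in this thread. It assumes the first number in each group of three is the FPKM point estimate and the other two are its confidence bounds; the function name and tuple layout are made up for illustration. Only the point estimates are summed, since adding confidence bounds together would not give a valid interval.

```python
from collections import defaultdict

# (tracking_id, gene_name, fpkm_sample_1, fpkm_sample_2)
# Values copied from the ABR rows quoted in this thread.
rows = [
    ("XLOC_008048", "ABR", 20.5239, 20.49),
    ("XLOC_008049", "ABR", 8.65474, 5.3011),
    ("XLOC_008052", "ABR", 4.60255, 0.0),
]


def sum_fpkm_by_gene(rows):
    """Sum per-locus FPKM point estimates for loci sharing a gene name."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for _xloc, gene, fpkm_1, fpkm_2 in rows:
        totals[gene][0] += fpkm_1
        totals[gene][1] += fpkm_2
    return {gene: tuple(values) for gene, values in totals.items()}


gene_totals = sum_fpkm_by_gene(rows)
for gene, (sample_1, sample_2) in gene_totals.items():
    print(f"{gene}\t{sample_1:.4f}\t{sample_2:.4f}")
```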
-Adam
On Thu, Dec 2, 2010 at 3:51 PM, David Matthews
<D.A.Matthews@bristol.ac.uk> wrote:
Hi,
Thanks for emailing me back about this. Here is a good example of a gene reported many times by cuffdiff in the "genes FPKM tracking file":
XLOC_008048 | - | - | ABR | - | chr17:906640-1012618 | 20.5239 | 19.4454 | 21.6025 | 20.49 | 19.4152 | 21.5649
XLOC_008049 | - | - | ABR | - | chr17:906640-1012618 | 8.65474 | 7.90632 | 9.40316 | 5.3011 | 4.84316 | 5.75905
XLOC_008050 | - | - | ABR | - | chr17:906640-1012618 | 31.345 | 29.7825 | 32.9074 | 24.4523 | 23.242 | 25.6627
XLOC_008051 | - | - | ABR | - | chr17:906640-1012618 | 87.0488 | 82.0643 | 92.0333 | 170.051 | 160.486 | 179.615
XLOC_008052 | - | - | ABR | - | chr17:906640-1012618 | 4.60255 | 4.27034 | 4.93476 | 0 | 0 | 0
XLOC_008053 | - | - | ABR | - | chr17:906640-1012618 | 39.4463 | 37.6705 | 41.2221 | 59.5061 | 56.8463 | 62.166
XLOC_008054 | - | - | ABR | - | chr17:906640-1012618 | 40.3393 | 38.0356 | 42.6429 | 16.5153 | 15.586 | 17.4446
XLOC_008055 | - | - | ABR | - | chr17:906640-1012618 | 3.07796 | 2.88305 | 3.27286 | 5.12889 | 4.80319 | 5.45459
XLOC_008056 | - | - | ABR | TSS5913 | chr17:906640-1012618 | 7.7536 | 7.3448 | 8.1624 | 7.78601 | 7.37482 | 8.1972
XLOC_008057 | - | - | ABR | - | chr17:906640-1012618 | 37.9869 | 35.9915 | 39.9823 | 19.9239 | 18.8859 | 20.9618
XLOC_008058 | - | - | ABR | - | chr17:906640-1012618 | 16.0369 | 14.9386 | 17.1352 | 7.83102 | 7.29567 | 8.36637
XLOC_008059 | - | - | ABR | - | chr17:906640-1012618 | 53.3928 | 51.194 | 55.5917 | 20.1807 | 19.3588 | 21.0025
XLOC_008060 | - | - | ABR | - | chr17:906640-1012618 | 17.99 | 16.9108 | 19.0692 | 19.626 | 18.4516 | 20.8004
XLOC_009397 | - | - | ABR | TSS6816,TSS6817 | chr17:906640-1012618 | 14.2054 | 8.00403 | 20.4068 | 12.0799 | 6.83849 | 17.3213
As you can see, the XLOC is different, but the gene name and the chromosome locations are all the same. I have used the "group data" tool on the Galaxy website to sum the FPKMs for every gene with the same gene name to get a single set of numbers for a given gene, but it seems to me that this file should have done it already, or maybe I'm misunderstanding something.
Thanks again!
Best Wishes,
David
P.S. I'm using the latest ensembl.gtf file when I run tophat and when I run cuffcompare to generate the combined gtf file. In both cases I do NOT restrict the program to annotated genes only.
On 2 Dec 2010, at 23:14, Adam Roberts wrote:
Can you send me some example lines from the files?
On Wed, Dec 1, 2010 at 7:10 AM, David Matthews
<D.A.Matthews@bristol.ac.uk> wrote:
Dear Adam,
I am using cufflinks to look at mRNA expression in virus infected cells. I am using cufflinks on the Galaxy server. When I run through the normal workflow, the files produced by cuffdiff do not seem to be quite right. The main issue is that both the gene expression files and the genes fpkm files contain repeat sets of numbers for the same gene. As I understand it, those files should contain a summed value for everything assigned to that gene name. Am I misunderstanding the problem? Am I doing something wrong? I use the ensembl gtf file during cuffcompare and use the combined gtf files for the cuffdiff along with the relevant sam files. If, on the other hand, I run cuffdiff with the same files and the ensembl gtf file, I get a single number for each gene (although strangely many genes fail to give any expression data and are reported as "FAIL").
What am I doing wrong?
Hope you can help,
Cheers
David
__________________________________
Dr David A. Matthews
Senior Lecturer in Virology
Room E49
Department of Cellular and Molecular Medicine,
School of Medical Sciences
University Walk,
University of Bristol
Bristol.
BS8 1TD
U.K.
Tel. +44 117 3312058