Hi Jeremy,
Just got this from Cole - the -s option is needed for sure and Adam also thinks that the version needs to be upgraded to the latest one - sorry to add more work to your plate!!
Glad to get to the bottom of it at last - perhaps a post in Seqanswers would be worthwhile...
David
Begin forwarded message:
> From: Cole Trapnell <cole(a)cs.umd.edu>
> Date: 8 December 2010 18:19:53 GMT
> To: David Matthews <D.A.Matthews(a)bristol.ac.uk>
> Cc: Adam Roberts <adarob(a)gmail.com>
> Subject: Re: Cufflinks
>
> Ah. Yes, we really need to add a note in the manual about this.
>
> Getting the p_ids to work requires both that your GTF file have annotated "CDS" type records (not just "exon" type records like the Cufflinks assembler spits out), and that you also supply a reference genome sequence with -s. David, is that what you're doing?
>
> C
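The first of Cole's two conditions (the GTF must contain "CDS" records, not only "exon" records) can be checked before running anything. The following is a minimal sketch, not part of the Cufflinks suite: it assumes a standard tab-separated GTF with the feature type in the third column, and the function names and sample records are made up for illustration. It says nothing about the second condition, the reference sequence supplied with -s.

```python
from collections import Counter


def count_feature_types(gtf_lines):
    """Count how often each feature type (GTF column 3) occurs."""
    counts = Counter()
    for line in gtf_lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:  # a well-formed GTF record has 9 columns
            counts[fields[2]] += 1
    return counts


def has_cds_records(gtf_lines):
    """True if at least one record is of type CDS."""
    return count_feature_types(gtf_lines)["CDS"] > 0


# Hypothetical records: an exon-only file, as an assembler might emit,
# and the same file with one annotated CDS record added.
exon_only = [
    'chr17\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]
with_cds = exon_only + [
    'chr17\tannotation\tCDS\t120\t180\t.\t+\t0\tgene_id "G1"; transcript_id "T1";\n',
]

exon_only_ok = has_cds_records(exon_only)
with_cds_ok = has_cds_records(with_cds)
print(exon_only_ok, with_cds_ok)
```

To check a real file, pass an open file handle instead of a list, for example `has_cds_records(open("annotation.gtf"))`, where the file name is a placeholder.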
>
>
> On Dec 8, 2010, at 1:15 PM, David Matthews wrote:
>
>> Hi Adam (and Cole),
>>
>> Thanks for the email, I certainly appreciate your situation and I imagine the popularity of the cufflinks suite means you get a lot of emails on that as well! The version they are using is 0.9.1 but the p_id problem seems to be a theme in seqanswers which is where the suggestion to use the -s option came from. Cole do you have any thoughts on the lack of p_ids and cds files? The galaxy team plan to introduce the -s option asap...
>>
>> Thanks again for your patience and input.
>>
>> Cheers
>> David
>>
>>
>> On 8 Dec 2010, at 18:08, Adam Roberts wrote:
>>
>>> David,
>>>
>>> Sorry I haven't been responsive lately. I've definitely been swamped: writing a paper, finishing class projects, and planning a course I'm TAing next semester -- All of which have to be done before I go on vacation on Monday. Obviously, this means that I might not get to all of your issues for a couple of weeks, but I will try my best.
>>>
>>> The Cuffdiff thing definitely looks like it may be a bug. Is Galaxy using 0.9.3? If not, then it is probably the bug in 0.9.2 that we fixed with that release. Concerning the p_id issue, I honestly don't know much about these details of cuffcompare and cuffdiff, as my work has primarily been in the abundance estimation code. I suggest you email Cole about that, and let me know if he doesn't respond within a few days.
>>>
>>> -Adam
>>>
>>> On Wed, Dec 8, 2010 at 7:43 AM, David Matthews <D.A.Matthews(a)bristol.ac.uk> wrote:
>>> Hi again Adam,
>>>
>>> Just been running a long email chat with the team at Galaxy and they too cannot get p_ids attached and their analysis shows no data in the cds files from cuffdiff - any news on that? Some people are suggesting the -s option must be run on cuffcompare - does this make sense to you?
>>>
>>> Cheers
>>> David
>>>
>>>
>>> On 3 Dec 2010, at 16:34, Adam Roberts wrote:
>>>
>>>> No need to apologize. We really want Cufflinks to be useful and not just a method that we write papers about.
>>>>
>>>> This problem with cuffdiff you mention definitely should not be happening, but before I go looking for a bug, I want to be sure that it really is happening. Are you sure you are looking at the matching sample? Also, can you give me a few lines from the output that show the different results?
>>>>
>>>> Thanks.
>>>>
>>>> -Adam
>>>>
>>>> On Fri, Dec 3, 2010 at 3:32 AM, David Matthews <D.A.Matthews(a)bristol.ac.uk> wrote:
>>>> Thanks Adam, that makes more sense to me now. A second question that has popped up is to do with analysing a 3 point time course. In Galaxy you can only get cuffdiff to look at two time points at once. But if I look at the output there is an oddity: the fpkms for a gene at the middle time point are not always the same in both analyses. For example, if I look at a gene called RAD50 at times 0h, 8h and 24h and compare 0 and 8, then compare 8 and 24, the fpkm for this gene at 8 hours should be the same in both analyses, but this is not always the case. Any thoughts on this?
>>>>
>>>> P.S. I know it sounds like I'm moaning but I am really impressed with the whole suite of programs, especially now it's on Galaxy. Previously I was doing it all on my iMac and it sometimes struggled to do anything else! What this suite of programs offers is far superior to anything anyone else is offering (especially the people at Sanger here in the UK who don't seem to be very interested in helping biologists like me get to grips with this kind of stuff!).
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 3 Dec 2010, at 04:07, Adam Roberts wrote:
>>>>
>>>>> Also, cuffdiff has code in it that automatically sums these. We have not implemented such code in cufflinks at this point, but we may do so in the future.
>>>>> The FAIL means that the likelihood maximization algorithm did not converge on an isoform deconvolution. You can try setting -f 0.1, which will filter out isoforms with less than 10% of the expression for the gene and may help the calculation to converge.
>>>>>
>>>>> -Adam
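Adam's description of the filter reads as a threshold on each isoform's share of its gene's expression. The sketch below illustrates that idea only. It assumes the share is measured against the gene's summed FPKM, which may not be how Cufflinks defines the fraction internally, and the function name and FPKM values are invented for the example.

```python
def filter_minor_isoforms(isoform_fpkm, min_fraction=0.1):
    """Keep isoforms whose FPKM is at least min_fraction of the gene total.

    isoform_fpkm maps an isoform name to its FPKM for a single gene.
    Assumption: the fraction is relative to the summed FPKM of the gene.
    """
    total = sum(isoform_fpkm.values())
    if total == 0:
        return {}  # nothing is expressed, so nothing passes the filter
    return {
        name: fpkm
        for name, fpkm in isoform_fpkm.items()
        if fpkm / total >= min_fraction
    }


# Made-up values: iso_c carries 5% of the gene's expression.
example = {"iso_a": 50.0, "iso_b": 45.0, "iso_c": 5.0}
kept = filter_minor_isoforms(example, min_fraction=0.1)
print(sorted(kept))
```

With a 10% threshold the 5% isoform is dropped, leaving fewer abundances to estimate, which is the sense in which the filter may help the calculation converge.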
>>>>>
>>>>> On Thu, Dec 2, 2010 at 8:05 PM, Adam Roberts <adarob(a)gmail.com> wrote:
>>>>> Cufflinks creates XLOCs based on sets of overlapping transcripts. Sometimes, a gene will have transcripts that don't share any exons, and will therefore be bundled into separate XLOCs. I'm assuming that is what is happening here. If you only care about the total for the whole gene, you can just take the sum.
>>>>>
>>>>> -Adam
>>>>>
>>>>>
>>>>> On Thu, Dec 2, 2010 at 3:51 PM, David Matthews <D.A.Matthews(a)bristol.ac.uk> wrote:
>>>>> Hi,
>>>>>
>>>>> Thanks for emailing me back about this. Here is a good example of a gene reported many times by cuffdiff in the "genes FPKM tracking file":
>>>>>
>>>>> XLOC_008048 - - ABR - chr17:906640-1012618 20.5239 19.4454 21.6025 20.49 19.4152 21.5649
>>>>> XLOC_008049 - - ABR - chr17:906640-1012618 8.65474 7.90632 9.40316 5.3011 4.84316 5.75905
>>>>> XLOC_008050 - - ABR - chr17:906640-1012618 31.345 29.7825 32.9074 24.4523 23.242 25.6627
>>>>> XLOC_008051 - - ABR - chr17:906640-1012618 87.0488 82.0643 92.0333 170.051 160.486 179.615
>>>>> XLOC_008052 - - ABR - chr17:906640-1012618 4.60255 4.27034 4.93476 0 0 0
>>>>> XLOC_008053 - - ABR - chr17:906640-1012618 39.4463 37.6705 41.2221 59.5061 56.8463 62.166
>>>>> XLOC_008054 - - ABR - chr17:906640-1012618 40.3393 38.0356 42.6429 16.5153 15.586 17.4446
>>>>> XLOC_008055 - - ABR - chr17:906640-1012618 3.07796 2.88305 3.27286 5.12889 4.80319 5.45459
>>>>> XLOC_008056 - - ABR TSS5913 chr17:906640-1012618 7.7536 7.3448 8.1624 7.78601 7.37482 8.1972
>>>>> XLOC_008057 - - ABR - chr17:906640-1012618 37.9869 35.9915 39.9823 19.9239 18.8859 20.9618
>>>>> XLOC_008058 - - ABR - chr17:906640-1012618 16.0369 14.9386 17.1352 7.83102 7.29567 8.36637
>>>>> XLOC_008059 - - ABR - chr17:906640-1012618 53.3928 51.194 55.5917 20.1807 19.3588 21.0025
>>>>> XLOC_008060 - - ABR - chr17:906640-1012618 17.99 16.9108 19.0692 19.626 18.4516 20.8004
>>>>> XLOC_009397 - - ABR TSS6816,TSS6817 chr17:906640-1012618 14.2054 8.00403 20.4068 12.0799 6.83849 17.3213
>>>>>
>>>>> As you can see, the XLOC is different, but the gene name and the chromosome locations are all the same. I have used the "group data" tool on the Galaxy website to sum the FPKMs for every gene with the same gene name to get a single set of numbers for a given gene, but it seems to me that this file should have done it already, or maybe I'm misunderstanding something.
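The per-gene summing described above can also be done outside Galaxy. This is a minimal sketch that assumes the column layout visible in the quoted lines: whitespace-separated fields with the gene name fourth, the first sample's FPKM seventh and the second sample's FPKM tenth. That layout is inferred from the lines themselves, not taken from the Cufflinks manual. Only three of the ABR rows are reproduced, and the confidence bounds are ignored because simply adding them may not be meaningful.

```python
from collections import defaultdict

# Three of the ABR rows quoted above, copied verbatim.
ROWS = """\
XLOC_008048 - - ABR - chr17:906640-1012618 20.5239 19.4454 21.6025 20.49 19.4152 21.5649
XLOC_008052 - - ABR - chr17:906640-1012618 4.60255 4.27034 4.93476 0 0 0
XLOC_008055 - - ABR - chr17:906640-1012618 3.07796 2.88305 3.27286 5.12889 4.80319 5.45459
"""


def sum_fpkm_by_gene(text):
    """Return {gene_name: [sample1_total, sample2_total]}."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 12:  # skip anything that is not a full data row
            continue
        gene = fields[3]
        totals[gene][0] += float(fields[6])  # FPKM, first sample
        totals[gene][1] += float(fields[9])  # FPKM, second sample
    return dict(totals)


totals = sum_fpkm_by_gene(ROWS)
for gene, (sample1, sample2) in totals.items():
    print(f"{gene}\t{sample1:.5f}\t{sample2:.5f}")
```

Grouping on the gene name assumes every locus has one; rows with "-" in that column would be pooled together and would need separate handling.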
>>>>>
>>>>> Thanks again!
>>>>>
>>>>> Best Wishes,
>>>>> David
>>>>>
>>>>> P.S. I'm using the latest ensembl.gtf file when I run tophat and when I run cuffcompare to generate the combined gtf file. In both cases I do NOT restrict the program to annotated genes only.
>>>>>
>>>>>
>>>>>
>>>>> On 2 Dec 2010, at 23:14, Adam Roberts wrote:
>>>>>
>>>>>> Can you send me some example lines from the files?
>>>>>>
>>>>>> On Wed, Dec 1, 2010 at 7:10 AM, David Matthews <D.A.Matthews(a)bristol.ac.uk> wrote:
>>>>>> Dear Adam,
>>>>>>
>>>>>> I am using cufflinks to look at mRNA expression in virus infected cells. I am using cufflinks on the Galaxy server. When I run through the normal workflow, the files produced by cuffdiff do not seem to be quite right. The main issue is that both the gene expression files and the genes fpkm files contain repeat sets of numbers for the same gene - as I understand it those files should contain a summed value for everything assigned to that gene name - am I misunderstanding the problem? Am I doing something wrong? I use the ensembl gtf file during cuffcompare and use the combined gtf files for the cuffdiff along with the relevant sam files. If, on the other hand, I run cuffdiff with the same files and the ensembl gtf file, I get a single number for each gene (although strangely many genes fail to give any expression data and are reported as "FAIL").
>>>>>>
>>>>>> What am I doing wrong?
>>>>>>
>>>>>> Hope you can help,
>>>>>>
>>>>>> Cheers
>>>>>> David
>>>>>>
>>>>>>
>>>>>> __________________________________
>>>>>> Dr David A. Matthews
>>>>>>
>>>>>> Senior Lecturer in Virology
>>>>>> Room E49
>>>>>> Department of Cellular and Molecular Medicine,
>>>>>> School of Medical Sciences
>>>>>> University Walk,
>>>>>> University of Bristol
>>>>>> Bristol.
>>>>>> BS8 1TD
>>>>>> U.K.
>>>>>>
>>>>>> Tel. +44 117 3312058
>>>>>>
>>>>>> D.A.Matthews(a)bristol.ac.uk
>>>>>>