Exceptionally high RPKM values of miRNA and other short genes in Cuffdiff's output
Hi all,

I have been analyzing RNA-seq data from mouse tissues. The reads are single-end and 51 bp long. I ran TopHat/Cufflinks/Cuffdiff to test for differential gene expression. In the Cuffdiff output, I got very high RPKM values for some miRNAs and some other short genes (less than 100 bp). These genes are among the genes with the highest RPKM. I think the RPKM values of these genes are probably too high to be true.

test_id             gene_id             gene     locus                   sample_1    sample_2  status  value_1   value_2  log2(fold_change)  test_stat  p_value  q_value   significant
ENSMUSG00000093077  ENSMUSG00000093077  Mir5105  5:146231229-146302874   Epithelium  Fiber     OK      1.53E+06  445558   -1.78097           -355.367   0.00715  0.016986  yes
ENSMUSG00000093098  ENSMUSG00000093098  Gm22641  7:130162450-133124354   Epithelium  Fiber     OK      87894.1   36474.7  -1.26887           -0.59863   0.4913   0.587174  no
ENSMUSG00000089855  ENSMUSG00000089855  Gm15662  10:105187662-105583874  Epithelium  Fiber     OK      42868.9   21566.5  -0.99114           -20.7066   0.0186   0.039568  yes
ENSMUSG00000092984  ENSMUSG00000092984  Mir5115  2:73012853-73012927     Epithelium  Fiber     OK      21104.8   8317.49  -1.34335           -447.314   0.0001   0.000354  yes
ENSMUSG00000086324  ENSMUSG00000086324  Gm15564  16:35926510-36037131    Epithelium  Fiber     OK      6443.35   3664.15  -0.81433           -1.52095   0.2129   0.301429  no
ENSMUSG00000092981  ENSMUSG00000092981  Mir5125  17:23803186-23824739    Epithelium  Fiber     OK      5974.14   2390.75  -1.32127           -0.34111   0.5746   0.661937  no

I checked some forums, and they said that this is a drawback of TopHat/Cufflinks/Cuffdiff when dealing with short genes, but I am still not clear about it. Has anyone had the same problem? What can I do in this situation? Can anyone suggest other good tools to test for (1) differential gene expression or (2) both differential gene expression and gene discovery?

Thank you,
Thanh
Hi Thanh,

This is due to Cuffdiff correcting for the size of short transcripts, which the authors call the "effective length correction". It is meant to compensate for the loss of shorter transcripts during size selection when the RNA-seq library is prepared. The default setting on Galaxy is to use the effective length correction. Cole Trapnell, the creator of the "Cuff-suite" tools, discusses this length correction here: http://seqanswers.com/forums/showpost.php?p=76430&postcount=32

Some library preparation protocols don't include a size selection. The one we favor, and which Illumina recommends, ScriptSeq v2 from Epicentre (owned by Illumina), does not include a size selection step. It would be great if there were an option in the Cuffdiff wrapper in Galaxy to turn off the effective length correction.

Cheers,
Mo Heydarian
PhD candidate
The Johns Hopkins School of Medicine
Department of Biological Chemistry
725 Wolfe Street
402 Biophysics
Baltimore, MD 21205
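To make the effect of an effective length correction concrete, here is a minimal Python sketch. It is not Cufflinks' actual implementation. It assumes a simplified model in which the effective length of a transcript of length l is the sum over fragment lengths f <= l of P(f) * (l - f + 1), with P a normal-shaped fragment length distribution. The mean (200 bp), standard deviation (80 bp), fragment count, and library size are illustrative assumptions only; the 75 bp length matches the Mir5115 locus in the table (2:73012853-73012927).

```python
import math


def fragment_length_pmf(mean=200.0, sd=80.0, max_frag=1000):
    """Normal-shaped fragment length distribution over 1..max_frag.

    mean, sd and max_frag are illustrative assumptions, not values
    taken from the thread. Index 0 corresponds to fragment length 1.
    """
    weights = [math.exp(-0.5 * ((f - mean) / sd) ** 2)
               for f in range(1, max_frag + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def effective_length(transcript_len, pmf):
    """Simplified effective length: sum of P(f) * (l - f + 1) for f <= l."""
    return sum(p * (transcript_len - f + 1)
               for f, p in enumerate(pmf, start=1)
               if f <= transcript_len)


def fpkm(fragments, length, total_mapped):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length * total_mapped)


if __name__ == "__main__":
    pmf = fragment_length_pmf()
    fragments = 100        # hypothetical fragment count for one gene
    total_mapped = 2e7     # hypothetical library size
    for length in (75, 500, 5000):
        eff = effective_length(length, pmf)
        print(f"length={length} bp  effective={eff:.1f} bp  "
              f"FPKM(raw)={fpkm(fragments, length, total_mapped):.1f}  "
              f"FPKM(effective)={fpkm(fragments, eff, total_mapped):.1f}")
```

Under these assumptions, a transcript much shorter than the typical fragment gets an effective length of only a few bases, so dividing by it inflates the expression value far more than it does for a long transcript, where the effective length is close to the annotated length.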
Hi Thanh,

If your primary goal is inference about differential 'gene' expression that takes biological variability into account, with biological replicates for each of two conditions, you might want to try (and compare!) edgeR, and optionally DESeq and VOOM/limma (e.g. see Dillies et al., http://bib.oxfordjournals.org/content/early/2012/09/15/bib.bbs046.long and http://wiki.galaxyproject.org/Events/GCC2013/Abstracts#Events.2FGCC2013.2FAb...). A set of *very much beta* tools is available for admin installation and user testing from the test toolshed, in the statistics section owned by fubar.

The edgeR tool can optionally run a 2-way GLM. It requires raw count matrices as input, which can be generated from a GTF/'gene' model of your choice and any number of mapped SAM/BAM inputs using the htseq-based companion tool in the same tool shed section. Please don't install it on a production machine yet, but we're getting good results from it. Feedback and code improvements are welcome from willing beta testers.

The R 3.0.x tool shed dependency package in particular is still under development and is likely to change substantially in the next week or two as we sort out a sane and generalised Atlas dependency installation.
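As a rough illustration of what a "raw count matrix" looks like, the Python sketch below merges per-sample count files into one genes-by-samples table. It is not the Galaxy companion tool mentioned above. It assumes each input is two-column, tab-separated text (gene ID, count) in the style of htseq-count output, and that summary rows start with "__" (as in recent HTSeq versions; older versions may differ). The sample names, gene names, and counts are made up.

```python
import csv


def read_counts(lines):
    """Parse two-column 'gene<TAB>count' lines into a dict.

    Rows whose ID starts with '__' are treated as summary rows
    (an assumption about the input format) and skipped.
    """
    counts = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 2 or row[0].startswith("__"):
            continue
        counts[row[0]] = int(row[1])
    return counts


def build_count_matrix(samples):
    """Merge per-sample counts into (header, rows).

    samples maps a sample name to an iterable of count lines, for
    example an open file. Genes missing from a sample get a count of 0.
    """
    per_sample = {name: read_counts(lines) for name, lines in samples.items()}
    genes = sorted(set().union(*per_sample.values()))
    header = ["gene_id"] + list(per_sample)
    rows = [[gene] + [per_sample[name].get(gene, 0) for name in per_sample]
            for gene in genes]
    return header, rows


if __name__ == "__main__":
    # Hypothetical inputs standing in for two per-sample count files.
    epithelium = ["geneA\t120", "geneB\t0", "__no_feature\t5000"]
    fiber = ["geneA\t45", "geneC\t7", "__ambiguous\t12"]
    header, rows = build_count_matrix(
        {"Epithelium_1": epithelium, "Fiber_1": fiber})
    print("\t".join(header))
    for row in rows:
        print("\t".join(str(value) for value in row))
```

The counts stay as unnormalized integers on purpose: count-based methods such as edgeR model raw counts and handle library size themselves, so no RPKM-style length scaling is applied here.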
Thank you, Mohammad and Ross, for your valuable information. I am re-running Cuffdiff with no effective length correction and also running the edgeR tool to see how the results turn out.

Thanh
participants (3)
- Hoang, Thanh
- Mohammad Heydarian
- Ross