Hi Thanh,
This is due to Cuffdiff correcting for the length of short transcripts; the authors call it the "effective length correction". It is meant to compensate for the loss of fragments from shorter transcripts during size selection when you create your RNA-seq library. On Galaxy, the "effective length correction" is enabled by default.
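
To give a feel for why this matters for very short genes, here is a rough sketch in Python. It is not the Cufflinks implementation: as I understand it, Cufflinks derives the effective length from the fragment length distribution, whereas the `effective_length` function below uses a simplified "length minus mean fragment size plus one" approximation. The read counts, library size, and mean fragment size are made-up numbers for illustration only.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


def effective_length(length_bp, mean_fragment_bp):
    """Simplified effective length (an assumption, not the Cufflinks formula):
    the number of positions where a fragment of mean size could start,
    floored at 1 so the denominator never reaches zero."""
    return max(length_bp - mean_fragment_bp + 1, 1)


# Hypothetical values, chosen only to show the trend.
total_mapped = 20_000_000
mean_fragment = 60
examples = [("short_gene", 75, 50), ("typical_mRNA", 2000, 50)]

for name, length, count in examples:
    plain = fpkm(count, length, total_mapped)
    corrected = fpkm(count, effective_length(length, mean_fragment), total_mapped)
    print(f"{name}: length={length} bp, FPKM={plain:.2f}, "
          f"corrected FPKM={corrected:.2f}, ratio={corrected / plain:.2f}")
```

With these made-up numbers the correction barely changes the 2000 bp transcript, but it multiplies the value for the 75 bp gene several-fold, because the effective length in the denominator becomes very small.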

Cole Trapnell, the creator of the "Cuff-suite tools", discusses this length correction here: 
http://seqanswers.com/forums/showpost.php?p=76430&postcount=32

Some library preparation protocols don't include a size selection step, so for those libraries the correction may overstate the expression of short transcripts. The protocol we favor, and that Illumina recommends, ScriptSeq v2 from Epicentre (owned by Illumina), is one of them. It would be great if the Cuffdiff wrapper in Galaxy had an option to turn off the effective length correction.
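
If you can run Cuffdiff from the command line instead of through Galaxy, the correction can be switched off there. As far as I recall from the Cufflinks manual, the flag is `--no-effective-length-correction`; please check `cuffdiff --help` for your version. The small Python sketch below only assembles the command line and prints it. The helper name `build_cuffdiff_command` and all file names are hypothetical placeholders.

```python
import shlex


def build_cuffdiff_command(gtf, sample_bams, labels, output_dir,
                           effective_length_correction=True):
    """Assemble a cuffdiff argument list (hypothetical helper).

    File names are supplied by the caller; nothing is executed here.
    """
    cmd = ["cuffdiff", "-o", output_dir, "-L", ",".join(labels)]
    if not effective_length_correction:
        # Flag name as I recall it from the Cufflinks manual; verify locally.
        cmd.append("--no-effective-length-correction")
    cmd.append(gtf)
    cmd.extend(sample_bams)
    return cmd


# Placeholder inputs for a two-sample comparison.
cmd = build_cuffdiff_command(
    gtf="genes.gtf",
    sample_bams=["epithelium.bam", "fiber.bam"],
    labels=["Epithelium", "Fiber"],
    output_dir="cuffdiff_out",
    effective_length_correction=False,
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.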



Cheers, 
Mo Heydarian

PhD candidate
The Johns Hopkins School of Medicine
Department of Biological Chemistry
725 Wolfe Street
402 Biophysics
Baltimore, MD 21205


On Thu, Jul 18, 2013 at 12:55 PM, Hoang, Thanh <hoangtv@miamioh.edu> wrote:
Hi all,
I have been analyzing my RNA-seq data from mouse tissues. The reads are single-end and 51 bp long. I ran TopHat/Cufflinks/Cuffdiff to test for differential gene expression.
In the Cuffdiff output, I got very high RPKM values for some miRNAs and some other short genes (less than 100 bp). These genes are among those with the highest RPKM. I think the RPKM values of these genes are probably too high to be true.
test_id             gene_id             gene     locus                   sample_1    sample_2  status  value_1   value_2  log2(fold_change)  test_stat  p_value  q_value   significant
ENSMUSG00000093077  ENSMUSG00000093077  Mir5105  5:146231229-146302874   Epithelium  Fiber     OK      1.53E+06  445558   -1.78097           -355.367   0.00715  0.016986  yes
ENSMUSG00000093098  ENSMUSG00000093098  Gm22641  7:130162450-133124354   Epithelium  Fiber     OK      87894.1   36474.7  -1.26887           -0.59863   0.4913   0.587174  no
ENSMUSG00000089855  ENSMUSG00000089855  Gm15662  10:105187662-105583874  Epithelium  Fiber     OK      42868.9   21566.5  -0.99114           -20.7066   0.0186   0.039568  yes
ENSMUSG00000092984  ENSMUSG00000092984  Mir5115  2:73012853-73012927     Epithelium  Fiber     OK      21104.8   8317.49  -1.34335           -447.314   0.0001   0.000354  yes
ENSMUSG00000086324  ENSMUSG00000086324  Gm15564  16:35926510-36037131    Epithelium  Fiber     OK      6443.35   3664.15  -0.81433           -1.52095   0.2129   0.301429  no
ENSMUSG00000092981  ENSMUSG00000092981  Mir5125  17:23803186-23824739    Epithelium  Fiber     OK      5974.14   2390.75  -1.32127           -0.34111   0.5746   0.661937  no

I checked some forums, and they said that this is a drawback of TopHat/Cufflinks/Cuffdiff when dealing with short genes, but I am still not clear about it. Has anyone had the same problem? What can I do in this situation?
Can anyone suggest other good tools to test for (1) differential gene expression OR (2) both differential gene expression and gene discovery?

Thank you
Thanh
