On Thu, Jul 18, 2013 at 8:42 PM, Ross <ross.lazarus@gmail.com> wrote:

Hi, Thanh,
If your primary goal is inference about differential 'gene' expression that takes biological variability into account, using biological replicates for each of two conditions, you might want to try (and compare!) edgeR, and optionally DESeq and VOOM/limma. For background see, e.g., Dillies et al., http://bib.oxfordjournals.org/content/early/2012/09/15/bib.bbs046.long and http://wiki.galaxyproject.org/Events/GCC2013/Abstracts#Events.2FGCC2013.2FAbstracts.2FPosters.P4:_Comparing_R-based_methods_and_Cuffdiff2_for_analysis_of_RNA-seq_data_in_Galaxy. A set of *very much beta* tools is available for admin installation and user testing from the test toolshed, in the statistics section, owned by fubar.

The edgeR tool can optionally run a 2-way GLM. It requires a raw count matrix as input, which can be generated from a GTF/'gene' model of your choice and any number of mapped SAM/BAM inputs using the htseq-based companion tool in the same tool shed section. Please don't install it on a production machine yet. We're getting good results from it, and feedback and code improvements are welcome from willing beta testers.
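To make "raw count matrix" concrete, here is a rough Python sketch of merging per-sample htseq-count style outputs (two tab-separated columns: feature ID and count) into one genes-by-samples table. This is not the code of the Galaxy companion tool. The function names, sample names, gene IDs and counts below are all made up for illustration.

```python
import io

# Summary rows that htseq-count appends after the per-feature counts.
# Older releases print the bare names, newer ones prefix them with "__".
SPECIAL_ROWS = {
    "no_feature",
    "ambiguous",
    "too_low_aQual",
    "not_aligned",
    "alignment_not_unique",
}


def read_htseq_counts(handle):
    """Parse one htseq-count style output into {feature_id: raw_count}."""
    counts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        feature_id, value = line.split("\t")
        if feature_id.lstrip("_") in SPECIAL_ROWS:
            continue  # skip the summary rows, keep only real features
        counts[feature_id] = int(value)
    return counts


def build_count_matrix(samples):
    """Merge {sample_name: {feature_id: count}} into (header, rows).

    Features missing from a sample are given a count of 0.
    """
    sample_names = list(samples)
    gene_ids = sorted(set().union(*(c.keys() for c in samples.values())))
    header = ["gene_id"] + sample_names
    rows = [[g] + [samples[s].get(g, 0) for s in sample_names] for g in gene_ids]
    return header, rows


# Hypothetical inputs: made-up gene IDs and counts, one replicate per condition.
epithelium_1 = io.StringIO("geneA\t12\ngeneB\t340\nno_feature\t1000\n")
fiber_1 = io.StringIO("geneA\t5\ngeneB\t190\n__ambiguous\t20\n")

header, rows = build_count_matrix({
    "epithelium_1": read_htseq_counts(epithelium_1),
    "fiber_1": read_htseq_counts(fiber_1),
})
for row in [header] + rows:
    print("\t".join(str(x) for x in row))
```

The counts stay as unnormalised integers, since raw counts are what the edgeR tool expects as input.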

The R 3.0.x tool shed dependency package in particular is still under development and is likely to change substantially in the next week or two as we sort out a sane and generalised Atlas dependency installation.

On Fri, Jul 19, 2013 at 2:55 AM, Hoang, Thanh <hoangtv@miamioh.edu> wrote:

Hi all,
I have been analyzing RNA-seq data from mouse tissues. The reads are single-end and 51 bp long. I ran TopHat/Cufflinks/Cuffdiff to test for differential gene expression.

In the Cuffdiff output, I got very high RPKM values for some miRNAs and some other short genes (less than 100 bp). These genes are among the genes with the highest RPKM. I think the RPKM values of these genes are probably too high to be true.

test_id             gene_id             gene     locus                   sample_1    sample_2  status  value_1   value_2  log2(fold_change)  test_stat  p_value  q_value   significant
ENSMUSG00000093077  ENSMUSG00000093077  Mir5105  5:146231229-146302874   Epithelium  Fiber     OK      1.53E+06  445558   -1.78097           -355.367   0.00715  0.016986  yes
ENSMUSG00000093098  ENSMUSG00000093098  Gm22641  7:130162450-133124354   Epithelium  Fiber     OK      87894.1   36474.7  -1.26887           -0.59863   0.4913   0.587174  no
ENSMUSG00000089855  ENSMUSG00000089855  Gm15662  10:105187662-105583874  Epithelium  Fiber     OK      42868.9   21566.5  -0.99114           -20.7066   0.0186   0.039568  yes
ENSMUSG00000092984  ENSMUSG00000092984  Mir5115  2:73012853-73012927     Epithelium  Fiber     OK      21104.8   8317.49  -1.34335           -447.314   0.0001   0.000354  yes
ENSMUSG00000086324  ENSMUSG00000086324  Gm15564  16:35926510-36037131    Epithelium  Fiber     OK      6443.35   3664.15  -0.81433           -1.52095   0.2129   0.301429  no
ENSMUSG00000092981  ENSMUSG00000092981  Mir5125  17:23803186-23824739    Epithelium  Fiber     OK      5974.14   2390.75  -1.32127           -0.34111   0.5746   0.661937  no

I checked some forums, and people there said that this is a drawback of TopHat/Cufflinks/Cuffdiff when dealing with short genes, but I am still not clear about why. Has anyone had the same problem? What can I do in this situation?

Can anyone suggest other good tools to test for (1) differential gene expression OR (2) both differential gene expression and gene discovery?
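One way to see why short genes can end up with extreme values is the standard RPKM definition itself: reads per kilobase of gene per million mapped reads. Gene length sits in the denominator, so the same handful of reads gives a much larger number on a 75 bp gene than on a 2 kb gene. A minimal Python sketch follows. The read counts and library size are made-up illustrative numbers, and Cuffdiff's own estimate involves further modelling that this does not attempt to reproduce.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of gene per Million mapped reads.

    RPKM = 1e9 * reads_on_gene / (gene_length_bp * total_mapped_reads)
    """
    return 1e9 * read_count / (gene_length_bp * total_mapped_reads)


# Hypothetical library of 20 million mapped reads (illustrative only).
TOTAL_MAPPED = 20_000_000

# The same 100 reads assigned to a 75 bp gene and to a 2000 bp gene.
short_gene = rpkm(100, 75, TOTAL_MAPPED)
long_gene = rpkm(100, 2000, TOTAL_MAPPED)

print(f"75 bp gene:   RPKM = {short_gene:.2f}")
print(f"2000 bp gene: RPKM = {long_gene:.2f}")
print(f"ratio short/long = {short_gene / long_gene:.1f}")
```

With 51 bp reads, a few reads overlapping a sub-100 bp feature already cover most of it, so small differences in how reads are assigned can move the RPKM of such genes a lot. Count-based tools sidestep the length division when comparing the same gene between conditions, because gene length is the same in both.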

Thank you
Thanh