Hi, Jennifer,
Thanks for your reply!
My raw RNA-seq data was mapped to hg19 without a reference GTF on our local instance. To troubleshoot, I tried the following:
(1) Mapped the data again with TopHat against hg19 using the iGenome ensembl.GTF, then ran Cuffdiff to find differentially expressed genes. There are still 250 significant genes.
(2) Mapped the data again with TopHat against hg19 without a reference GTF, then used Cufflinks with Homo_sapiens.GRCh37.69.gtf downloaded from ensembl.org. Same result: 250 significant genes.
(3) Mapped the data again with TopHat against hg19 without a reference GTF, then used Cufflinks with the RefSeq refFlat.GTF. The result is ~1000 significant genes.
(4) Mapped the data again with TopHat against hg19 without a reference GTF, then used Cufflinks with the iGenome refseq.GTF. The result is ~1000 significant genes.
However, I still need to confirm which release or version of the hg19 reference genome I am using. Do you think the different results are caused by mapping to different hg19 genomes? If so, how can I match an hg19 build to the correct GTF? I thought the choice of Ensembl or RefSeq would not affect the results of the Cuffdiff step. These reference GTF files (refFlat.GTF, iGenome refseq.GTF, or iGenome ensembl.GTF) should all represent complete transcripts.
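One quick check for whether a GTF matches a genome build is to compare sequence names. UCSC hg19 names its chromosomes "chr1", "chr2", and so on, while Ensembl GRCh37 files use "1", "2", and so on, and annotation on a sequence name that is absent from the alignments cannot pick up any reads. The following is a minimal sketch of that comparison in Python, not a definitive procedure; the function names are hypothetical and the inputs are assumed to be a plain-text FASTA file and a plain-text GTF file.

```python
def fasta_seqnames(lines):
    """Collect sequence names from FASTA header lines (text after '>' up to whitespace)."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.add(parts[0])
    return names


def gtf_seqnames(lines):
    """Collect the first column (seqname) of every non-comment GTF record."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


def compare_seqnames(genome_names, gtf_names):
    """Report which names are shared and which appear on one side only."""
    return {
        "shared": sorted(genome_names & gtf_names),
        "gtf_only": sorted(gtf_names - genome_names),
        "genome_only": sorted(genome_names - gtf_names),
    }


# Example usage (file names are placeholders, not paths from this thread):
# with open("genome.fa") as fa, open("annotation.gtf") as gtf:
#     print(compare_seqnames(fasta_seqnames(fa), gtf_seqnames(gtf)))
```

A long "gtf_only" list suggests the annotation and the genome use different naming schemes or different builds. Matching names alone do not prove the coordinates agree, so the release notes of each file are still worth checking.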
Wei
On Mon, Jan 7, 2013 at 5:27 PM, Jennifer Hillman-Jackson
<jen@bx.psu.edu> wrote:
Hello Wei,
The contents of the reference GTF files (original, before analysis) will probably provide some explanation. My guess is that the GTF files have different contents and are not directly comparable - RefSeq with full transcripts and Ensembl with full transcripts + potentially partial predictions and/or predicted splice sites. Alternative versions of each may be available. When possible, you will most likely want to use a reference GTF file that represents complete transcripts.
I don't know what genome you are using, but you can check the source notes at Ensembl (& NCBI) to find out what each annotation build contains. A raw count of the number of entries in the GTF files can also be a clue - if the counts are greatly different, then you very likely have different populations in the two files.
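The raw count mentioned above can be sketched as follows. This is an illustrative Python example, not part of Galaxy or Cufflinks: the function name summarize_gtf is hypothetical, and it assumes a standard nine-column, tab-separated GTF whose attribute column carries gene_id and transcript_id values in double quotes.

```python
import collections
import re

# Matches attribute pairs such as: gene_id "ENSG00000000001";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def summarize_gtf(lines):
    """Count records per feature type and distinct gene_id / transcript_id values."""
    features = collections.Counter()
    genes = set()
    transcripts = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        features[fields[2]] += 1
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "gene_id" in attrs:
            genes.add(attrs["gene_id"])
        if "transcript_id" in attrs:
            transcripts.add(attrs["transcript_id"])
    return {
        "features": dict(features),
        "genes": len(genes),
        "transcripts": len(transcripts),
    }


# Example usage (file names are placeholders):
# for path in ("ensembl.gtf", "refseq.gtf"):
#     with open(path) as handle:
#         print(path, summarize_gtf(handle))
```

Running this on each GTF and comparing the gene and transcript totals side by side gives a first indication of whether the two files describe different populations.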
Good luck with your project!
Jen
Galaxy team
On 1/7/13 1:47 PM, Wei Liao wrote:
Hi all,
I am analyzing significantly differentially expressed genes for a
normal vs. tumor pair, using Cuffdiff 2.0.2.
I noticed that the Ensembl GTF and the RefSeq GTF give very different
numbers of significantly differentially expressed genes.
With the Ensembl GTF, only 250 genes are differentially expressed.
But with the RefSeq GTF, there are about 1000 genes.
I am running these data on a Galaxy server with the same workflow.
Can anyone explain what is going on here? And which result should I trust?
Thanks.
--
Wei Liao
Research Scientist,
Brentwood Biomedical Research Institute
16111 Plummer St.
Bldg 7, Rm D-122
North Hills, CA 91343
818-891-7711 ext 7645
--
Jennifer Hillman-Jackson
Galaxy Support and Training
http://galaxyproject.org