Different numbers of differentially expressed genes identified using Ensembl or RefSeq GTF
On 1/7/13 1:47 PM, Wei Liao wrote:
Hi all,
I am analyzing significantly differentially expressed genes for a normal vs. tumor pair using Cuffdiff 2.0.2. I noticed that the number of genes called significantly differentially expressed differs greatly depending on whether I use the Ensembl GTF or the RefSeq GTF.
With the Ensembl GTF, only 250 genes are differentially expressed. With the RefSeq GTF, there are about 1000.
I am running these data on a Galaxy server with the same workflow in both cases.
Can anyone explain what is going on here? Which result should I trust?
Thanks.
-- Wei Liao Research Scientist, Brentwood Biomedical Research Institute 16111 Plummer St. Bldg 7, Rm D-122 North Hills, CA 91343 818-891-7711 ext 7645
On 1/7/13 8:27 PM, "Jennifer Hillman-Jackson" <jen@bx.psu.edu> wrote:
Hello Wei,
The contents of the original reference GTF files (before analysis) will probably provide some explanation. My guess is that the two GTF files have different contents and are not directly comparable: RefSeq with full transcripts, and Ensembl with full transcripts plus potentially partial predictions and/or predicted splice sites. Alternative versions of each may be available. When possible, you will most likely want to use a reference GTF file that represents complete transcripts.
I don't know what genome you are using, but you can check the source notes at Ensembl (and NCBI) to find out what each annotation build contains. A raw count of the number of entries in each GTF file can also be a clue: if the counts differ greatly, then you very likely have different populations in the two files.
Good luck with your project!
Jen Galaxy team
___________________________________________________________ The Galaxy User list should be used for the discussion of Galaxy analysis and other features on the public server at usegalaxy.org. Please keep all replies on the list by using "reply all" in your mail client. For discussion of local Galaxy instances and the Galaxy source code, please use the Galaxy Development list:
http://lists.bx.psu.edu/listinfo/galaxy-dev
To manage your subscriptions to this and other Galaxy lists, please use the interface at:
-- Jennifer Hillman-Jackson Galaxy Support and Training http://galaxyproject.org
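The raw count Jen suggests can be sketched in a few lines of Python. This is only a minimal illustration, assuming standard nine-column, tab-separated GTF with `gene_id` and `transcript_id` attributes; the function name `summarize_gtf` and the example records are made up for this sketch and do not come from either annotation file.

```python
import re
from collections import Counter

# Matches attribute pairs such as: gene_id "G1";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def summarize_gtf(lines):
    """Count records, feature types, and distinct gene/transcript IDs."""
    features = Counter()
    genes, transcripts = set(), set()
    for line in lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        features[fields[2]] += 1
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        if "gene_id" in attrs:
            genes.add(attrs["gene_id"])
        if "transcript_id" in attrs:
            transcripts.add(attrs["transcript_id"])
    return {
        "records": sum(features.values()),
        "features": dict(features),
        "genes": len(genes),
        "transcripts": len(transcripts),
    }


# Made-up records, for illustration only.
example = [
    "# hypothetical header line\n",
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n',
    'chr2\tsrc\tCDS\t50\t90\t.\t-\t0\tgene_id "G2"; transcript_id "T3";\n',
]
print(summarize_gtf(example))
```

For real files, the same function can be given an open file handle, for example `summarize_gtf(open("annotation.gtf"))` with your own path. Comparing the gene and transcript totals of the two GTF files side by side shows quickly whether they describe different populations.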
Hi,
Another approach you can try is to use DESeq or edgeR from Bioconductor to assess differential expression. I personally like these two methods a lot better than Cuff*, mainly because they are much closer to the tried-and-true statistical methods developed for microarrays. I especially like how both methods let you test different factors. For example, if you are testing a treatment (drug or no drug) and a genotype (mutant vs. wild-type), you can find out for which genes the response to the drug depends on having a wild-type copy of the gene by testing an "interaction term."
Both methods start with simple counts: the numbers of reads overlapping annotated genes. There is probably a Galaxy workflow that can calculate counts of reads per gene, but I don't know whether Galaxy currently incorporates R/Bioconductor tools. If you can get Galaxy to calculate reads per gene, you can then download the file and run it through edgeR or DESeq.
R is free, but it does take some time to master. It is incredibly powerful, though, and well worth the effort! To get started with R, I recommend doing the free-of-charge O'Reilly Press "try R" tutorial, which is online here: http://tryr.codeschool.com/
I hope this will be helpful!
Best wishes,
Ann Loraine
------------------------------- Ann Loraine, Ph.D. Associate Professor Department of Bioinformatics and Genomics University of North Carolina at Charlotte North Carolina Research Campus 600 Laureate Way Kannapolis, NC 28081 704-250-5750 aloraine@uncc.edu http://www.transvar.org http://www.bioviz.org http://www.uncc.edu
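To make the "interaction term" in Ann's treatment-by-genotype example concrete, here is a small Python sketch of the quantity it measures: the drug effect in the mutant minus the drug effect in the wild type, on a log2 scale. The function name and the counts are invented for illustration. DESeq and edgeR estimate this kind of term inside a negative binomial model, with dispersion estimates and a significance test, none of which is reproduced here.

```python
import math


def interaction_log2fc(mean_counts):
    """Difference of differences on the log2 scale.

    mean_counts maps (genotype, treatment) to a mean normalized read
    count for one gene. A result of 0 means the drug changes expression
    by the same fold in both genotypes.
    """
    drug_effect_wildtype = math.log2(
        mean_counts[("wildtype", "drug")] / mean_counts[("wildtype", "no_drug")]
    )
    drug_effect_mutant = math.log2(
        mean_counts[("mutant", "drug")] / mean_counts[("mutant", "no_drug")]
    )
    return drug_effect_mutant - drug_effect_wildtype


# Made-up counts for one hypothetical gene: the drug induces it fourfold
# in the wild type but has no effect in the mutant.
gene_counts = {
    ("wildtype", "no_drug"): 100.0,
    ("wildtype", "drug"): 400.0,
    ("mutant", "no_drug"): 100.0,
    ("mutant", "drug"): 100.0,
}
print(interaction_log2fc(gene_counts))
```

A gene like this one, where the response to the drug disappears in the mutant, is the kind of gene an interaction test is meant to flag. A plain normal vs. tumor comparison has only one factor and therefore no interaction term to test.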
Hi Jennifer,
Thanks for your reply! My raw RNA-seq data was mapped to hg19 without a reference GTF on our local instance. To troubleshoot, I tried the following:
(1) Mapped the data again with TopHat to hg19 using the iGenome ensembl.GTF, then used Cuffdiff to find differentially expressed genes. There are still 250 significant genes.
(2) Mapped the data again with TopHat to hg19 without a reference GTF, then used Cufflinks with Homo_sapiens.GRCh37.69.gtf downloaded from ensembl.org. Same result: 250 significant genes.
(3) Mapped the data again with TopHat to hg19 without a reference GTF, then used Cufflinks with the RefSeq refFlat.GTF. The result is ~1000 significant genes.
(4) Mapped the data again with TopHat to hg19 without a reference GTF, then used Cufflinks with the iGenome refseq.GTF. The result is ~1000 significant genes.
However, I still need to confirm which release or version of the hg19 reference genome I am using. Do you think the different results are caused by mapping to a different hg19 genome? If so, how can I match an hg19 build to the correct reference GTF? I thought the choice of Ensembl or RefSeq would not affect the results at the Cuffdiff step. These reference GTF files (refFlat.GTF, iGenome refseq.GTF, or iGenome ensembl.GTF) should all represent complete transcripts.
Wei
-- Wei Liao Research Scientist, Brentwood Biomedical Research Institute 16111 Plummer St. Bldg 7, Rm D-122 North Hills, CA 91343 818-891-7711 ext 7645
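On Wei's question of how to match a genome build to a GTF, one check that may be worth trying is whether the sequence names in the GTF also appear in the genome FASTA. Nothing in this thread establishes that naming is the cause of the difference, so treat this as an assumption to rule out rather than a diagnosis. The function names and the example lines below are hypothetical.

```python
def gtf_seqnames(lines):
    """Collect the sequence names from column 1 of GTF records."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


def fasta_seqnames(lines):
    """Collect sequence names from FASTA headers (first token after '>')."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    }


def compare_seqnames(gtf_lines, fasta_lines):
    """Report which sequence names are shared and which appear on one side only."""
    gtf = gtf_seqnames(gtf_lines)
    fasta = fasta_seqnames(fasta_lines)
    return {
        "shared": sorted(gtf & fasta),
        "gtf_only": sorted(gtf - fasta),
        "fasta_only": sorted(fasta - gtf),
    }


# Made-up lines, for illustration only.
example_gtf = [
    '1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chrM\tsrc\texon\t10\t90\t.\t+\t.\tgene_id "G2"; transcript_id "T2";\n',
]
example_fasta = [">chr1 hypothetical description\n", "ACGT\n", ">chrM\n", "ACGT\n"]
print(compare_seqnames(example_gtf, example_fasta))
```

If most names land in `gtf_only`, the annotation and the genome the reads were mapped to use different naming, and records on those sequences cannot be matched to any alignments.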
participants (3)
- Jennifer Hillman-Jackson
- Loraine, Ann
- Wei Liao