Extract data and new genes
Hi Everyone,

I am working with *Aedes aegypti* and I obtained around 500 million reads (HiSeq2000, 50bp). After running the differential gene expression analysis with the usual packages (TopHat, Cufflinks, DESeq, etc.), I found a set of genes of interest, in addition to some functional groups of genes that I already knew I had to look at.

Now, just browsing the 4,758 supercontigs and my data in IGV from the Broad Institute (loading the genome and the SAM files from TopHat), I find a lot of potential new genes (hundreds or thousands of reads aligning to regions where there is no gene annotation). I also find new exons for some genes, or exons with different sizes.

I was thinking of doing a *de novo* assembly to find new transcripts and genes, but I was wondering if there is something else I could do. For example, maybe I could just extract those regions where thousands of reads align (new gene). I know that we can extract the sequence data for a specific transcript; is it possible to extract reads for regions without annotation, based only on the number of reads aligned? Maybe I could pool all the data (from a couple of sequencing lanes), align it back to the genome, and then proceed to gene annotation.

Another problem is that I am not sure how reliable an annotation based only on the HiSeq2000 data would be.

I would appreciate it if anyone has an idea or suggestion on how to tackle this problem. Maybe *de novo* assembly is the way to go.

Thank you.

Luciano

--
*Luciano Cosme*
PhD Candidate
Texas A&M Entomology
Vector Biology Research Group
www.lcosme.com
979 845 1885
cosme@tamu.edu
> I find a lot of potential new genes (hundreds or thousands of reads aligning to regions where there is no gene annotation),

This shouldn't be completely unexpected. High-coverage RNA-seq data is constantly revealing new exons/splicing/transcripts, even in well-annotated genomes.

> I also find new exons for some genes or exons with different sizes. I was thinking of doing a *de novo* assembly to find new transcripts and genes, but I was wondering if there is something else I could do.

My suggestion: do reference-guided assembly with Cufflinks; this will yield both existing and new transcripts.

> For example, maybe I could just extract those regions where thousands of reads align (new gene). I know that we can extract the sequence data for a specific transcript; is it possible to extract reads for regions without annotation, based only on the number of reads aligned?

You could subtract known genes from the Cufflinks assembly to get only novel transcripts.

Best,
J.
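One way to picture the "subtract known genes" step is the rough Python sketch below. It is not part of Cufflinks and not necessarily how Jeremy would do it: it assumes both the existing annotation and the Cufflinks assembly are available as GTF text with `exon` records carrying a `transcript_id` attribute, and it treats an assembled transcript as novel when none of its exons overlaps any annotated exon on the same supercontig (strand is ignored). The function names `read_exons` and `novel_transcripts` are made up for illustration.

```python
import re
from collections import defaultdict

# Matches the transcript_id attribute in the 9th GTF column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def read_exons(gtf_lines):
    """Yield (contig, start, end, transcript_id) for each exon record."""
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            # GTF coordinates are 1-based and inclusive.
            yield fields[0], int(fields[3]), int(fields[4]), match.group(1)


def novel_transcripts(known_gtf_lines, assembled_gtf_lines):
    """Return assembled transcript IDs with no exon overlapping a known exon.

    Assumption (for illustration only): overlap is tested per contig and
    ignores strand. A simple linear scan is used; large annotations would
    want an interval index instead.
    """
    known = defaultdict(list)
    for contig, start, end, _ in read_exons(known_gtf_lines):
        known[contig].append((start, end))

    overlaps_known = {}  # transcript_id -> True once any exon overlaps
    for contig, start, end, tid in read_exons(assembled_gtf_lines):
        hit = any(start <= k_end and k_start <= end
                  for k_start, k_end in known[contig])
        overlaps_known[tid] = overlaps_known.get(tid, False) or hit

    # Dicts keep insertion order, so IDs come back in file order.
    return [tid for tid, hit in overlaps_known.items() if not hit]
```

To try it, pass two open file handles, for example a hypothetical `known.gtf` (the existing annotation) and the `transcripts.gtf` written by Cufflinks. Note that under this definition a transcript lying entirely inside an annotated intron is reported as novel, while a known gene with an extra exon is not, so new exons of existing genes need a separate look. Cuffcompare, which ships with Cufflinks, also labels assembled transcripts with class codes relative to a reference annotation, which may be a more convenient route.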
Thanks Jeremy, I will do that before trying the *de novo* assembly.

Luciano

On Fri, May 18, 2012 at 1:44 PM, Jeremy Goecks <jeremy.goecks@emory.edu> wrote:

> You could subtract known genes from the Cufflinks assembly to get only novel transcripts.