Extract data and new genes

18 May 2012

Hi everyone,

I am working with *Aedes aegypti* and I obtained around 500 million
reads (HiSeq 2000, 50 bp). After running the differential gene expression
analysis with the usual packages (TopHat, Cufflinks, DESeq, etc.), I found
a set of genes of interest, in addition to some functional groups of genes
that I already knew I had to look at. Now, just browsing the 4,758
supercontigs and my data in IGV from the Broad Institute (loading the genome
and the SAM files from TopHat), I find a lot of potential new genes
(hundreds or thousands of reads aligning to regions where there is no gene
annotation). I also find new exons for some genes, or exons with different
sizes. I was thinking of doing a *de novo* assembly to find new transcripts
and genes, but I was wondering if there is something else I could do. For
example, maybe I could just extract the regions where thousands of reads
align (new genes). I know that we can extract the sequence data for a specific
transcript; is it possible to extract reads for regions without annotation,
based only on the number of reads aligned? Maybe I could pool all the data
(from a couple of sequencing lanes), align it back to the genome,
and then proceed to gene annotation. Another problem is that I am not sure
how reliable an annotation based only on HiSeq 2000 data would be.
I would appreciate it if anyone has an idea or suggestion on how to
tackle this problem. Maybe *de novo* assembly is the way to go.
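
To make the "extract regions based only on the number of reads aligned" idea concrete, here is a minimal sketch in plain Python. It is only an illustration under several assumptions: the function names, the 500 bp window, the 1000-read threshold, and the `annotated` dictionary of known gene intervals are all hypothetical choices, and read start positions are used as a rough stand-in for real coverage (spliced reads and read length are ignored).

```python
from collections import defaultdict


def read_starts_from_sam(lines):
    """Yield (reference_name, 1-based start) for each mapped read in SAM text."""
    for line in lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        flag, rname, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & 0x4 or rname == "*":  # 0x4 = read is unmapped
            continue
        yield rname, pos


def unannotated_hotspots(sam_lines, annotated, window=500, min_reads=1000):
    """Return merged (ref, start, end, reads) regions that have at least
    min_reads read starts per window and overlap no annotated interval.

    annotated: dict mapping reference name -> list of (start, end),
    1-based inclusive (hypothetical input, e.g. parsed from a GTF file).
    """
    # Count read starts per fixed-size window on each supercontig.
    counts = defaultdict(int)
    for rname, pos in read_starts_from_sam(sam_lines):
        counts[(rname, (pos - 1) // window)] += 1

    def overlaps_annotation(ref, start, end):
        return any(s <= end and start <= e for s, e in annotated.get(ref, []))

    regions = []
    for ref, idx in sorted(counts):
        n = counts[(ref, idx)]
        start, end = idx * window + 1, (idx + 1) * window
        if n < min_reads or overlaps_annotation(ref, start, end):
            continue
        # Merge with the previous region when the windows are adjacent.
        if regions and regions[-1][0] == ref and regions[-1][2] + 1 == start:
            prev = regions[-1]
            regions[-1] = (ref, prev[1], end, prev[3] + n)
        else:
            regions.append((ref, start, end, n))
    return regions
```

Each returned region could then be used as a `ref:start-end` argument to `samtools view` on a sorted, indexed BAM file to pull out the reads themselves. Whether a fixed window and a raw read-count threshold are good enough for gene discovery is an open question; this is a starting point, not a replacement for assembly.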

Thank you.
Luciano

-- 
*Luciano Cosme*

---------------------------------------------
PhD Candidate
Texas A&M Entomology
Vector Biology Research Group
www.lcosme.com
979 845 1885
cosme@tamu.edu
---------------------------------------------
