Hello,

Interesting genome. I see that SRA has some public RNA-seq data, but there isn't much else available. And your goal is to characterize the expression for observed phenotypes (linked to known genotypes)? If you use the Tuxedo suite after assembly (Trinity or other), differential expression of alternatively spliced isoforms is one of the discovery outputs.

From my experience (and others are welcome to add comments), most SNP differences (single-base polymorphisms) do not, in general, impact the global assembly of whole-genome data. Larger insertions/deletions are where you will observe differences. But that is DNA.

For transcript assembly, including from RNA-seq, novel isoforms per sample, and in particular rare events like SNPs, can become diluted when multiple samples are directly combined and assembled together straight de novo. Still, obtaining full-length cDNAs is certainly possible. And it has been done in just about the same way, with various types of RNA data, for a very long time (most of RefSeq started out that way). The downside here is that "the most common variant" can overwhelm the rest, but with a plant you might have that issue anyway, depending on ploidy. So, test for yourself. Genomes vary and the tools are interesting: "same way" is a gross generalization on my part; in their specifics the tools are very sophisticated.

And, most importantly, since you do have a reference genome to use as a guide (an invaluable tool, not to be ignored), be sure to incorporate it unless it is from a sample known to be significantly, unacceptably different from the wildtype. It sounds like its quality has been assessed as unacceptable for direct use as a reference genome for some reason (correct? Or do you just want to build up the cDNA set? Great project!). But the genome can still be utilized. Specifically, using it as an early-stage assembly guide will give you a huge advantage, in my opinion (some assemblers cluster the data first by mapping; you want this if possible). But again, you could try it both ways and check a few genes to see how the transcript profile worked out (against any knowns; comparative data is OK, and I always used it when I did this type of work). Also use what are, to me, the truth metrics of transcript assembly: how many singletons did you end up with (and what do they map to? Can they really be ignored?), and how many over-clustered "genes" did you get (interesting, sparser genes gobbled up by abundant housekeeping genes)? Under-clustered genes/transcripts and incomplete transcripts are other factors, but depending on how you set the parameters in Cufflinks, these may be less important, as long as the problem isn't pathological.
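
As a rough illustration of the two counts mentioned above, here is a minimal Python sketch. It is not tied to any particular assembler's output. It assumes you have already exported a table of (cluster or gene ID, member transcript or contig ID) pairs from your own results. The function name, the input layout, and the size threshold are all hypothetical choices, and "singleton" is taken here to mean a cluster with exactly one member.

```python
from collections import Counter


def cluster_summary(assignments, big_cluster_threshold=10):
    """Summarize cluster sizes from (cluster_id, member_id) pairs.

    Assumption: each pair says that one transcript/contig (member_id)
    was assigned to one cluster or "gene" (cluster_id).
    """
    # Count how many members each cluster received.
    sizes = Counter(cluster_id for cluster_id, _member in assignments)

    # Clusters with a single member: worth checking what they map to.
    singletons = sorted(c for c, n in sizes.items() if n == 1)

    # Unusually large clusters: candidates for over-clustering.
    # The threshold is arbitrary and should be tuned per data set.
    oversized = sorted(c for c, n in sizes.items() if n >= big_cluster_threshold)

    return {
        "n_clusters": len(sizes),
        "singletons": singletons,
        "oversized": oversized,
    }


# Toy, made-up input purely to show the call.
example = [
    ("g1", "t1"),
    ("g2", "t2"), ("g2", "t3"),
    ("g3", "t4"), ("g3", "t5"), ("g3", "t6"),
]
summary = cluster_summary(example, big_cluster_threshold=3)
print(summary)
```

A large cluster is only a candidate for inspection, not proof of over-clustering, so the flagged IDs are best used as a short list of genes to look at by eye when comparing the de novo and genome-guided runs.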

Many people will have advice about this, so ask, but also test. Looking at the results will tell you whether the path is right. I hope this helps a little bit!

Jen
Galaxy team


On 11/25/13 1:16 PM, miroslav.sotak wrote:

To whom it may concern

I would like to kindly ask whether anyone with experience in de novo transcriptomic analysis (no reference genome available) might give us some advice.
Our main question is how to create the best set of cDNA contigs, onto which we can map our RNA-seq reads for the analysis of differential expression. Currently, 4 larger sets of RNA-seq reads are available from different genotypes, as well as a draft genome assembly for one of the genotypes. We worry about the SNPs in different genotypes affecting the assembly if we combine all the RNA-seq datasets and use assemblers such as Trinity, Oases, or Velvet. Might it be better to use the draft genomic assembly to obtain cDNA contigs via TopHat/Cufflinks, using all available RNA-seq data or only the RNA-seq data from the same genotype as the genome draft?

Thank you in advance
Best wishes
Miro Sotak

-- 
Jennifer Hillman-Jackson
http://galaxyproject.org