Hello,
Interesting genome. I see that SRA has some public RNA-seq data, but
there isn't much else going on. And your goal is to characterize the
expression for observed phenotypes (linked to known genotypes)? If
you use the Tuxedo suite after assembly (Trinity or other),
differential expression of alternative splicing is one of the
discovery outputs.
From my experience (and others are welcome to add comments), most SNP
differences (single nucleotide polymorphisms) do not, in general,
impact the global assembly of whole genome data. Larger
insertions/deletions are where you will observe differences. But
that is DNA.
For transcriptome assembly, including from RNA-seq, novel isoforms per
sample, and in particular rare events like SNPs, can become diluted
when multiple samples are directly combined and assembled together
straight de novo. Still, obtaining full-length cDNAs is certainly
possible. And it has been done just about the same way, with various
types of RNA data, for a very long time (most of RefSeq started out
that way). The downside here is that "the most common variant" can
overwhelm the rest, but with a plant you might have that issue anyway
depending on ploidy. So, test for yourself. Genomes can vary and the
tools are so interesting. "Same way" is a gross generalization on
my part; in the specifics, the tools are very sophisticated.
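The dilution effect mentioned above comes down to simple arithmetic.
The following is only a sketch: the function name and the read counts
are made up for illustration, and it assumes reads from all samples
are pooled without any weighting.

```python
def pooled_variant_fraction(alt_reads, total_reads):
    """Fraction of reads carrying a variant once all samples are pooled.

    alt_reads[i]   -- reads supporting the variant in sample i (hypothetical)
    total_reads[i] -- all reads covering that position in sample i
    """
    if len(alt_reads) != len(total_reads):
        raise ValueError("need one variant count and one depth per sample")
    depth = sum(total_reads)
    if depth == 0:
        raise ValueError("no coverage at this position")
    return sum(alt_reads) / depth


# Hypothetical example: 4 genotypes with equal coverage (40 reads each);
# the variant is fixed in the first genotype and absent from the others.
alt = [40, 0, 0, 0]
depth = [40, 40, 40, 40]

per_sample = [a / d for a, d in zip(alt, depth)]
pooled = pooled_variant_fraction(alt, depth)

print(per_sample)  # [1.0, 0.0, 0.0, 0.0]
print(pooled)      # 0.25
```

A variant that is unambiguous within its own sample ends up as a
minority signal in the pooled data, which is why it is worth testing
the combined and per-genotype assemblies side by side.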
And, most importantly, as you do have a reference genome to use as a
guide (and that is really an invaluable tool not to be ignored), be
sure to incorporate it unless it is from a sample that is known to
be significantly, unacceptably, different from the wildtype. It
sounds like the quality has been assessed as unacceptable for direct
use as a reference genome for some reason (correct? Or do you
just want to build up the cDNA set? Great project!). But the genome
can still be utilized. Specifically, using it as an early-stage
assembly guide will give you a huge advantage, in my opinion (some
assemblers cluster the data first by mapping; you want this if
possible).
But again, you could try it both ways and check out a few
genes to see how the transcript profile worked out (versus any
knowns; comparative ones are OK, I always used these when I did this
type of work). Also use what are, to me, the truth metrics of
transcriptome assembly: how many singletons did you end up with (and
what do they map to? Can they really be ignored?), and how many
over-clustered "genes" did you get (interesting: sparser genes
gobbled up by abundant housekeeping genes). Under-clustered
genes/transcripts or incomplete transcripts are other factors, but
depending on how you set the parameters in Cufflinks, these may be
less important, if they aren't a pathological problem.
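As a rough illustration of the two metrics above, here is a minimal
sketch that counts singletons and suspiciously large clusters. It
assumes you have already exported a table of contig-to-cluster ("gene")
assignments from your assembler; the function name, the input layout,
and the size threshold are all hypothetical and would need adjusting
for your data.

```python
from collections import Counter


def cluster_metrics(contig_to_cluster, max_expected=20):
    """Summarize cluster sizes from a contig -> cluster ("gene") mapping.

    max_expected is an arbitrary cutoff (an assumption, not a standard):
    clusters with more members than this are flagged for manual review
    as possibly over-clustered.
    """
    sizes = Counter(contig_to_cluster.values())
    singletons = sorted(c for c, n in sizes.items() if n == 1)
    over_clustered = sorted(c for c, n in sizes.items() if n > max_expected)
    return {
        "n_clusters": len(sizes),
        "singletons": singletons,
        "over_clustered": over_clustered,
    }


# Hypothetical toy assignments: contig ID -> cluster ID.
assignments = {
    "c1": "g1", "c2": "g1",
    "c3": "g2",
    "c4": "g3", "c5": "g3", "c6": "g3", "c7": "g3",
}

metrics = cluster_metrics(assignments, max_expected=3)
print(metrics["n_clusters"])      # 3
print(metrics["singletons"])      # ['g2']
print(metrics["over_clustered"])  # ['g3']
```

The counts alone do not tell you whether a singleton is noise or a
real rare transcript, so the flagged IDs are meant as a shortlist to
map back to the genome and inspect by eye.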
Many people will have advice about this, so ask, but also test.
Looking at the results will inform you if the path is right. I hope
this helps a little bit!
Jen
Galaxy team
On 11/25/13 1:16 PM, miroslav.sotak wrote:
To whom it may concern
I would like to kindly ask whether anyone with experience in
de-novo transcriptomic analysis (no reference genome available)
might give us some advice.
Our main question is how to create the best set of cDNA contigs,
on which we can map our RNAseq reads for the analysis of
differential expression. Currently 4 larger sets of RNAseq
reads are available from different genotypes, as well as a draft
genome assembly for one of the genotypes. We worry about the SNPs
in different genotypes affecting the assembly if we combine all
the RNAseq datasets and use assemblers such as Trinity, Oases,
Velvet. Might it be better to use the draft genomic assembly to
obtain cDNA contigs using Tophat/cufflinks via all available
RNAseq data or only using the RNAseq data from the same genotype
as the genome draft?
Thank you in advance
Best wishes
Miro Sotak
--
Jennifer Hillman-Jackson
http://galaxyproject.org