Find novel genes
tophat_wrapper.py --min-intron-length $min_intron --max-intron-length $max_intron
#if $tophat_input.type == "fasta"
--file-format=fasta
#if $tophat_input.paired_end_fasta.type=="yes"
--paired-end=yes
--source-left=#slurp
#for $i in $tophat_input.paired_end_fasta.series_left
${i.source_left},#slurp
#end for
--source-right=#slurp
#for $i in $tophat_input.paired_end_fasta.series_right
${i.source_right},#slurp
#end for
#else
--paired-end=no
--source-left=#slurp
#for $i in $tophat_input.paired_end_fasta.series
${i.source},#slurp
#end for
#end if
#end if
#if $tophat_input.type == "fastqsolexa"
--file-format=fastq
#if $tophat_input.paired_end_fastqsolexa.type=="yes"
--paired-end=yes
--source-left=#slurp
#for $i in $tophat_input.paired_end_fastqsolexa.series_left
${i.source_left},#slurp
#end for
--source-right=#slurp
#for $i in $tophat_input.paired_end_fastqsolexa.series_right
${i.source_right},#slurp
#end for
#else
--paired-end=no
--source-left=#slurp
#for $i in $tophat_input.paired_end_fastqsolexa.series
${i.source},#slurp
#end for
#end if
#end if
#if $tophat_input.type == "fastqsanger"
--file-format=fastq
#if $tophat_input.paired_end_fastqsanger.type=="yes"
--paired-end=yes
--source-left=#slurp
#for $i in $tophat_input.paired_end_fastqsanger.series_left
${i.source_left},#slurp
#end for
--source-right=#slurp
#for $i in $tophat_input.paired_end_fastqsanger.series_right
${i.source_right},#slurp
#end for
#else
--paired-end=no
--source-left=#slurp
#for $i in $tophat_input.paired_end_fastqsanger.series
${i.source},#slurp
#end for
#end if
#end if
#if $tophat_input.type == "fastqsolexa":# --solexa1.3-quals yes
#else:# --solexa1.3-quals no
#end if
#if $annotation_input.type =="yes":#--annotation $annotation_file
--no-gff-juncs $annotation_input.no_gff_juncs.type
--no-novel-juncs $annotation_input.no_novel_juncs.type
#else:#--annotation none
#end if
--bowtie_index $bowtie_index
--coverage $coverage
--accepted_hits $accepted_hits
--junctions $junctions
--expr_file $expr_file
--log_report $log_report
--mate-inner-dist $mate_inner_dist
> $log_report 2>&1
**What it does**
**TopHat** is a program that aligns RNA-Seq reads to a genome in order to identify exon-exon splice junctions. It is built on the ultrafast short read mapping program Bowtie.
**How does TopHat find junctions?**
TopHat finds splice junctions without a reference annotation. By first mapping RNA-Seq reads to the genome, TopHat identifies potential exons, since many RNA-Seq reads will contiguously align to the genome. Using this initial mapping, TopHat builds a database of possible splice junctions, and then maps the reads against these junctions to confirm them.
Short read sequencing machines can currently produce reads 100bp or longer, but many exons are shorter than this, and so would be missed in the initial mapping. TopHat solves this problem by splitting all input reads into smaller segments, and then mapping them independently. The segment alignments are "glued" back together in a final step of the program to produce the end-to-end read alignments.
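The segment idea can be illustrated with a minimal Python sketch. This is not TopHat's own code: the function names `split_read` and `glue_segments` are hypothetical, and the segment length of 25 is only an illustrative default. The sketch shows how a read can be cut into fixed-size pieces that keep their offsets, so the pieces can be put back together in order.

```python
def split_read(sequence, segment_length=25):
    """Split a read into consecutive segments of at most segment_length bases.

    Returns (offset, segment) tuples so the segments can be rejoined later.
    The default of 25 is an illustrative assumption, not a documented value.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return [
        (offset, sequence[offset:offset + segment_length])
        for offset in range(0, len(sequence), segment_length)
    ]


def glue_segments(segments):
    """Reassemble the original read from (offset, segment) tuples."""
    return "".join(segment for _, segment in sorted(segments))


read = "ACGT" * 15  # a hypothetical 60 bp read
segments = split_read(read, segment_length=25)
print([(offset, len(segment)) for offset, segment in segments])
print(glue_segments(segments) == read)
```

In a real aligner each segment would be mapped independently before gluing; here the gluing step only demonstrates that the offsets are enough to restore the read.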
TopHat generates its database of possible splice junctions from three sources of evidence. The first source is pairings of 'coverage islands', which are distinct regions of piled-up reads in the initial mapping. Neighboring islands are often spliced together in the transcriptome, so TopHat looks for ways to join these with an intron. The second source is only used when TopHat is run with paired-end reads. When reads in a pair come from different exons of a transcript, they will generally be mapped far apart in the genome coordinate space. When this happens, TopHat tries to 'close' the gap between them by looking for subsequences of the genomic interval between mates with a total length about equal to the expected distance between mates. The 'introns' in this subsequence are added to the database. The third, and strongest, source of evidence for a splice junction is when two segments from the same read are mapped far apart, or when an internal segment fails to map. With all three sources, TopHat only considers putative junctions that correspond to 'GT-AG' introns. 'GC-AG' and 'AT-AC' introns are not currently found ab initio.
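The first source of evidence, pairing coverage islands, can be sketched roughly as follows. This is a simplified illustration and not TopHat's actual algorithm: `find_islands` and `candidate_introns` are hypothetical helpers, the depth list is made-up data, and the `min_intron` and `max_intron` arguments merely stand in for the minimum and maximum intron length settings of the tool.

```python
def find_islands(depth, min_depth=1):
    """Return half-open (start, end) intervals where depth >= min_depth."""
    islands = []
    start = None
    for pos, d in enumerate(depth):
        if d >= min_depth and start is None:
            start = pos  # an island begins
        elif d < min_depth and start is not None:
            islands.append((start, pos))  # the island ends before pos
            start = None
    if start is not None:
        islands.append((start, len(depth)))
    return islands


def candidate_introns(islands, min_intron, max_intron):
    """Pair each island with later islands whose gap fits the intron bounds."""
    candidates = []
    for i, (_, left_end) in enumerate(islands):
        for right_start, _ in islands[i + 1:]:
            gap = right_start - left_end
            if gap > max_intron:
                break  # islands are sorted, so later gaps are even larger
            if gap >= min_intron:
                candidates.append((left_end, right_start))
    return candidates


depth = [0, 3, 4, 2, 0, 0, 0, 0, 5, 6, 0, 0, 7, 1]  # made-up per-base coverage
islands = find_islands(depth)
print(islands)
print(candidate_introns(islands, min_intron=2, max_intron=6))
```

Each candidate is only a putative intron; it would still need to be confirmed by reads that align across the junction.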
------
**Input formats**
TopHat accepts files in FASTQ or FASTA format.
It is also possible to validate your own junctions with your RNA-Seq data via a GFF3 file. Note that the chromosome names in the files provided with the options below must match the names in the Bowtie index. These names are case-sensitive.
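A quick way to catch mismatched chromosome names before a run is to compare the first column of the GFF3 file with the sequence names in the index. The sketch below assumes you already have the index names as a list of strings; `unmatched_chromosomes` is a hypothetical helper, and the example lines are made up.

```python
def unmatched_chromosomes(gff3_lines, index_names):
    """Return sorted GFF3 sequence names that are absent from index_names.

    The comparison is case-sensitive, so 'Chr2' does not match 'chr2'.
    """
    known = set(index_names)
    missing = set()
    for line in gff3_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        seqid = line.split("\t", 1)[0]
        if seqid not in known:
            missing.add(seqid)
    return sorted(missing)


example_gff3 = [
    "##gff-version 3",
    "chr1\t.\texon\t100\t200\t.\t+\t.\tID=exon1",
    "Chr2\t.\texon\t300\t400\t.\t-\t.\tID=exon2",
]
print(unmatched_chromosomes(example_gff3, ["chr1", "chr2"]))
```

An empty result means every annotated sequence name has an exact counterpart in the index.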
------
**Outputs**
The tophat script produces a number of files. Most of these files are internal, intermediate files that are generated for use within the pipeline. The output files you will likely want to look at are:
1. accepted_hits.sam. A list of read alignments in SAM format. SAM is a compact short read alignment format that is increasingly being adopted. The formal specification is published by the SAMtools project.
2. coverage.wig. A UCSC BedGraph wigglegram track, showing the depth of coverage at each position, including the spliced read alignments.
3. junctions.bed. A UCSC BED track of junctions reported by TopHat. Each junction consists of two connected BED blocks, where each block is as long as the maximal overhang of any read spanning the junction. The score is the number of alignments spanning the junction.
4. 'GFF3_file'.expr. If you supplied TopHat with annotations, the RPKM and MeND values for each gene appear here.
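To recover the intron coordinates from a junctions.bed entry, the two BED blocks can be decoded as in the sketch below. It assumes the standard 12-column UCSC BED layout with exactly two blocks per junction, as described above; `parse_junction` is a hypothetical helper and the example line is made up. Any `track` header line in the file should be skipped before calling it.

```python
def parse_junction(bed_line):
    """Decode one 12-column BED junction line into intron coordinates.

    The intron is the gap between the two blocks, returned as a
    0-based, half-open interval (intron_start, intron_end).
    """
    fields = bed_line.rstrip("\n").split("\t")
    chrom_start = int(fields[1])
    # blockSizes and blockStarts may carry a trailing comma in BED files
    block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    return {
        "chrom": fields[0],
        "strand": fields[5],
        "score": int(fields[4]),  # number of alignments spanning the junction
        "intron_start": chrom_start + block_sizes[0],
        "intron_end": chrom_start + block_starts[1],
        "overhangs": (block_sizes[0], block_sizes[1]),
    }


example = "chr1\t100\t400\tJUNC00000001\t12\t+\t100\t400\t255,0,0\t2\t30,40\t0,260"
junction = parse_junction(example)
print(junction["intron_start"], junction["intron_end"], junction["score"])
```

Filtering on the score field is a simple way to keep only junctions supported by several alignments.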
------
**Version**
The currently installed version is 1.0.10
------
**Support**
To get further information on the tool visit:
http://tophat.cbcb.umd.edu/manual.html
If you have any problems running this tool, feel free to contact me (matthias.dodt@mdc-berlin.de)