Hello Jianpeng,

The numbers 87 and 93 are the actual maximum lengths of the aligned regions on either side of the junction. If you want to examine your paired-end data statistically, the "NGS: Picard (beta)" tool group has several tool options. However, examining the track at the gene/transcript level for a few well-characterized gene bounds is really the best way to understand how the file describes the data. A browser with your tracks loaded (Trackster or UCSC), the text data files, and the Cufflinks manual/FAQ will likely address most of your questions or at least will be a good orientation. The visual portion of this helps a great deal.

http://cufflinks.cbcb.umd.edu/faq.html

To address the visualization at UCSC, I can point you to their User Guide:
http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html
and contact mailing list:
http://genome.ucsc.edu/contacts.html

Good luck with your project. Please remember to keep questions to the Galaxy team on our mailing lists so that our entire team and community can contribute/benefit.

Thanks!

Jen
Galaxy team

-------- Original Message --------
Subject: RE: [galaxy-dev] Tophat output
Date: Wed, 18 Apr 2012 15:23:43 +0000
From: Xu, Jianpeng <jianpeng.xu@emory.edu>
To: Jennifer Jackson <jen@bx.psu.edu>

Hi Jennifer,

In the history, I have the splice junction file. I click it to show the "display at UCSC main" link, and then click that link, which opens the UCSC Genome Browser. Since this is the first time I have visualized splice junctions, can you give me more instructions on how to view them in the UCSC Genome Browser?

Thanks,
Jianpeng

On 4/18/12 7:58 AM, Xu, Jianpeng wrote:
Thanks a lot, Jennifer. It is very useful and helpful. I got the result using Paired-end reads. The read length for both ends is 100 bp.
chr20 199821 204701 JUNC00000001 17 - 199821 204701 255,0,0 2 87,93 0,4787
Since the read length is 100 bp, why are 87 and 93 less than 100?
Below is a single-end read result:
chr11 60277777 60278396 JUNC00000001 1 + 60277777 60278396 255,0,0 2 22,28 0,591
Can you explain a little bit more?
Jianpeng
________________________________________
From: Jennifer Jackson [jen@bx.psu.edu]
Sent: Wednesday, April 18, 2012 2:56 AM
To: Xu, Jianpeng
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] Tophat output
Hello Jianpeng,
The output files from TopHat are described on the TopHat tool form:
--- quote ---
Outputs

Tophat produces two output files:

junctions -- A UCSC BED track of junctions reported by TopHat. Each junction consists of two connected BED blocks, where each block is as long as the maximal overhang of any read spanning the junction. The score is the number of alignments spanning the junction.
accepted_hits -- A list of read alignments in BAM format.

Two other possible outputs, depending on the options you choose, are insertions and deletions, both of which are in BED format.
-------------
BED format is described in the Galaxy wiki, which includes links to the UCSC BED format description (they authored the format). http://wiki.g2.bx.psu.edu/Learn/Datatypes#Bed
Two important rules to remember about BED format:
rule #1: coordinate data is already reported with respect to the (+) strand
rule #2: "start" is defined as the smallest coordinate, "end" is defined as the largest coordinate, due to the rule #1.
BED coordinates are 0-based and half-open: the "start" is counted from 0 and is included in the feature, while the "end" is not included. Browsers display positions as 1-based, so you'll need to add "1" to any "start" coordinate in a .bed file to locate it in a display application. The two will not and should not match. Because the "end" is excluded, it has the same value as a 1-based, inclusive end, so it will match between data files and display applications.
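As a small illustration of the start/end convention, here is a sketch in Python. The function names are made up for this example and are not part of Galaxy or any UCSC tool.

```python
def bed_to_display(chrom, start, end):
    """Turn a BED interval (0-based start, end not included) into the
    1-based, inclusive position string a genome browser shows."""
    # Only the start shifts by one; the end value stays the same.
    return f"{chrom}:{start + 1}-{end}"


def bed_length(start, end):
    """Length in bases of a BED interval; no +1 is needed."""
    return end - start


print(bed_to_display("chr20", 199821, 204701))
print(bed_length(199821, 204701))
```

For the first junction in the file this gives the browser position chr20:199822-204701 and a feature length of 4880 bases.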
------------- Using the first data row as an example and this information, we can tell that:
chr20 199821 204701 JUNC00000001 17 - 199821 204701 255,0,0 2 87,93 0,4787
* column 5 is 'score', or 'number of alignments spanning the junction'. In this case, "17" alignments.
* column 11 is the blockSizes, or 'read maximal overhang' of the junctions (max alignment length). The first is 87 bases, the second 93 bases.
* column 12 is the blockStarts, or 'overhang start' of the junctions (alignment start). The first is 0, the second 4787 bases. Each value is an offset from the start coordinate in column 2, so the first is always 0. The second is the first block size plus the length of the gap between the two blocks (the presumed 'intron'), which makes the gap 4787 - 87 = 4700 bases.
Some calculations can be done with these numbers with respect to the overall position of the junction already defined in columns 1,2,3 (chrom, start, end): chr20:199821-204701 (-) to locate the predicted splice sites, the flanking aligned regions, and the (presumed) 'intron'. This example is a bit tricky because the junction is on the (-) strand. Following rule #1, the block offsets are still counted from the start coordinate on the (+) strand. The strand only tells you that the transcript reads from the end coordinate back toward the start, so the second block is the upstream one. If this sounds confusing, that's because it is! When you visualize the data the concept will make more sense and it is definitely worth learning about.
Brief explanation: The first blockStart is 0, which literally means that the first block starts at the very beginning of the feature (0-based), at chr20 position 199,821. This block continues for 87 bases, to 199821 + 87 = 199908, and then the splice begins. The second block starts at 199821 + 4787 = 204608, which is where the splice ends. This block continues for 93 bases, which places its end at 204608 + 93 = 204701, the same as the reported end of the junction. And, it all adds up. The gap between the two blocks, 204608 - 199908 = 4700 bases, is the presumed 'intron'.
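As a cross-check on the arithmetic, here is a minimal Python sketch that computes the block coordinates for one junction line. It assumes the UCSC BED definition, in which blockStarts are offsets from the start coordinate in column 2 regardless of strand. The function name and the keys of the returned dictionary are made up for this example.

```python
def junction_blocks(bed_line):
    """Compute block and gap coordinates for one 12-column BED junction line.

    Coordinates are returned as (start, end) pairs in BED convention:
    0-based start, end not included.
    """
    fields = bed_line.split()
    chrom = fields[0]
    chrom_start = int(fields[1])
    chrom_end = int(fields[2])
    score = int(fields[4])   # number of alignments spanning the junction
    strand = fields[5]
    # Columns 11 and 12 are comma-separated lists; tolerate a trailing comma.
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    offsets = [int(x) for x in fields[11].rstrip(",").split(",")]

    # Each block begins at chrom_start + offset and runs for its size.
    blocks = [(chrom_start + off, chrom_start + off + size)
              for off, size in zip(offsets, sizes)]
    # The gap between the two blocks is the presumed intron.
    gap = (blocks[0][1], blocks[1][0])
    return {
        "chrom": chrom,
        "strand": strand,
        "score": score,
        "blocks": blocks,
        "gap": gap,
        "gap_length": gap[1] - gap[0],
        "ends_match": blocks[-1][1] == chrom_end,
    }


line = "chr20 199821 204701 JUNC00000001 17 - 199821 204701 255,0,0 2 87,93 0,4787"
result = junction_blocks(line)
print(result["blocks"])      # [(199821, 199908), (204608, 204701)]
print(result["gap_length"])  # 4700
```

The last block ends exactly at the end coordinate in column 3, which is a quick sanity check that the sizes and offsets were read correctly.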
Trackster would be a good place to start for "Visualization" (use the top menu bar link). The dataset can also be saved as a regular .bed file and loaded as a custom track into the UCSC Genome Browser (if the direct link is not fully configured yet).
Hopefully this helps,
Jen Galaxy team
On 4/17/12 7:20 AM, Xu, Jianpeng wrote:
I have installed a local Galaxy. I used TopHat to do the RNA-seq alignment and got an output file: splice junctions in BED format.
I cannot understand it clearly. What do the numbers 17, 14, ... in column 5 mean? What does 87,93 mean? What does 0,4787 mean? Can you explain a little bit to me? Which tool can be used to view this file?
track name=junctions description="TopHat junctions"
chr20 199821 204701 JUNC00000001 17 - 199821 204701 255,0,0 2 87,93 0,4787
chr20 204631 205520 JUNC00000002 14 - 204631 205520 255,0,0 2 96,87 0,802
chr20 205428 205775 JUNC00000003 9 - 205428 205775 255,0,0 2 92,91 0,256
chr20 205699 205958 JUNC00000004 15 - 205699 205958 255,0,0 2 87,92 0,167
chr20 205929 207067 JUNC00000005 31 - 205929 207067 255,0,0 2 95,97 0,1041
chr20 206977 207909 JUNC00000006 19 - 206977 207909 255,0,0 2 93,97 0,835
chr20 207884 212679 JUNC00000007 15 - 207884 212679 255,0,0 2 87,76 0,4719
chr20 207910 218238 JUNC00000008 1 - 207910 218238 255,0,0 2 61,39 0,10289
chr20 212628 218293 JUNC00000009 28 - 212628 218293 255,0,0 2 94,94 0,5571
-- Jennifer Jackson http://galaxyproject.org