Dear sir,
I need an up-to-date file of the UCSC 17-way multiple alignment for the
upstream 1000 bp of RefSeq genes. UCSC has a download (http://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz17way/upstream1000.maf.gz). Each record in this file begins with a human RefSeq ID and contains the multiz alignment for the
upstream 1000 bp of that RefSeq. It exactly matches my requirement, but
it is a little old. It was built in 2007. I need a new version of such
a file.
I asked the UCSC group, and they told me that I can't get such a file
using the UCSC Table Browser, but that I can get it from your Galaxy. I tried
several times but failed. Can you tell me how to get a 17-way (or just human and mouse) alignment
file for the upstream 1000 bp of all RefSeq genes?
Thanks very much.
Anyuan Guo
================
Anyuan Guo Ph.D.
Postdoc Fellow
Virginia Institute for Psychiatric and Behavioral Genetics
Virginia Commonwealth University
P.O. Box 980126
Richmond, VA 23298-0126, USA
Email: aguo@vcu.edu
Brooke Rhead wrote:
Hi Anyuan,
I don't believe it is possible to retain the RefSeq ID in this case
when using the Table Browser. However, I think that Galaxy has this
capacity, either by doing the intersection from scratch using their
tools, or by joining your MAF with your custom track based on the
genome coordinates.
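
The coordinate-based join mentioned above can be sketched in Python as follows. This is only an illustration, not a Galaxy tool or a UCSC utility: the function names, the region tuples, and the accession "NM_EXAMPLE" are hypothetical, and it assumes each MAF "s" line carries its sequence text on the same line. It reports which upstream regions a block overlaps; it does not rewrite block coordinates to be relative to the transcript, as the download file does.

```python
def parse_s_line(line):
    """Parse one MAF 's' line (src, start, size, strand, srcSize, text) into a dict."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "s":
        raise ValueError("not a complete MAF 's' line: %r" % line)
    _, src, start, size, strand, src_size, text = fields
    return {
        "src": src,
        "start": int(start),
        "size": int(size),
        "strand": strand,
        "src_size": int(src_size),
        "text": text,
    }


def forward_interval(row):
    """Return (chrom, start, end) on the forward strand, 0-based and half-open."""
    chrom = row["src"].split(".", 1)[-1]  # "hg18.chr1" -> "chr1"
    start = row["start"]
    if row["strand"] == "-":
        # In MAF, minus-strand starts count from the end of the source sequence.
        start = row["src_size"] - row["start"] - row["size"]
    return chrom, start, start + row["size"]


def overlapping_names(row, regions):
    """Names of the (chrom, start, end, name) regions that overlap the MAF row."""
    chrom, start, end = forward_interval(row)
    return [name for (c, s, e, name) in regions
            if c == chrom and s < end and start < e]


# Hypothetical upstream regions, e.g. loaded from the BED custom track.
regions = [("chr1", 14000, 15000, "NM_EXAMPLE")]
row = parse_s_line("s hg18.chr1 14754 99 + 247249719 CTGTGGGTCGGAGC")
print(overlapping_names(row, regions))
```

The linear scan over all regions is fine for a small test but would be slow genome-wide; sorting the regions per chromosome or using an interval index would be the usual next step.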
Galaxy has screencasts:
http://galaxy.psu.edu/screencasts.html
and a wiki:
http://g2.trac.bx.psu.edu/
This screencast might be particularly helpful:
http://screencast.g2.bx.psu.edu/galaxy/MAF_manipulation/
If you have more questions about how to accomplish your task using
Galaxy, you can contact them at galaxy-user@bx.psu.edu.
Good luck with your research.
--
Brooke Rhead
UCSC Genome Bioinformatics Group
On 12/10/08 15:17, Anyuan Guo wrote:
Dear Brooke,
Thanks very much. I learned a lot about creating custom tracks from
your email. Following your instructions, I created a custom track for the
upstream 1000 bp of RefSeq Genes, intersected it with 17-way Cons, and
downloaded a ~76 Mb compressed file. But I found that the records in this
file do not begin with the RefSeq ID (NM_xxxx). The following are the first 4 lines of
the file.
##maf version=1
a score=-55252.000000
s hg18.chr1 14754 99 + 247249719
CTGTGGGTCGGAGCCGGAGCGTCAGAGC---------CACCCACGACCACCGGCACGCC----CCCACCACA-GGGCAGCGTGG-TGTTGAGACAAC------A
In fact, I need a file whose records begin with the RefSeq ID. The downloaded MAF
file
(http://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz17way/upstream1000.maf.gz)
exactly matches my requirement. But because some RefSeq sequences have been
updated, the downloaded file is out of date.
The following are the first 4 lines of the downloaded file, which is the
format I need.
##maf version=1 scoring=zero
a score=0.000000
s NM_198943 0 1000 + 1000
GCATTTTAAACCCAAGTG----AAATCTCCTAGG----------CCCTTCATGCCACACTCA-----TCCATCCCTACCTAC--TTGTGTTGCAACCAAGGGCCCCAC
How can I get the up-to-date version of this download file?
Thanks.
Anyuan
Brooke Rhead wrote:
Hello Anyuan,
The reason that the sequence is different via the download file and the
Table Browser is that the sequence associated with NM_014223 at RefSeq
has changed since the download file was made. The items in the RefSeq
Genes track are updated daily; the download files are generally only
made once.
You can see the revision history for any GenBank accession at NCBI:
http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi?val=NM_014223
The download file was last updated on 7-7-2007. I tried blatting the
NM_014223 sequence from the "Jun 3 2007 1:10 PM" update to the hg18
assembly, and the sequence aligned starting at the genomic coordinate
chr1:40,929,952. The upstream sequence from the file you downloaded
corresponds to the 1,000 bases upstream of that base.
You can get an up-to-date version of the download file by creating it
yourself with the Table Browser. First, make a custom track of the
upstream regions of RefSeq Genes. If you select the RefSeq Genes track
in the Table Browser and choose "output format: custom track", you will
be presented with an option to create one BED record per region that is
"Upstream by ___ bases". Enter 1,000 or 2,000 in this box and hit "get
custom track in genome browser". You should see a new custom track
containing blocks representing regions upstream of all RefSeq Genes.
Now you can intersect your new custom track with the multiz alignment
in the Conservation track to get only the upstream regions. To do this
step, select the 17-way (or 28-way) Conservation track in the Table
Browser. Select the table 'multiz17way' and region: genome. Hit the
"intersection: create" button and select your custom track. Choose the
option for "Base-pair-wise intersection (AND) of 17-Way Cons and
upstream regions from refGene" and hit submit. Back on the main Table
Browser page, select "output format: MAF". The size of the file you
will be creating is quite large (76 Mb compressed for 1,000 base
regions). I suggest entering a name for the file and selecting the
option to get a gzip compressed version of it. Hit "get output". You
should end up with a MAF file that contains only the regions upstream
of RefSeq Genes.
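
For reference, the "Upstream by ___ bases" step can be approximated outside the Table Browser with a short Python sketch. It assumes 0-based, half-open transcript coordinates (as in BED and the refGene txStart/txEnd columns); the function name and the accession "NM_EXAMPLE" are hypothetical, and the Table Browser's own handling of edge cases may differ.

```python
def upstream_region(chrom, tx_start, tx_end, strand, name, size=1000):
    """Return a BED-style tuple for the `size` bases upstream of a transcript.

    Coordinates are 0-based and half-open. "Upstream" is strand-aware:
    before tx_start on the + strand, after tx_end on the - strand.
    """
    if strand == "+":
        start, end = max(0, tx_start - size), tx_start
    elif strand == "-":
        # Not clipped to the chromosome length; that would need chromosome sizes.
        start, end = tx_end, tx_end + size
    else:
        raise ValueError("strand must be '+' or '-', got %r" % strand)
    return (chrom, start, end, name, 0, strand)


# Hypothetical transcripts, not real RefSeq records.
print(upstream_region("chr1", 5000, 9000, "+", "NM_EXAMPLE"))
print(upstream_region("chr1", 5000, 9000, "-", "NM_EXAMPLE"))
```

Keeping the accession in the name column is what makes it possible to relate each region back to its RefSeq ID later.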
You may also be interested in the tools for working with MAF alignments
at Galaxy: http://galaxy.psu.edu/ . Galaxy is run by our collaborators
at Penn State and extends the functionality of the Table Browser. For
instance, there is a tool to filter any undesired species from a MAF
file, leaving only the species of interest to you.
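
As a rough illustration of that kind of species filtering (this is not the Galaxy tool itself), the following Python sketch keeps only the rows for a chosen set of assemblies. The function name is hypothetical, and the assembly names in the example (hg18, mm8, canFam2) are used for illustration only. It does not remove alignment columns that become all gaps, nor blocks left with a single species.

```python
def filter_maf_species(lines, keep):
    """Yield MAF lines, dropping per-species rows whose assembly is not in `keep`.

    Header, 'a', and blank lines pass through unchanged. Rows of type
    s, i, e, or q are kept only if the prefix of their source name
    (the part before the first '.') is in `keep`.
    """
    for line in lines:
        fields = line.split()
        if fields and fields[0] in ("s", "i", "e", "q") and len(fields) > 1:
            assembly = fields[1].split(".", 1)[0]  # "mm8.chr4" -> "mm8"
            if assembly not in keep:
                continue
        yield line


example = [
    "##maf version=1",
    "a score=1.0",
    "s hg18.chr1 0 4 + 100 ACGT",
    "s mm8.chr4 10 4 + 200 ACGT",
    "s canFam2.chr2 5 4 + 300 ACGT",
    "",
]
for kept in filter_maf_species(example, {"hg18", "mm8"}):
    print(kept)
```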
I hope this is helpful. If you have further questions, please feel
free to contact us again at genome@soe.ucsc.edu. If you have questions
specific to Galaxy, their helpdesk email address is
galaxy-user@bx.psu.edu.
--
Brooke Rhead
UCSC Genome Bioinformatics Group
Subject: question or bug about UCSC genome browser sequence
From: Anyuan Guo <aguo@vcu.edu>
Date: Mon, 17 Nov 2008 10:54:21 -0800
To: genome@soe.ucsc.edu
Dear author,
Thank you for providing the wonderful UCSC Genome Browser database and
website.
I have a question about the sequence in it.
I downloaded the human upstream 1000 bp multiz alignment file
from
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz17way/upstream1000.maf.gz
and checked my sequence ID, NM_014223.
I can find the upstream 1000 bp sequence of this RefSeq gene in the
downloaded multiz alignment file.
I can also search for this ID in the Genome Browser and get the upstream
1000 bp using the "DNA" or "Tables" menu at the top of the Genome Browser
page.
But I find that these two upstream 1000 bp sequences are totally
different. I think the one from the Genome Browser is right.
However, I need not just the upstream 1000 bp sequence but also its
alignment with the mouse sequence.
Can I get the sequence alignment between human and mouse for
all the RefSeq genes and the upstream 1000 or 2000 bp of these genes? Where
can I find it?
I think such ortholog gene alignments (including upstream
regulatory sequence alignments) between two popular genomes would be very
useful.
Thanks.
Anyuan