Dear Brooke,
Thanks very much. I learned a lot about creating custom tracks
from your email. Following your instructions, I created a custom
track for the upstream 1000 bp of RefSeq Genes, intersected it
with 17-way Cons, and downloaded a ~76 Mb compressed file. But I
found that the records in the file do not begin with a RefSeq ID
(NM_xxxx). The following are the first 4 lines of the file.
##maf version=1
a score=-55252.000000
s hg18.chr1 14754 99 + 247249719
CTGTGGGTCGGAGCCGGAGCGTCAGAGC---------CACCCACGACCACCGGCACGCC----
CCCACCACA-GGGCAGCGTGG-TGTTGAGACAAC------A
How can I get the up-to-date version of this download file?
Thanks.
Hello Anyuan,
The sequence differs between the download file and the Table
Browser because the sequence associated with NM_014223 at RefSeq
has changed since the download file was made. The items in the
RefSeq Genes track are updated daily; the download files are
generally made only once.
You can see the revision history for any GenBank accession at NCBI:
http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi?val=NM_014223
The download file was last updated on 7-7-2007. I tried blatting
the NM_014223 sequence from the "Jun 3 2007 1:10 PM" update to
the hg18 assembly, and the sequence aligned starting at the
genomic coordinate chr1:40,929,952. The upstream sequence from
the file you downloaded corresponds to the 1,000 bases upstream
of that base.
You can get an up-to-date version of the download file by
creating it yourself with the Table Browser. First, make a custom
track of the upstream regions of RefSeq Genes. If you select the
RefSeq Genes track in the Table Browser and choose "output
format: custom track", you will be presented with an option to
create one BED record per region that is "Upstream by ___
bases". Enter 1,000 or 2,000 in this box and hit "get custom
track in genome browser". You should see a new custom track
containing blocks representing regions upstream of all RefSeq
Genes.
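For reference, the "Upstream by ___ bases" option amounts to
simple strand-aware coordinate arithmetic. The Python sketch below
illustrates the idea; it is not the Table Browser's own code. It
assumes 0-based, half-open BED-style coordinates, and the function
name upstream_region and the example gene name are hypothetical.
Clipping at the start of the chromosome is an assumption, and no
clipping is done at the chromosome end.

```python
def upstream_region(chrom, tx_start, tx_end, strand, name, size=1000):
    """Return a BED6 tuple covering `size` bases upstream of a transcript.

    Coordinates are assumed to be 0-based and half-open, as in BED.
    "Upstream" depends on strand: before tx_start on "+", after
    tx_end on "-".
    """
    if strand == "+":
        # Clip at 0 so genes near the chromosome start stay valid.
        start, end = max(0, tx_start - size), tx_start
    elif strand == "-":
        # Not clipped at the chromosome end (its length is not known here).
        start, end = tx_end, tx_end + size
    else:
        raise ValueError("strand must be '+' or '-'")
    return (chrom, start, end, name, 0, strand)


# Hypothetical transcript, for illustration only.
plus = upstream_region("chr1", 5000, 9000, "+", "NM_hypothetical")
minus = upstream_region("chr1", 5000, 9000, "-", "NM_hypothetical")
print("\t".join(map(str, plus)))
print("\t".join(map(str, minus)))
```

Note that a gene on the minus strand has its upstream region at
higher coordinates than the gene itself.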
Now you can intersect your new custom track with the multiz
alignment in the Conservation track to get only the upstream
regions. To do this step, select the 17-way (or 28-way)
Conservation track in the Table Browser. Select the table
'multiz17way' and region: genome. Hit the "intersection: create"
button and select your custom track. Choose the option for "Base-
pair-wise intersection (AND) of 17-Way Cons and upstream regions
from refGene" and hit submit. Back on the main Table Browser
page, select "output format: MAF". The size of the file you will
be creating is quite large (76 Mb compressed for 1,000 base
regions). I suggest entering a name for the file and selecting
the option to get a gzip compressed version of it. Hit "get
output". You should end up with a MAF file that contains only
the regions upstream of RefSeq Genes.
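MAF blocks are labeled by genomic position rather than by RefSeq
ID, so to find out which gene a block belongs to you need to
compare the block's coordinates with your upstream regions. Below
is a minimal Python sketch of that lookup. It assumes the first
"s" line of each block is the reference (hg18) row and that your
regions are available as (chrom, start, end, name) tuples with
0-based, half-open coordinates. The function names, the region
name NM_hypothetical, and the mouse row and shortened sequence
text in the sample are made up for illustration.

```python
import io


def read_maf_blocks(handle):
    """Yield each alignment block as a list of split 's' lines."""
    block = []
    for line in handle:
        line = line.rstrip("\n")
        if line == "a" or line.startswith("a "):
            # An 'a' line starts a new block; emit the previous one.
            if block:
                yield block
            block = []
        elif line.startswith("s "):
            block.append(line.split())
    if block:
        yield block


def reference_interval(block):
    """Return (chrom, start, end) on the forward strand for the first row."""
    _, src, start, size, strand, src_size, _ = block[0]
    start, size, src_size = int(start), int(size), int(src_size)
    if strand == "-":
        # MAF minus-strand starts count from the reverse-complemented source.
        start = src_size - (start + size)
    chrom = src.split(".", 1)[1]  # "hg18.chr1" -> "chr1"
    return chrom, start, start + size


def names_for_block(block, regions):
    """List the names of all regions overlapping the block's reference row."""
    chrom, start, end = reference_interval(block)
    return [name for c, s, e, name in regions
            if c == chrom and s < end and start < e]


# Sample: header and hg18 coordinates as in a real file; the rest is made up.
SAMPLE = """##maf version=1
a score=-55252.000000
s hg18.chr1 14754 99 + 247249719 CTGTGG
s mm8.chr4 1000 6 + 155029701 CTGTGG
"""
REGIONS = [("chr1", 14700, 15700, "NM_hypothetical")]

for blk in read_maf_blocks(io.StringIO(SAMPLE)):
    print(reference_interval(blk), names_for_block(blk, REGIONS))
```

The linear scan over regions is fine for a quick check; for all
RefSeq Genes you would probably want to sort or index the regions
by chromosome first.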
You may also be interested in the tools for working with MAF
alignments at Galaxy: http://galaxy.psu.edu/ . Galaxy is run by
our collaborators at Penn State and extends the functionality of
the Table Browser. For instance, there is a tool to filter any
undesired species from a MAF file, leaving only the species of
interest to you.
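If you would rather do this kind of filtering locally, the idea
can be sketched in a few lines of Python. This is not Galaxy's
implementation, only an illustration: it drops every "s", "i",
"e" and "q" line whose source belongs to a species you did not
ask for, and passes all other lines through unchanged. The
function name filter_maf_species and the sample rows are
hypothetical. Alignment columns that become all gaps after the
filtering are left as they are.

```python
def filter_maf_species(lines, keep):
    """Yield MAF lines, skipping rows for species not in `keep`.

    `keep` is a set of assembly names such as {"hg18", "mm8"}.
    Lines of type s, i, e and q carry a source like "hg18.chr1"
    in their second field; everything else is passed through.
    """
    for line in lines:
        fields = line.split()
        if fields and fields[0] in ("s", "i", "e", "q") and len(fields) > 1:
            species = fields[1].split(".", 1)[0]
            if species not in keep:
                continue
        yield line


# Made-up block with three species, for illustration only.
sample = [
    "##maf version=1",
    "a score=100.0",
    "s hg18.chr1 100 6 + 247249719 ACGTAC",
    "s mm8.chr4 200 6 + 155029701 ACGTAC",
    "s canFam2.chr2 300 6 - 88137694 ACG-AC",
    "",
]
human_mouse = list(filter_maf_species(sample, {"hg18", "mm8"}))
print("\n".join(human_mouse))
```

Keeping only hg18 and mm8 in this way would leave a human-mouse
alignment of the upstream regions.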
I hope this is helpful. If you have further questions, please
feel free to contact us again at genome@soe.ucsc.edu. If you
have questions specific to Galaxy, their helpdesk email address
is galaxy-user@bx.psu.edu.
--
Brooke Rhead
UCSC Genome Bioinformatics Group
Subject: question or bug about UCSC genome browser sequence
From: Anyuan Guo <aguo@vcu.edu>
Date: Mon, 17 Nov 2008 10:54:21 -0800
To: genome@soe.ucsc.edu
Dear author,
Thanks for providing the wonderful UCSC Genome Browser database
and website.
I have a question about the sequence in it.
I downloaded the human upstream 1000 bp multiz alignment file
from
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz17way/upstream1000.maf.gz
When I look up my sequence ID, NM_014223, I can find the upstream
1000 bp sequence of this RefSeq gene in the downloaded multiz
alignment file.
I can also search for this ID in the Genome Browser and get the
upstream 1000 bp using the "DNA" or "Tables" menu at the top of
the Genome Browser page.
But I find that these two upstream 1000 bp sequences are totally
different. I think the one from the Genome Browser is right.
But I do not need just the upstream 1000 bp sequence; I need the
alignment with the mouse sequence.
Can I get just the sequence alignment between human and mouse for
all the RefSeq genes and the upstream 1000 or 2000 bp of these
genes? Where can I find it?
I think such ortholog gene alignments (including upstream
regulatory sequence alignments) between two popular genomes would
be very useful.
Thanks.
Anyuan