Guru,

Thanks for the handy tip on getting rid of duplicates. The join file now contains 130 items with no duplicates. I guess there is a mismatch between what SNP130 considers a missense mutation and what UCSC Genes considers to be coding sequence.

Paul

Guruprasad Ananda wrote:
Hi Paul,
* The SNP file contains 149 regions but when joined to the Codons there are 311 items in the output. I was expecting one joined record per SNP.
* The joined file contains many duplicate SNPs and missing SNPs
Your gene list might contain several overlapping genes/reading frames, so when you fetch codons the same positions will be present multiple times. As a result, a given SNP might join with multiple codons from overlapping genes/reading frames. If you want to avoid this, you can remove duplicate codons using the "Statistics > Count" tool (with the chr, start and end columns selected).

Please note that this tool returns tabular output. You'll need to click on the pencil icon next to the output dataset, change the datatype to 'interval', and set the chr, start and end columns to 2, 3 and 4 respectively.
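For anyone who prefers to do the same thing outside Galaxy, here is a minimal sketch of the idea in Python. It is not what the Count tool does internally; it simply keeps the first codon seen for each (chr, start, end) triple. The function name `unique_codons` is made up for this example, and the second input row is invented to show an overlap. Note that it also collapses codons that share coordinates but differ in strand or frame, which may or may not be what you want.

```python
import csv
import io

def unique_codons(lines):
    """Keep the first codon row seen for each (chrom, start, end) triple."""
    seen = set()
    kept = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        key = (row[0], int(row[1]), int(row[2]))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# Tab-separated codon BED rows. The second row is an invented duplicate
# position from a different (overlapping) transcript, for illustration only.
codons = io.StringIO(
    "chr21\t9908330\t9908333\tuc002zka.1\t0\t-\n"
    "chr21\t9908330\t9908333\tuc010gqn.1\t0\t-\n"
    "chr21\t9908333\t9908336\tuc002zka.1\t0\t-\n"
)
deduped = unique_codons(codons)
for row in deduped:
    print("\t".join(row))
```

Joining the SNPs against the de-duplicated codons should then give at most one record per SNP position.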
Hope this answers some of your questions.

Thanks for using Galaxy,
Guru.
On Jun 1, 2010, at 4:34 AM, Paul Webster wrote:
Hi,
I'm trying to investigate conservation in SNPs using Galaxy, but running into a few "issues" so I'm probably not doing this the best way.
Here is what I did in Galaxy:
(1) Get some high heterozygosity missense SNPs from UCSC for chr21
(2) Get all Genes from UCSC for chr21
(3) Split the genes into codons using the "Gene BED to Codon BED expander"
(4) Join the SNPs (1) to the Codons (3) using {Operate on genomic intervals}->Join
(5) Create a multiple alignment for the codons which had SNPs using {Fetch Alignments}->{Extract MAF blocks}
Some problems I found were:
* The SNP file contains 149 regions but when joined to the Codons there are 311 items in the output. I was expecting one joined record per SNP.
* The joined file contains many duplicate SNPs and missing SNPs
* MAF blocks are all in the same orientation but about half the codons should be in the reverse direction
Can anyone offer advice?
Thanks, Paul
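On the orientation problem, one possible workaround outside Galaxy is to reverse-complement the aligned sequence for codons on the minus strand. The sketch below assumes the alignment rows are plain strings on the plus strand with "-" as the gap character; the helper names `reverse_complement` and `orient` are made up for this example and are not Galaxy functions.

```python
# Complement table for upper- and lower-case bases; gap characters ("-")
# and any other symbols are left unchanged by str.translate.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of an (optionally gapped) DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def orient(seq, strand):
    """Return seq as read on the given strand ('+' or '-')."""
    return reverse_complement(seq) if strand == "-" else seq

# A plus-strand triplet GTC reads as GAC on the minus strand.
print(orient("GTC", "-"))
# Plus-strand codons are returned untouched.
print(orient("TTG", "+"))
```

Every row of an alignment block would need the same treatment so that the columns stay aligned after reversal.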
******************************************************************
sample output
******************************************************************
(1) SNPs (149 records)
chr21 15436474 15436475 rs3859679 missense TAT,TTT, Y,F,
chr21 15481364 15481365 rs7278737 missense GAC,GAA, D,E,
chr21 15516947 15516948 rs2822432 missense GAA,AAA, E,K,
(2) Genes (901 records)
chr21 9690070 9690100 uc002zkg.1 0 + 9690070 9690070 0 1 30, 0,
chr21 9711934 9769223 uc011abu.1 0 + 9711934 9711934 0 10 104,31,70,82,29,73,71,164,195,379, 0,34186,36895,40899,43769,43889,49915,54029,55562,56910,
chr21 9907192 9908487 uc010gqn.1 0 - 9907192 9907192 0 2 982,210, 0,1085,
(3) Codons (327,371 records)
chr21 9908330 9908333 uc002zka.1 0 -
chr21 9908333 9908336 uc002zka.1 0 -
chr21 9908336 9908339 uc002zka.1 0 -
(4) Join (311 records)
chr21 15481364 15481365 rs7278737 missense GAC,GAA, D,E, chr21 15481364 15481367 uc002yjm.2 0 - GAC
chr21 15516947 15516948 rs2822432 missense GAA,AAA, E,K, chr21 15516945 15516948 uc002yjm.2 0 - GAA
chr21 15596771 15596772 rs409782 missense TTG,GTG, L,V, chr21 15596771 15596774 uc002yjn.3 0 + TTG
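A quick way to see which SNPs are duplicated or missing in a join like the one above is to compare the rs IDs (column 4) of the two files. The sketch below is only an illustration: the function name `join_report` is made up, and the joined ID list contains an invented duplicate row so that both cases show up.

```python
from collections import Counter

def join_report(snp_ids, joined_ids):
    """Return (duplicated, missing): IDs joined more than once with their
    counts, and IDs from the SNP file that never appear in the join."""
    counts = Counter(joined_ids)
    duplicated = {rs: n for rs, n in counts.items() if n > 1}
    missing = [rs for rs in snp_ids if rs not in counts]
    return duplicated, missing

# rs IDs taken from the SNP sample above.
snp_ids = ["rs3859679", "rs7278737", "rs2822432"]
# Illustrative join column: the repeated rs7278737 row is invented here.
joined_ids = ["rs7278737", "rs7278737", "rs2822432"]

duplicated, missing = join_report(snp_ids, joined_ids)
print("duplicated:", duplicated)
print("missing:", missing)
```

SNPs reported as missing are the ones that overlap no codon at all in the gene set.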
_______________________________________________
galaxy-user mailing list
galaxy-user@lists.bx.psu.edu
http://lists.bx.psu.edu/listinfo/galaxy-user