Hi, I'm trying to determine, for each gene in the human genome, which
exons are the most constitutively expressed. To do this, I'd like to
combine expression data (RNA-seq or microarray) with exon annotations
(UCSC track). Then, for each gene, I'd like to pick the 1 or 2 exons
with the highest levels of expression (my proxy for constitutiveness).
An additional nicety would be to somehow work in a preference for 5'
exons. For example, let's say a gene has 3 exons and, with the
expression data, all 3 exons are equally expressed. I'd like to
selectively get the first 2 exons.
I've started learning Galaxy and was able to import BED files for
UCSC exons (as in the Galaxy 101 tutorial) and a BED file for Affy
microarray expression data. (I also tried importing the Burge RNA-seq
track as BED but couldn't get it to work.) I did an inner join on
genomic intervals to join the expression data with the exons and sorted
them from most expressed to least. But how do I sort within genes? That
is, how do I get the top 2 exons per gene (highest expressing exons per
gene) and, if there are more than 2 with equally high expression, how do
I preferentially get the 5' exons?
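
To make the selection I'm after concrete, here is a rough sketch of the
per-gene logic in Python. It assumes the joined table has already been
reduced to (gene, exon rank, expression) rows, with exon rank counted
from the 5' end (so exons of minus-strand genes would need to be numbered
in reverse genomic order). The function name and the example values are
made up for illustration; this is not a Galaxy tool.

```python
from collections import defaultdict


def top_exons_per_gene(rows, n=2):
    """Return {gene: [(exon_rank, expression), ...]} with the n best exons.

    rows: iterable of (gene, exon_rank, expression) tuples, where exon_rank
    is assumed to count from the 5' end of the gene (1 = most 5').
    """
    by_gene = defaultdict(list)
    for gene, exon_rank, expression in rows:
        by_gene[gene].append((exon_rank, expression))

    result = {}
    for gene, exons in by_gene.items():
        # Highest expression first; ties are broken in favor of the
        # more 5' exon (smaller exon_rank).
        ranked = sorted(exons, key=lambda e: (-e[1], e[0]))
        result[gene] = ranked[:n]
    return result


# Made-up example values, not real expression data.
example = [
    ("geneA", 1, 5.0), ("geneA", 2, 5.0), ("geneA", 3, 5.0),
    ("geneB", 1, 1.2), ("geneB", 2, 7.5), ("geneB", 3, 3.1),
]
selected = top_exons_per_gene(example)
```

For geneA, where all 3 exons are equally expressed, this keeps exons 1
and 2; for geneB it keeps the two highest-expressing exons regardless of
position.
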
I'm also open to ways of doing this without Galaxy. I want to do
this for an entire genome, so I figured it would be good to have a
Galaxy workflow that I could then apply to other genomes as needed.
Thanks for any help.