Hi folks, I'm trying to find, for each gene across the entire human genome, which exons are the most constitutively expressed. To do this, I'd like to combine expression data (RNA-seq or microarray) with exon data (UCSC track). Then, for each gene, I'd like to pick the 1 or 2 exons with the highest levels of expression (my proxy for constitutiveness).
An additional nicety would be to somehow work in a preference for 5' exons. For example, let's say a gene has 3 exons and, with the expression data, all 3 exons are equally expressed. I'd like to selectively get the first 2 exons.
I've started learning Galaxy and was able to import BED files for UCSC exons (as in the Galaxy 101 tutorial) and a BED file for Affy microarray expression data. (I also tried importing the Burge RNA-seq track as BED but couldn't get it to work.) I did an inner join on genomic sequences to join the expression data with the exons and sorted them from most expressed to least. But how do I sort within genes? That is, how do I get the top 2 exons per gene (the most highly expressed exons per gene) and, if there are more than 2 with equally high expression, how do I preferentially get the 5' exons?
I'm also open to ways to do this without using Galaxy, etc. I want to do this for an entire genome, so I figured it would be good to have a Galaxy workflow, which I could then apply to other genomes as needed.
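In case it helps pin down what I mean, the selection rule can be sketched in plain Python. This is only a rough sketch under assumptions, not a tested pipeline: the `Exon` fields, the `five_prime_order` and `top_exons_per_gene` names, and the toy values are all made up for illustration, and it assumes the joined table already has one row per exon with a gene ID, start coordinate, strand, and a single expression value.

```python
from collections import defaultdict
from typing import NamedTuple


class Exon(NamedTuple):
    gene: str          # gene identifier (hypothetical column from the joined table)
    start: int         # exon start coordinate
    strand: str        # "+" or "-"
    expression: float  # expression value joined onto this exon


def five_prime_order(exon):
    # On the minus strand the 5' end has the largest genomic coordinate,
    # so negate the start to make 5' exons sort first on either strand.
    return exon.start if exon.strand == "+" else -exon.start


def top_exons_per_gene(exons, n=2):
    # Group exons by gene.
    by_gene = defaultdict(list)
    for exon in exons:
        by_gene[exon.gene].append(exon)
    # Within each gene: highest expression first, ties go to the 5'-most exon.
    return {
        gene: sorted(group, key=lambda e: (-e.expression, five_prime_order(e)))[:n]
        for gene, group in by_gene.items()
    }


# Toy data (invented values): geneA has three equally expressed exons,
# geneB is on the minus strand with a tie for the highest expression.
exons = [
    Exon("geneA", 100, "+", 5.0),
    Exon("geneA", 300, "+", 5.0),
    Exon("geneA", 500, "+", 5.0),
    Exon("geneB", 900, "-", 2.0),
    Exon("geneB", 700, "-", 8.0),
    Exon("geneB", 400, "-", 8.0),
]
picked = top_exons_per_gene(exons)
for gene, chosen in picked.items():
    print(gene, [e.start for e in chosen])
```

For geneA this keeps the first two exons (starts 100 and 300), which is the 5' preference I described above. Sorting on a two-part key (expression descending, then 5' position) handles both the ranking and the tie-break in one pass.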
Thanks for any help. jim