[galaxy-user] ChIP-seq data analysis question

2 Jun 2012

Hello,

My name is Christopher Terranova and I am an M.S. student at the University of
Buffalo, SUNY. I have been analyzing my MACS data using Galaxy. I already have my
custom peaks on the UCSC Genome Browser and have some specific questions. I am
trying to show how my peaks (and peak center coordinates) relate to gene units
(+/- TSS and genic) and to intergenic regions. I have been attempting this in two
different ways and am not sure whether I am doing it correctly. Below I list the
steps I have been using, with my questions highlighted next to the step they
concern. I apologize for the long e-mail; I have only been working with Galaxy
for about a month, and figuring out all the manipulations is difficult. If
someone can answer my questions I would greatly appreciate it!

These questions relate specifically to promoters:

Retrieving TSS coordinates

    1. Go to the UCSC Genome Browser, click "Tables" at the top of the page, and
select mouse mm9 as the organism
    2. Select "RefSeq genes" as the track and BED as the "output format", and
check "Send output to Galaxy"
    3. Click "Get output", then "Send output to Galaxy"; you are redirected to
your Galaxy account, which now contains an additional dataset
    4. Use the Galaxy "Filter" tool (left column) to select all "+" strand genes
    5. Use the "Cut" tool (left column) to extract columns 1,2,2,4,5,6 (**is the
c2 column repeated twice??**) in order to build a BED file containing the TSS
for all "+" strand genes
    6. Do the same for the genes on the "-" strand
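
As a rough illustration of what the Filter and Cut steps above produce, here is a minimal Python sketch. It assumes tab-separated BED input whose first six columns are chrom, start, end, name, score and strand, and it assumes that for "-" strand genes the TSS is taken from column 3 (the end) instead of column 2. The function name `tss_from_bed6` and the example records are made up for illustration; this is not a Galaxy tool.

```python
def tss_from_bed6(line):
    """Collapse one BED record to a single-position TSS record."""
    chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")[:6]
    # "+" strand: the TSS is the start (c2); "-" strand: the TSS is the end (c3).
    tss = start if strand == "+" else end
    # Repeating the coordinate fills both the start and the end column,
    # which is why the same column number appears twice in the Cut step.
    return "\t".join([chrom, tss, tss, name, score, strand])


# Made-up example records, one per strand.
records = [
    "chr1\t1000\t5000\tNM_0001\t0\t+",
    "chr1\t8000\t9500\tNM_0002\t0\t-",
]
tss_records = [tss_from_bed6(r) for r in records]
for rec in tss_records:
    print(rec)
```

The sketch copies the coordinate as-is and does not adjust for BED's 0-based start convention, so positions may be off by one base depending on how the result is interpreted.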

Computing peak center coordinates

    1. In Galaxy, select the tool "Compute expression on every row" in the left
column (Text Manipulation section)
    2. As the expression, enter c2+(c3-c2+1)/2 and set round result to "YES"
    3. Select the dataset containing the peaks for one of the TFs (HNF4a or CBPA)
and click "Execute"; this creates a new dataset with an additional column
containing the coordinate of the peak center
    4. Now select the tool "Cut" and extract the columns c1,c6,c6,c4,c5 (**is the
c6 column repeated twice??**) to create a new BED file containing the peak center
    5. Edit the metadata of this new dataset (by clicking on the small pencil
icon) and change the format to BED
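
The peak-center steps above can be sketched in Python as follows. This assumes a tab-separated peak file with five columns (chrom, start, end, name, score), so that the computed center would be the sixth column; the function name `peak_center_record` and the example peak are hypothetical. Python's `round` resolves exact .5 ties to the nearest even integer, which may differ by one base from Galaxy's rounding.

```python
def peak_center_record(line):
    """Turn a 5-column peak record into a single-position peak-center record."""
    chrom, start, end, name, score = line.rstrip("\n").split("\t")[:5]
    start, end = int(start), int(end)
    # Same expression as in step 2: c2+(c3-c2+1)/2, rounded to an integer.
    center = round(start + (end - start + 1) / 2)
    # Equivalent of cutting c1,c6,c6,c4,c5: the center fills start and end.
    return "\t".join([chrom, str(center), str(center), name, score])


# Made-up example peak.
example = peak_center_record("chr1\t100\t201\tMACS_peak_1\t55.2")
print(example)
```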

Computing distance to closest TSS

    1. Select the tool "Fetch closest non-overlapping feature", then select the
new dataset containing the peak center coordinates and the dataset containing the
mouse TSS. A new dataset is created containing, for each peak, the closest TSS
    2. Compute the distance from the peak center to the closest TSS using the
"Compute expression on every row" tool (**what expression should I use to do
this?**)
    3. Plot the distribution using the "Histogram of a numeric column" tool
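
One possible way to define the distance in step 2 is a strand-aware signed difference between the peak center and the TSS, sketched below in Python. The function name `signed_distance` and the example coordinates are made up, and the convention that negative values mean "upstream of the TSS" is an assumption. Which column numbers hold the peak center, the TSS and the strand in the joined Galaxy dataset depends on the inputs and should be checked there.

```python
def signed_distance(peak_center, tss, strand):
    """Distance from a TSS to a peak center; negative means upstream of the TSS."""
    offset = peak_center - tss
    # On the "-" strand transcription runs toward lower coordinates,
    # so the sign of the offset is flipped.
    return offset if strand == "+" else -offset


# Made-up (peak_center, tss, strand) triples.
pairs = [
    (1500, 1000, "+"),  # peak downstream of a "+" strand TSS
    (1500, 1000, "-"),  # same coordinates, but upstream on the "-" strand
    (800, 1000, "+"),   # peak upstream of a "+" strand TSS
]
distances = [signed_distance(c, t, s) for c, t, s in pairs]
print(distances)
```

If only the magnitude matters for the histogram, taking the absolute value of the difference avoids the strand handling entirely.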

Second approach: I understand that this does not identify the peak center closest
to the TSS or account for a particular strand; however, I still have a couple of
questions.

Now we have a data set corresponding to all human RefSeqs (34,765) and we want to
convert this set into one corresponding to human promoter regions. First, we will
make sure our data set just contains the start and end coordinates of the genes.
Select the "Text Manipulation" tool and then "Cut" columns from a table. Set "cut
columns" to "c1,c2,c3,c4,c6" (**Is this the right c1... conformation??**). Make
sure our previously downloaded RefSeq data set is selected and click on
"Execute". When this is finished, click on the pencil icon to assign names to the
columns. Set name to "RefSeqs", click "save" and change the data type to
"interval" and click "save". Now click the pencil icon again to define the
columns. Set the start column to "2" and the end column to "3", the strand column
to "5" and the "Name/Identifier" column to "4" and click on "save". Now, go to
the "Operate on Genomic Intervals" section of the "Tools" menu and select "Get
flanks" to get the flanking regions for the RefSeq data set we just created. Make
sure our RefSeq data set is selected and we want to get the "upstream" flanking
regions for this data set. Set the length of the flanking region to 1000 to get
the coordinates for 1kb upstream. Later on we could use different intervals.
Click on "Execute". When this has finished, go to "Operate on Genomic Intervals"
again and select "Join". Now set "First query" to "Get flanks.." and "Second
query" to the peaks file of the "MACS" output and then click on "Execute". We now
end up with 710 regions where our ChIP-Seq peaks overlap with our 1kb upstream
region (promoter region).
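
The flank-and-join procedure just described can be sketched in Python as follows. This is only an illustration under stated assumptions, not Galaxy's implementation: it assumes the flank is strand-aware (upstream of a "-" strand gene lies past its end coordinate), that intervals are half-open, and that any shared base counts as an overlap. The function names, gene names and coordinates are all made up.

```python
def upstream_flank(start, end, strand, length=1000):
    """Return (start, end) of the region `length` bp upstream of a gene."""
    if strand == "+":
        # Upstream of a "+" strand gene lies before its start coordinate.
        return max(0, start - length), start
    # Upstream of a "-" strand gene lies after its end coordinate.
    return end, end + length


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals on the same chromosome share a base."""
    return a_start < b_end and b_start < a_end


# Made-up genes (chrom, start, end, name, strand) and peaks (chrom, start, end).
genes = [("chr1", 5000, 9000, "geneA", "+"), ("chr1", 20000, 24000, "geneB", "-")]
peaks = [("chr1", 4200, 4600), ("chr1", 30000, 30400)]

hits = []
for chrom, start, end, name, strand in genes:
    f_start, f_end = upstream_flank(start, end, strand)
    for p_chrom, p_start, p_end in peaks:
        if p_chrom == chrom and overlaps(f_start, f_end, p_start, p_end):
            hits.append((name, p_start, p_end))
print(hits)
```

Changing `length` corresponds to trying different promoter sizes, as mentioned above.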

Lastly, while not discussed here, what exactly does the offset command do when
getting flanks? 

Thank you very much and again, I apologize for the extensive questions!

Sincerely,
Christopher Terranova

cjt5＠buffalo.edu