Galaxy for gene expression comparison

Martin, David A.

18 Jan 2011

12:35 a.m.

Hello,

I am comparing RNA expression in two groups of rats, a drug-treated group against a control group. There are 10 biological replicates in each group. I am unsure how to run this analysis through Galaxy using Tophat followed by Cufflinks/Cuffcompare/Cuffdiff. Should the files for each group be merged at any point? I would think they should be kept separate in order to properly account for the spread across animals. I am just a little unsure of how to group the files in Galaxy, and where to differentiate biological and technical replicates.

On a different note, is there a way to control the Bowtie mapping parameters more closely when using Tophat?

Thank you for any kind of knowledge on these matters!

-David Martin


Jeremy Goecks

18 Jan

9:29 p.m.

...

I am comparing RNA expression in two groups of rats, a drug treated group against a control group. There are 10 biological replicates in each group. I am unsure of how to flow this analysis through Galaxy using Tophat followed by Cufflinks/compare/diff. Should the files for each group be merged at any point? I would think they should be kept separate in order to properly account for the spread across animals. I am just a little unsure of how to group the files on galaxy, and where to differentiate biological and technical replicates.

Hi David,

Yes, you're right: merging along the way will prevent you from quantifying within-group variation; consequently, quantifying across-group variation will be very challenging as well. Here's the right thing to do:

(1) map each replicate using Tophat and assemble transcripts using Cufflinks;
(2) for all Cufflinks outputs (assembled transcripts), build a set of comprehensive transcripts using Cuffcompare;
(3) for Cuffdiff, group the replicates from each group and let Cuffdiff determine and quantitate within-group and across-group variation.

However, Galaxy's tools currently don't support replicates, so you can't yet perform this analysis. We're working to enhance them, however, and we should have this functionality available on our main server in the next couple weeks. Enis can comment about when this functionality will be available on the cloud.

(To be clear, you can perform step 1 using Galaxy now. You can also perform step 2, but you'll have to do so by repeatedly merging Cufflinks outputs. You cannot perform step 3 right now with Galaxy.)

...

On a different note, is there a way to control the bowtie mapping parameters more closely when using tophat?

There's limited control that you can exert over the bowtie commands within Tophat. Looking at the Tophat manual: http://tophat.cbcb.umd.edu/manual.html it looks like max-multihits (Maximum number of alignments) is the only Bowtie parameter you can directly control. There are, however, many Tophat parameters that enable you to control splice junction mapping directly; set Tophat's settings to 'Full parameter list' to see all the parameters you can control. What exactly are you looking to do? Thanks, J.
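For readers running Tophat outside Galaxy, the point about max-multihits can be sketched as a command line. This is only a sketch: the `--max-multihits` and `--output-dir` options are taken from the Tophat manual, while the index prefix, read file, output directory, and the value 10 are hypothetical placeholders, not settings recommended in this thread.

```python
def tophat_command(index_prefix, reads_fastq, out_dir, max_multihits=None):
    """Build a Tophat argument list; nothing is executed here."""
    argv = ["tophat", "--output-dir", out_dir]
    if max_multihits is not None:
        # Maximum number of alignments reported per read.
        argv += ["--max-multihits", str(max_multihits)]
    # Positional arguments: Bowtie index prefix, then the reads file.
    argv += [index_prefix, reads_fastq]
    return argv


# Hypothetical names, for illustration only.
cmd = tophat_command("rat_index", "control_01.fastq", "tophat_control_01",
                     max_multihits=10)
print(" ".join(cmd))
```

Building the argument list separately from running it makes it easy to inspect the settings for each lane before anything is submitted.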

Martin, David A.

20 Jan

2:41 p.m.

Jeremy,

Thank you for the suggestions. Ultimately, I would like to compare gene (isoform) expression between two groups of 10 animals with one lane per animal. I am using the public server to practice with some small data sets right now, but will be getting the real data very soon and plan on using an Amazon Cloud account to actually do the analysis. I can see now that this approach is going to be met with some difficulty given the current data volume restrictions and the limited functionality of Galaxy for Cuffcompare/Cuffdiff. Can you comment any further on the timeline of the availability of the full functionality of these programs? You seemed to suggest they will be available on the public server before they are available on the Cloud?

Also, for the time being, would you mind clarifying for me what you mean by repeatedly merging Cufflinks outputs? I imagine using Tophat to map the reads and find splice junctions and assembling transcripts using Cufflinks for each of the 20 animals. Are you talking about running the Cufflinks GTF output through Cuffcompare, which allows two GTF files in Galaxy, and merging that output (the union file) with the third Cufflinks file and so on for all ten animals? Then doing the same thing for the other group of ten animals, and then comparing the two for a rough idea of the differences?

I guess I'm wondering how far I will be able to get with the analysis as things stand on the Cloud or the public server.... I also need to come up with a strategy to work around the 1000 GB space limit, as with 20 samples of 25 million reads and repeatedly generating files I think it will get used up quickly....

As far as changing the Bowtie options through Tophat, I was just going to play around with the Bowtie mapping settings to get an idea of which strategy is optimal and use those settings for Tophat, but this is probably unnecessary in the grand scheme of the analysis.
I really appreciate your help - Thanks, David

Jeremy Goecks

21 Jan

8:40 a.m.

...

Thank you for the suggestions. Ultimately, I would like to compare gene(isoform) expression between two groups of 10 animals with one lane per animal. I am using the public server to practice with some small data sets right now, but will be getting the real data very soon and plan on using an Amazon Cloud account to actually do the analysis. I can see now that this approach is going to be met with some difficulty with the current state of the data volume restrictions and limited functionality of Galaxy for Cuffcompare/diff. Can you comment any further on the timeline of the availability of the full functionality of these programs?

Hi David, Cuffcompare and Cuffdiff generate many more outputs than most other tools; specifically, both generate multiple output files for each additional input given. While Galaxy can handle an arbitrary number of inputs easily, handling so many outputs is challenging and requires extending the framework to handle so many output files. I did a bit of the necessary work this week, but more work is required and the path forward is a bit murky. I'm still hoping to have it available in a couple weeks, but no guarantees. Also, this is a good time to mention that we welcome code patches/contributions; if you can make something work in Galaxy, we'll review the code and, if it looks good, integrate it into our code base.

...

You seemed to suggest they will be available on the public server before they are available on the Cloud?

I did not mean to imply this, only that the Cloud folks have their own process and schedule for rolling out changes, and I do not know their schedule.

...

Also, for the time being, would you mind clarifying for me what you mean by repeatedly merging Cufflinks outputs? I imagine using Tophat to map the reads and find splice junctions and assembling transcripts using Cufflinks for each of the 20 animals. Are you talking about running the Cufflinks GTF output through Cuffcompare, which allows two GTF files in Galaxy, and merging that output(the union file) with the third Cufflinks file and so on for all ten animals? Then do the same thing for the other group of ten animals, and then comparing the two for a rough idea of the differences?

Cuffdiff requires a GTF reference file; this file contains the transcripts that will be used for comparing samples/replicates. If you're looking only at existing transcripts, using one from UCSC works fine and no merging is necessary. However, if you're looking for novel transcripts, you'll want to use the combined transcripts file that Cuffcompare produces. In this case, you'll want to iteratively merge the Cufflinks' outputs for all 20 animals so that you have a complete list of transcripts for Cuffdiff.
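The iterative merge described here can be sketched as a simple fold: combine the first two Cufflinks GTFs, then combine that result with the third, and so on. The sketch below only plans the order of the pairwise Cuffcompare runs; it does not call Cuffcompare, and every file name in it is a hypothetical placeholder.

```python
def pairwise_merge_plan(gtf_files, prefix="merge"):
    """Return (input_a, input_b, output) steps that fold the GTFs two at a time."""
    if len(gtf_files) < 2:
        raise ValueError("need at least two GTF files to merge")
    steps = []
    current = gtf_files[0]
    for i, nxt in enumerate(gtf_files[1:], start=1):
        out = f"{prefix}_{i:02d}.combined.gtf"  # hypothetical output name
        steps.append((current, nxt, out))
        current = out  # the combined file feeds the next pairwise run
    return steps


# One Cufflinks GTF per animal (hypothetical names), all 20 animals together.
animals = [f"animal_{n:02d}.transcripts.gtf" for n in range(1, 21)]
plan = pairwise_merge_plan(animals)
for a, b, out in plan[:3]:
    print(f"Cuffcompare({a}, {b}) -> {out}")
print(f"{len(plan)} merge steps; final reference: {plan[-1][2]}")
```

Note that the fold runs over all 20 animals at once rather than once per group, since Cuffdiff needs a single reference covering every sample.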

...

I guess I'm wondering how far I will be able to get with the analysis as things stand on the Cloud or the public server....

You should be able to get some preliminary results. As it stands now, you can run pairwise comparisons through the Tophat-->Cufflinks-->Cuffcompare-->Cuffdiff pipeline. You might try looking at two different pairwise comparisons and seeing how many similarities/differences there are.

...

I also need to come up with a strategy to work around the 1000Gb space limit, as with 20 samples of 25 million reads and repeatedly generating files I think it will get used up quickly....

This is a question that Enis can comment on. Best, J.

Rory Kirchner

9:13 a.m.

I am not sure about Cuffcompare, but Cuffdiff doesn't generate any extra files if you add more groups and replicates to the command line. It adds columns to the output files, but the number of files remains the same.

For a workflow for Martin for now, I would suggest doing this for making calls with no novel genes:

1) upload your reads
2) FASTQ-groom them into Sanger format
3) run Tophat on each lane individually
4) run Cuffcompare with the GTF file you downloaded from UCSC or wherever against itself; this puts it in a nice format to use with Cuffdiff
5) merge the BAM files from Tophat for the 10 lanes from each group into one file
6) run Cuffdiff using the transcript GTF output file from Cuffcompare and the two merged BAM files

Merging is kind of crappy because you lose the replicate variation information, but it's the best you can do now. I have patched Galaxy to have Cuffdiff handle replicates and to do normalization; when that gets merged into the main branch, your workflow will be the same except you won't have to merge all of the BAM files from each condition together to use Cuffdiff.

-rory
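Steps 5 and 6 of Rory's workaround can be sketched as command lines for anyone reproducing it outside Galaxy. This is a rough sketch under assumptions: `samtools merge` stands in for whatever BAM-merging tool Galaxy provides, the file names are hypothetical, and the commands are only assembled, not run.

```python
def merged_workaround_commands(reference_gtf, groups):
    """groups maps a condition name to its list of per-lane BAM files."""
    commands = []
    merged = []
    for name, bams in groups.items():
        out = f"{name}_merged.bam"  # hypothetical output name
        # Step 5: pool all lanes of one condition into a single BAM.
        commands.append(["samtools", "merge", out] + list(bams))
        merged.append(out)
    # Step 6: one merged BAM per condition, so no replicate information remains.
    commands.append(["cuffdiff", reference_gtf] + merged)
    return commands


groups = {
    "control": [f"control_{n:02d}.bam" for n in range(1, 11)],
    "treated": [f"treated_{n:02d}.bam" for n in range(1, 11)],
}
cmds = merged_workaround_commands("reference.combined.gtf", groups)
for c in cmds:
    print(" ".join(c))
```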

Jeremy Goecks

9:40 a.m.

...

I am not sure about cuffcompare, but cuffdiff doesn't generate any extra files if you add more groups and replicates to the command line. It adds columns to the output files but the number of files remains the same.

Hmm. This is a new and welcome change.

...

For a workflow for Martin for now, I would suggest doing this for making calls with no novel genes:

1) upload your reads
2) fastq groom them into sanger format
3) run tophat on each lane individually
4) run cuffcompare with the gtf file you downloaded from UCSC or wherever against itself, this puts it in a nice format to use with cuffdiff
5) merge the bam files from tophat for the 10 lanes from each group into one file
6) run cuffdiff using the transcript gtf output file from cuffcompare and the two merged bam files

You can also run this workflow easily enough for de novo transcripts as well; only change is whether Cufflinks is fed a reference GTF.

...

I have patched galaxy to have cuffdiff handle replicates and to do normalization, when that gets merged into the main branch your workflow will be the same except you won't have to merge all of the bam files from each condition together to use cuffdiff.

Yes, looking into this soon. J.

David Matthews

11:48 a.m.

New subject: Stalled cuffdiff run

Hi Jeremy,

I have a stalled cufflinks run. It's been queued all day. Any idea why it's stalled?

Cheers,
David

__________________________________
Dr David A. Matthews
Senior Lecturer in Virology
Room E49
Department of Cellular and Molecular Medicine,
School of Medical Sciences
University Walk,
University of Bristol
Bristol. BS8 1TD
U.K.
Tel. +44 117 3312058
D.A.Matthews@bristol.ac.uk

Jeremy Goecks

12:41 p.m.

...

I have patched galaxy to have cuffdiff handle replicates and to do normalization, when that gets merged into the main branch your workflow will be the same except you won't have to merge all of the bam files from each condition together to use cuffdiff.

Hi all, I merged Rory's changes into galaxy-central, so Cuffdiff now supports replicates. I'll see what I can do for Cuffcompare; in the near-term, repeated merging using Cuffcompare will produce a GTF file that is both correct and usable with Cuffdiff. Thanks, J.
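With replicate support, the per-animal BAM files can be passed to Cuffdiff directly instead of being merged. On the Cuffdiff command line, replicates of one condition are joined with commas and conditions are separated by spaces. The sketch below only assembles such a command under that assumption; all file names are hypothetical.

```python
def cuffdiff_with_replicates(reference_gtf, groups):
    """groups maps a condition name to its list of replicate BAM files."""
    argv = ["cuffdiff", reference_gtf]
    for name, bams in groups.items():
        if not bams:
            raise ValueError(f"condition {name!r} has no replicates")
        # Comma-joined replicates form one positional argument per condition.
        argv.append(",".join(bams))
    return argv


replicate_groups = {
    "control": [f"control_{n:02d}.bam" for n in range(1, 11)],
    "treated": [f"treated_{n:02d}.bam" for n in range(1, 11)],
}
replicate_cmd = cuffdiff_with_replicates("reference.combined.gtf", replicate_groups)
print(" ".join(replicate_cmd))
```

Keeping each animal as its own replicate is what lets Cuffdiff estimate within-group variation, which the merged-BAM workaround cannot do.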

Martin, David A.

31 Jan

2:33 p.m.

New subject: Using the public server for analysis

Hello,

I have a total of 84 GB of Illumina reads (20 lanes total). I wasn't sure if I could analyze this amount of data on the public server or if this would bog down the system. I am looking to do a gene expression comparison between two groups of 10 animals using Tophat and Cufflinks. Can anyone tell me if this is okay, or if I must use the Cloud for this analysis? I am assuming the FTP option is the best for uploading this data.

Thank you,
David Martin

Jennifer Jackson

7 Feb

9:39 p.m.

New subject: Using the public server for analysis

Hello David, Given the data size, using a local instance is your best option right now. Fairly soon, using the cloud would also work (as long as the analysis keeps total size under the 1TB). There are a few issues that prevent us from recommending the cloud right now, but we are actively working to bring functionality up to full speed, so feel free to check back soon (~ few weeks) if you'd like an update. Best wishes for your project, Jen Galaxy team

_______________________________________________ galaxy-user mailing list galaxy-user@lists.bx.psu.edu http://lists.bx.psu.edu/listinfo/galaxy-user

-- Jennifer Jackson http://usegalaxy.org http://galaxyproject.org
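The storage question in this thread comes down to simple arithmetic on the figures already given: 84 GB of raw reads across 20 lanes, against a cloud limit of 1 TB. The sketch below works that out, assuming 1 TB is treated as 1000 GB; it makes no claim about how large the Tophat or Cufflinks outputs actually are.

```python
RAW_READS_GB = 84      # total Illumina reads, from the thread
LANES = 20             # one lane per animal
CLOUD_LIMIT_GB = 1000  # 1 TB limit, assumed here to mean 1000 GB


def storage_budget(raw_gb, lanes, limit_gb):
    """Return (GB per lane, limit as a multiple of raw data, GB left after upload)."""
    per_lane = raw_gb / lanes
    max_multiple = limit_gb / raw_gb
    remaining = limit_gb - raw_gb
    return per_lane, max_multiple, remaining


per_lane, max_multiple, remaining = storage_budget(RAW_READS_GB, LANES, CLOUD_LIMIT_GB)
print(f"{per_lane:.1f} GB per lane")
print(f"limit is {max_multiple:.1f}x the raw data")
print(f"{remaining} GB left for all derived files")
```

In other words, the raw reads, groomed copies, alignments, and every intermediate dataset together have to stay under roughly twelve times the size of the raw upload.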
