Thank you for the suggestions. Ultimately, I would like to compare gene (isoform) expression between two groups of 10 animals, with one lane per animal. I am using the public server to practice with some small data sets right now, but will be getting the real data very soon and plan on using an Amazon Cloud account to actually do the analysis. I can see now that this approach will run into difficulty given the current data volume restrictions and the limited functionality of Galaxy for Cuffcompare/Cuffdiff. Can you comment any further on when the full functionality of these programs will be available?
Hi David, Cuffcompare and Cuffdiff generate many more outputs than most other tools; specifically, both generate multiple output files for each additional input given. While Galaxy can handle an arbitrary number of inputs easily, handling that many outputs is challenging and requires extending the framework. I did a bit of the necessary work this week, but more work is required and the path forward is a bit murky. I'm still hoping to have it available in a couple of weeks, but no guarantees. Also, this is a good time to mention that we welcome code patches/contributions; if you can make something work in Galaxy, we'll review the code and, if it looks good, integrate it into our code base.
You seemed to suggest they will be available on the public server before they are available on the Cloud?
I did not mean to imply this, only that the Cloud folks have their own process and schedule for rolling out changes, and I do not know their schedule.
Also, for the time being, would you mind clarifying for me what you mean by repeatedly merging Cufflinks outputs? I imagine using Tophat to map the reads and find splice junctions, then assembling transcripts using Cufflinks for each of the 20 animals. Are you talking about running the Cufflinks GTF output through Cuffcompare, which allows two GTF files in Galaxy, and merging that output (the union file) with the third Cufflinks file and so on for all ten animals? Then doing the same thing for the other group of ten animals, and then comparing the two for a rough idea of the differences?
Cuffdiff requires a GTF reference file; this file contains the transcripts that will be used for comparing samples/replicates. If you're looking only at existing transcripts, using one from UCSC works fine and no merging is necessary. However, if you're looking for novel transcripts, you'll want to use the combined transcripts file that Cuffcompare produces. In this case, you'll want to iteratively merge the Cufflinks outputs for all 20 animals, not each group separately, so that you have a complete list of transcripts for Cuffdiff.
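To make the iterative merge concrete, here is a rough sketch of the idea in Python. It is only an illustration, not how Galaxy or Cuffcompare is implemented: `merge_pair` is a hypothetical stand-in for a single two-input Cuffcompare run, and each animal's Cufflinks output is modeled as a plain set of made-up transcript IDs.

```python
from functools import reduce


def merge_pair(combined, new_transcripts):
    """Hypothetical stand-in for one two-input Cuffcompare run.

    Here a "merge" is just the union of two sets of transcript IDs.
    """
    return combined | new_transcripts


def iterative_merge(per_animal_transcripts):
    """Fold all per-animal transcript sets into one combined set,
    merging the running result with one more animal at each step."""
    return reduce(merge_pair, per_animal_transcripts)


# Toy data: three animals with made-up transcript IDs (assumption).
animals = [
    {"tx1", "tx2"},
    {"tx2", "tx3"},
    {"tx3", "tx4", "tx5"},
]

combined = iterative_merge(animals)
print(sorted(combined))
```

Merging two inputs at a time, 20 animals would take 19 merge steps, and the final combined file is the one to hand to Cuffdiff.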
I guess I'm wondering how far I will be able to get with the analysis as things stand on the Cloud or the public server....
You should be able to get some preliminary results. As it stands now, you can run pairwise comparisons through the Tophat --> Cufflinks --> Cuffcompare --> Cuffdiff pipeline. You might try running two different pairwise comparisons and seeing how much their results agree or differ.
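One simple way to look at the similarities/differences between two pairwise runs is to compare the sets of genes each run flags. The sketch below assumes you have already pulled the flagged gene IDs out of each run's results yourself; the function name and the gene IDs are made up for illustration.

```python
def compare_calls(run_a, run_b):
    """Split two sets of flagged gene IDs into shared and run-specific calls."""
    shared = run_a & run_b   # flagged in both pairwise comparisons
    only_a = run_a - run_b   # flagged only in the first comparison
    only_b = run_b - run_a   # flagged only in the second comparison
    return shared, only_a, only_b


# Made-up gene IDs from two hypothetical pairwise comparisons.
run_1 = {"geneA", "geneB", "geneC"}
run_2 = {"geneB", "geneC", "geneD"}

shared, only_1, only_2 = compare_calls(run_1, run_2)
print(len(shared), "shared;", len(only_1), "only in run 1;", len(only_2), "only in run 2")
```

A large shared set relative to the run-specific sets would suggest the two comparisons are telling a consistent story; with single animals per side, though, this is only a rough check.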
I also need to come up with a strategy to work around the 1000 GB space limit; with 20 samples of 25 million reads each and files being generated repeatedly, I think it will get used up quickly....
This is a question that Enis can comment on. Best, J.