OK OK, cloud computing is expensive.

But I also know from my own experience that you can cut I/O by a factor of 10-20 and CPU by roughly a factor of 10 as well:
- use bowtie for mapping (though its index is quite big): saves a lot of CPU
- compress the input fastq files (cuts their size to about 1/5) and only ever read the compressed files
- extreme solution: strip all quality values from the fastq (reduces size to 1/4)
- remove all file-concatenation steps
- pipe the mapper output into samtools to convert to BAM immediately after mapping, and always save in BAM format
- strip all unmapped reads directly with samtools view -F 4 (a rough sketch of such a pipeline is below)
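
To make that concrete, here is a minimal, untested sketch of the kind of pipeline I mean for the last two points; the index name and file names are just placeholders, and it assumes bowtie's -S SAM output and the samtools 0.1.x command line:

  # map gzipped reads, convert straight to BAM and drop unmapped reads,
  # without ever writing SAM or uncompressed fastq to disk
  zcat sample.fastq.gz \
    | bowtie -S hg19_index - \
    | samtools view -bS -F 4 - > sample.mapped.bam

  # extreme option: throw away the qualities by converting fastq to fasta
  # (bowtie then needs -f to accept fasta input)
  awk 'NR%4==1{print ">" substr($0,2)} NR%4==2{print}' sample.fastq > sample.fa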

But I wonder how much that would save in the end...?

cheers
Max
--
Maximilian Haussler
Tel: +447574246789
http://www.manchester.ac.uk/research/maximilian.haussler/


On Tue, Nov 23, 2010 at 9:02 PM, Enis Afgan <eafgan@emory.edu> wrote:
It's just that computing, and cloud computing along with it, is expensive. Depending on the usage, either the EBS volumes or the CPU time (i.e., the instances) will represent the majority of the cost. Most likely it will be the instances, unless you use very few instances for a short period and a lot of storage.
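
As a rough illustration (using approximate late-2010 us-east list prices, which you should double-check against the current AWS pricing page): ten m1.large instances running for 24 hours cost about 10 x 24 x $0.34, roughly $82, while keeping 500 GB on EBS costs about 500 x $0.10 = $50 per month (plus a small per-I/O-request charge). So a single day of a modest cluster already costs more than a month of half a terabyte of storage.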

There are a couple of papers I can recall that analyze the cost of doing science in the cloud, if you want to take a look:
- Deelman E, Singh G, Livny M, Berriman B, Good J: The cost of doing science on the cloud: the Montage example
- Wilkening J, Wilke A, Desai N, Meyer F: Using Clouds for Metagenomics: A Case Study


Enis

On Tue, Nov 23, 2010 at 3:43 PM, Maximilian Haussler <maximilianh@gmail.com> wrote:
I'd be interested in why AWS is so expensive for these datasets:

Is it mostly
a) the data transfer between nodes,
b) the data storage on EBS, or
c) the CPU time
that makes next-gen analysis so expensive on the cloud?

Can anyone who is actively using AWS look up how their total cost is distributed across these individual categories?

I guess there is a lot of room to reduce each of these costs, depending on the type of algorithm you're using.

thanks in advance
Max


On Tue, Nov 23, 2010 at 8:17 PM, David Martin <dmarti@lsuhsc.edu> wrote:
Hello,

We are about to get about 200 GB of Illumina reads (43 bp) from 20 samples, two groups of 10 animals. We are hoping to use Galaxy on the Cloud to compare gene expression between the two groups. First of all, do you think this is possible with the current state of Galaxy Cloud development?

Secondly, we are currently practicing with small Drosophila datasets (4 sets of 2 GB each), and over the course of a few days of doing relatively little besides grooming and filtering the data, we had already been charged $60 by Amazon, which we thought was a bit inefficient. What is the best way to proceed from one day to the next? Should one terminate the cluster at the Cloud Console and then stop (pause) the cluster at the AWS console, and then restart the instance the next day? Does one have to reattach all of the EBS volumes before restarting the cluster?

We were just terminating the instance and then bringing it back up, and all the data was still there, i.e. it worked fine, but when we looked after a couple of days there were 45 EBS volumes, much of which was surely redundant as our data wasn't very large. Perhaps we need to take a snapshot and reboot the instance from that?

Thank you for any hints regarding this matter; this is all very new to me. Let me know if you need clarification or more information.

David Martin
dmarti@lsuhsc.edu

_______________________________________________
galaxy-user mailing list
galaxy-user@lists.bx.psu.edu
http://lists.bx.psu.edu/listinfo/galaxy-user