Best practices with data on clusters
Hi developers,
I have a question that may be OT, but since Galaxy can work in a clustered environment with a queueing system, I'll try to ask here. Is there anybody here who copies data to a local temporary directory before performing any analysis step and then copies the output back to the "final results"?
Thanks
d
On Dec 20, 2011, at 5:04 AM, Cittaro Davide wrote:
Hi developers, I have a question that may be OT, but since Galaxy can work in a clustered environment with a queueing system, I'll try to ask here. Is there anybody here who copies data to a local temporary directory before performing any analysis step and then copies the output back to the "final results"?
Hi Davide,

We did this for a while when we had a poorly performing fileserver. It can reduce load in that environment, but in cases where you are only going to read small portions of the input files, you'll probably see longer execution times, since the whole file has to be copied first. The same applies if you'll simply be writing the output(s) in one big stream, since you then have to write it once locally and then again over the network.

That said, if you have a lot of interim steps that produce large data that then get merged via some process back into final outputs, it absolutely makes sense to use local disk for those steps (assuming local disk is large enough, which is another problem that we sometimes encounter).

--nate
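The copy-in, compute, copy-back pattern discussed above can be sketched roughly as follows. This is only an illustration and not something Galaxy provides: `run_staged`, `analyze`, and `scratch_root` are hypothetical names, the outputs are assumed to be plain files, and the free-space check only covers the size of the inputs, so it is a lower bound.

```python
import shutil
import tempfile
from pathlib import Path


def run_staged(inputs, final_dir, analyze, scratch_root=None):
    """Copy inputs to local scratch, run analyze() there, copy results back.

    `analyze(local_inputs, out_dir)` is a hypothetical callable that reads the
    local copies and writes its result files (flat, no subdirectories) into
    `out_dir`. `scratch_root` is the node-local disk; None means the system
    default temporary directory.
    """
    final_dir = Path(final_dir)
    final_dir.mkdir(parents=True, exist_ok=True)
    needed = sum(Path(p).stat().st_size for p in inputs)

    with tempfile.TemporaryDirectory(dir=scratch_root) as tmp:
        work = Path(tmp)

        # Refuse to start if local disk cannot even hold the inputs.
        free = shutil.disk_usage(work).free
        if free < needed:
            raise OSError(f"local scratch too small: need {needed}, have {free}")

        # Stage in: one sequential read of each input from shared storage.
        local_inputs = [Path(shutil.copy2(p, work)) for p in inputs]

        out_dir = work / "out"
        out_dir.mkdir()
        analyze(local_inputs, out_dir)

        # Stage out: copy every result file back to the shared "final" location.
        results = [Path(shutil.copy2(f, final_dir)) for f in sorted(out_dir.iterdir())]

    # The scratch directory is removed when the with-block exits.
    return results
```

Whether this helps depends on the access pattern described above: it pays off when `analyze` does heavy or repeated I/O, and it adds cost when the job reads only a small part of each input or writes its output in a single stream.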
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
Hi Nate,

On Jan 3, 2012, at 10:15 PM, Nate Coraor wrote:
That said, if you have a lot of interim steps that produce large data that then get merged via some process back to final outputs, it absolutely makes sense to use local disk for those steps (assuming local disk is large enough - another problem that we sometimes encounter).

Wouldn't that mean that most of the workflows dealing with NGS data should run on local disks?

d

/*
Davide Cittaro, PhD
Head of Bioinformatics Core
Center for Translational Genomics and Bioinformatics
San Raffaele Scientific Institute
Via Olgettina 58
20132 Milano
Italy
Office: +39 02 26439140
Mail: cittaro.davide@hsr.it
Skype: daweonline
*/
On Jan 4, 2012, at 6:03 AM, Cittaro Davide wrote:
Hi Nate,
Wouldn't that mean that most of the workflows dealing with NGS data should run on local disks?
It depends on the location and ordering of the steps. If you're parallelizing single steps across multiple nodes, it wouldn't make sense. If you run multiple steps serially on a single node, then you could work locally between those steps.

--nate
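For the serial case, a minimal sketch of working locally between steps might look like this. It assumes every step runs on the same node and that each step is a hypothetical callable `step(src, dst)` that reads one file and writes one file; `run_serial_locally` and `scratch_root` are likewise made-up names, not part of Galaxy.

```python
import shutil
import tempfile
from pathlib import Path


def run_serial_locally(shared_input, shared_output, steps, scratch_root=None):
    """Chain steps on one node, keeping all intermediate files on local disk.

    Shared storage is touched twice only: once to read the input and once to
    write the final output.
    """
    shared_output = Path(shared_output)

    with tempfile.TemporaryDirectory(dir=scratch_root) as tmp:
        work = Path(tmp)

        # Single read from shared storage.
        current = Path(shutil.copy2(shared_input, work / "step0"))

        for i, step in enumerate(steps, start=1):
            nxt = work / f"step{i}"
            step(current, nxt)   # hypothetical step: reads `current`, writes `nxt`
            current.unlink()     # free local disk once the intermediate is consumed
            current = nxt

        # Single write back to shared storage.
        shared_output.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(current, shared_output)

    return shared_output
```

This does not apply when a single step is split across several nodes, because each node would then need its own copy of the data and the intermediates would not be visible to the others.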
participants (2): Cittaro Davide, Nate Coraor