Cluster setup - shared temporary directory
Hi all,
I'm reading http://wiki.g2.bx.psu.edu/Admin/Config/Performance/Cluster
Could someone expand a little on this section please:
Create a shared temporary directory
Some tools make use of temporary files created on the server, but accessed on the nodes. For this, you'll need to make a directory (galaxy_dist/database/tmp by default) ...
I presume this is talking about the universe_wsgi.ini setting new_file_path = database/tmp (if so, could that be explicit)?
I would like to know more about this from the tool author point of view. Could you at least give one example of a tool that uses this temporary folder? As a tool author I am unclear what the purpose is (and it would be a shock if I accidentally use this mapped folder instead of the local temp drive of a node).
Thanks,
Peter
I can give you a very good example - if you are doing alignment and for some reason need to convert the input file before operating on them, such that you need a complete copy, /tmp may not have enough room. I have had this happen to me running lots of instances of an aligner, temporarily using 100G+ of temp space.
I don't see the need to have a "shared" temp space, but I do see the need to be able to tell the tools where you want them to put temp files.
John Duddy
___________________________________________________________
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
On Tue, Jul 26, 2011 at 5:16 PM, Duddy, John <jduddy@illumina.com> wrote:
I can give you a very good example - if you are doing alignment and for some reason need to convert the input file before operating on them, such that you need a complete copy, /tmp may not have enough room. I have had this happen to me running lots of instances of an aligner, temporarily using 100G+ of temp space.
I don't see the need to have a "shared" temp space, but I do see the need to be able to tell the tools where you want them to put temp files.
So in your setup, the cluster nodes are not likely to have 100G+ on /tmp (i.e. the local hard drive of the node), so you want them to use a temp folder on the cluster shared storage?
I think needs will differ between tools - in some cases you really want a fast local drive for temp files, and putting them on a network drive will just kill performance. Using /tmp seems a safe default.
Is there any guidance for tool authors on where to put temp files, and how to access any related Galaxy settings? There is nothing currently listed here: http://wiki.g2.bx.psu.edu/Admin/Tools/Tool%20Config%20Syntax
Peter
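The trade-off discussed above (fast local /tmp versus roomier shared storage) could be handled per tool by checking free space first. This is only a sketch: `pick_temp_dir` and its parameters are hypothetical names, Galaxy provides nothing like it, and the tool author would have to estimate the space needed.

```python
import os
import shutil
import tempfile


def pick_temp_dir(required_bytes, local_dir=None, fallback_dir=None):
    """Return local_dir if it has enough free space, else fallback_dir.

    Hypothetical helper for illustration; not part of Galaxy.
    """
    if local_dir is None:
        # Python's default temp location (often /tmp on a Linux node).
        local_dir = tempfile.gettempdir()
    free_bytes = shutil.disk_usage(local_dir).free
    if free_bytes >= required_bytes:
        return local_dir
    # Local disk is too small: fall back to e.g. a shared directory.
    if fallback_dir is not None and os.path.isdir(fallback_dir):
        return fallback_dir
    raise RuntimeError("no temp directory with %d bytes free" % required_bytes)
```

A tool expecting roughly 100G of scratch data might call `pick_temp_dir(100 * 1024**3, fallback_dir=shared_tmp)`, where `shared_tmp` is whatever shared path the admin has configured.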
I benchmarked the MrBayes 3.1.2 program on my cluster for two cases:
1. use local /tmp for temporary files
2. use the network shared /home/galaxy/galaxy-dist/database/tmp
MrBayes is about 10 times slower in case 2 than in case 1. What I did was to set the network shared folder as the default, but in the MrBayes wrapper I changed the environment variable TEMP to point to a local folder.
Luobin
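The wrapper-level override Luobin describes might look roughly like the sketch below. The function name is made up, and setting TMPDIR and TMP alongside TEMP is an assumption on my part; whether a given program honours any of these variables depends on the program.

```python
import os
import subprocess


def run_with_local_temp(command, local_tmp="/tmp"):
    """Run command with the usual temp-dir variables set to local_tmp.

    Returns the exit code of the child process.
    """
    env = dict(os.environ)  # copy, so the wrapper's own environment is untouched
    for name in ("TMPDIR", "TEMP", "TMP"):
        env[name] = local_tmp
    return subprocess.call(command, env=env)
```

A wrapper script would then call something like `run_with_local_temp(["mb", "analysis.nex"])`, where the command line is purely illustrative.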
On Tue, Jul 26, 2011 at 8:16 PM, Luobin Yang <yangluob@isu.edu> wrote:
I benchmarked MrBayes 3.1.2 program on my cluster for two cases: 1. use local /tmp for temporary files 2. use the network shared /home/galaxy/galaxy-dist/database/tmp MrBayes is about 10 times slower for case 2 than for case 1. What I did was to set the network shared folder as the default but in the MrBayes wrapper, I changed the environment variable TEMP to be a local folder. Luobin
Does that mean Galaxy will configure the TEMP environment variable for tools to point at the universe_wsgi.ini new_file_path setting? In your case, /home/galaxy/galaxy-dist/database/tmp
This is the kind of thing I think should be documented somewhere for tool authors.
Thanks,
Peter
On Wed, Jul 27, 2011 at 9:06 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
This is the kind of thing I think should be documented somewhere for tool authors ...
... and mentioned in the universe_wsgi.ini text for the new_file_path setting.
I've just seen Shantanu Pavgi's thread which also covers this issue, and the related TMP and TMPDIR environment variables set up or overridden by SGE. I think we agree that Galaxy needs some documentation and guidance in this area for tool authors.
Peter
Peter Cock wrote:
This is the kind of thing I think should be documented somewhere for tool authors ...
... and mentioned in the universe_wsgi.ini text for the new_file_path setting.
I've just seen Shantanu Pavgi's thread which also covers this issue, and the related TMP and TMPDIR environment variables set up or overridden by SGE. I think we agree that Galaxy needs some documentation and guidance in this area for tool authors.
Galaxy doesn't modify $TEMP, but in the past, setting $TEMP would cause Galaxy to use that directory for all files created by Python's tempfile module. Since there was confusion over the difference between $TEMP (which was documented as the way to control the creation of temp files) and new_file_path, Galaxy was changed to put all temp files in the location of new_file_path, ignoring $TEMP.
However, this only applies to the framework. Tools run in a separate process and are not forced to use new_file_path. If you want your tool to use it, you could pass the value of new_file_path to your tool. Otherwise, the rules of whatever you use to create temporary files determine where they go; in Python, it's this: http://docs.python.org/library/tempfile.html#tempfile.tempdir
Our cluster nodes have the majority of their local disk partitioned as /space, so we set TEMP=/space in the environment used on the nodes.
--nate
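For a Python tool script, the two options Nate mentions could be sketched as follows. The `--tmp-dir` option and both function names are invented for illustration and are not a Galaxy convention; how the value of new_file_path reaches the command line is left open here. When no directory is passed, Python's tempfile module falls back to its own search: the TMPDIR, TEMP and TMP environment variables in that order, then platform defaults such as /tmp.

```python
import argparse
import tempfile


def make_work_dir(tmp_root=None):
    """Create a private scratch directory under tmp_root.

    With tmp_root=None, tempfile applies its default search
    (TMPDIR, TEMP, TMP, then platform defaults).
    """
    return tempfile.mkdtemp(prefix="tool_", dir=tmp_root)


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical tool script")
    # --tmp-dir is an invented option name, not something Galaxy defines.
    parser.add_argument("--tmp-dir", default=None,
                        help="directory to create temporary files in")
    return parser.parse_args(argv)
```

Calling `make_work_dir(parse_args().tmp_dir)` at the start of the script would then honour an explicit directory when one is given, and otherwise pick up a node-level setting such as TEMP=/space.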
participants (4)
- Duddy, John
- Luobin Yang
- Nate Coraor
- Peter Cock