Problems with Galaxy on a mapped drive
Hi all,

In my recent email I mentioned problems with our setup and mapped drives. I am running a test Galaxy on a server under a CIFS mapped drive. If I map the drive with noperms then things seem to work (submitting jobs to the cluster etc.), but that doesn't seem secure at all. Mounting with strict permissions, however, seems to cause various network-latency-related problems in Galaxy.

Specifically, while loading the converters and the history export tool, Galaxy creates a temporary XML file which it then tries to parse. I was able to resolve this by switching from tempfile.TemporaryFile to tempfile.mkstemp and adding a 1 second sleep, but it isn't very elegant. Couldn't you use a StringIO handle instead?

Later during start-up there were two errors with a similar cause: Galaxy creates a temporary folder and then immediately tries to write a tarball or zip file into it. Again, adding a 1 second sleep after creating the directory and before using it seems to work. See lib/galaxy/web/controllers/dataset.py

After that Galaxy started, but it still gives problems, like the issue reported here, which Galaxy handled badly (see patch):
http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-July/006213.html
Here again, inserting a one second sleep between writing the cluster script file and setting its permissions made it work.

If those are the only issues, they can be dealt with. But are there likely to be lots more problems of this nature later on? That is my worry. How are most people setting up mapped drives for Galaxy with a cluster?

Thanks,

Peter
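To make the two approaches concrete: a minimal sketch (not Galaxy's actual code - the XML content and file paths are invented) of the mkstemp-plus-sleep workaround versus parsing from an in-memory StringIO handle.

    # Illustrative sketch only, not Galaxy's actual code.
    import os
    import tempfile
    import time
    from io import StringIO
    from xml.etree import ElementTree

    xml_text = "<tool id='example'><command>echo hello</command></tool>"

    # Workaround described above: write with mkstemp, then sleep briefly so the
    # file is visible on the mapped drive before it is re-opened and parsed.
    fd, path = tempfile.mkstemp(suffix=".xml")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(xml_text)
        time.sleep(1)  # crude: give the CIFS/NFS caches time to catch up
        tree = ElementTree.parse(path)
    finally:
        os.remove(path)

    # Suggested alternative: keep the XML in memory and parse it from a
    # StringIO handle, never touching the network share at all.
    tree = ElementTree.parse(StringIO(xml_text))

The in-memory version is immune to mapped-drive latency because the data never has to round-trip through the filesystem.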
We had similar problems on NFS mounts to Isilon. We traced it to the default timeout for attribute caching on NFS mounts, which does not force a re-read of directory contents (and hence of file existence or size) for up to 30 seconds. We worked around it by adding noac to the mount, but this can drastically increase the network traffic to the Isilon, so there are tradeoffs to be made.

Even when you solve this, NFSv2 does not have close-to-open consistency, so it is possible for a job to complete on a node and for Galaxy to try to read the output files while the compute node is still flushing its write cache to the file.

All of these scenarios are unlikely on a busy cluster, on which job<->Galaxy interactions will likely occur far enough apart in time for the caches to clear on their own.

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jduddy@illumina.com
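For reference, disabling attribute caching corresponds to the noac NFS mount option; a minimal sketch of such a mount entry (the export path and mount point below are invented) would look something like:

    # /etc/fstab entry with NFS attribute caching disabled (noac)
    # Paths are illustrative; noac avoids stale attributes at the cost of
    # extra attribute-lookup traffic to the filer.
    isilon:/ifs/galaxy-data   /mnt/galaxy   nfs   rw,hard,intr,noac   0 0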
On Fri, Jul 29, 2011 at 5:09 PM, Duddy, John <jduddy@illumina.com> wrote:
We had similar problems on NFS mounts to Isilon. We traced it to the default timeout for attribute caching on NFS mounts, which does not force a re-read of directory contents (and hence of file existence or size) for up to 30 seconds.
We worked around it by adding noac to the mount, but this can drastically increase the network traffic to the Isilon, so there are tradeoffs to be made.
Even when you solve this, NFSv2 does not have close-to-open consistency, so it is possible for a job to complete on a node and for Galaxy to try to read the output files while the compute node is still flushing its write cache to the file.
All of these scenarios are unlikely on a busy cluster, on which job<->Galaxy interactions will likely occur far enough apart in time for the caches to clear on their own.
John Duddy
Thanks for your comments John, it's good to know others have run into similar issues. You may be right that on a real test load many of these issues would go away - but at least some of the problems I was seeing were at start-up or job submission time (and thus prior to the cluster actually running the job).

We may need to re-organise our network topology; right now there are probably too many routers/hubs/switches between the Galaxy server and the cluster and associated storage, making the mapped drive less responsive than it could be.

Regards,

Peter
Peter Cock wrote:
On Fri, Jul 29, 2011 at 5:09 PM, Duddy, John <jduddy@illumina.com> wrote:
We had similar problems on NFS mounts to Isilon. We traced it to the default timeout for attribute caching on NFS mounts, which does not force a re-read of directory contents (and hence of file existence or size) for up to 30 seconds.
We worked around it by adding noac to the mount, but this can drastically increase the network traffic to the Isilon, so there are tradeoffs to be made.
Even when you solve this, NFSv2 does not have close-to-open consistency, so it is possible for a job to complete on a node and for Galaxy to try to read the output files while the compute node is still flushing its write cache to the file.
All of these scenarios are unlikely on a busy cluster, on which job<->Galaxy interactions will likely occur far enough apart in time for the caches to clear on their own.
John Duddy
Thanks for your comments John, it's good to know others have run into similar issues.
You may be right that on a real test load many of these issues would go away - but at least some of the problems I was seeing were at start-up or job submission time (and thus prior to the cluster actually running the job).
We may need to re-organise our network topology; right now there are probably too many routers/hubs/switches between the Galaxy server and the cluster and associated storage, making the mapped drive less responsive than it could be.
We do the same as John: disable attribute caching. I've updated the wiki with this information:

http://wiki.g2.bx.psu.edu/Admin/Config/Performance/Cluster

--nate
Regards,
Peter
participants (3)
- Duddy, John
- Nate Coraor
- Peter Cock