On Fri, Jul 29, 2011 at 5:09 PM, Duddy, John <jduddy@illumina.com> wrote:
We had similar problems on NFS mounts to Isilon. We traced it to the default timeout for attribute caching on NFS mounts, which does not force a re-read of directory contents (hence file existence or size) for up to 30 seconds.
We worked around it by adding noac to the mount options, but this can drastically increase the network traffic to the Isilon, so there are tradeoffs to be made.
Even when you solve this, NFSv2 does not have open-close write consistency, so it is possible for a job to complete on a node and for Galaxy to try to read the output files while the compute node is still flushing its write cache to the file.
All of these scenarios are unlikely on a busy cluster, on which job<->Galaxy interactions will likely occur far enough apart in time for the caches to clear on their own.
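For cases where the caches do not clear in time, one client-side mitigation is to poll an output file until it exists and its size stops changing before reading it. The following is only a minimal sketch of that idea, not something Galaxy is known to do: the function name wait_for_stable_file and its parameters are hypothetical, and a stable size does not prove that the writer has finished flushing.

```python
import os
import time


def wait_for_stable_file(path, timeout=60.0, interval=1.0, stable_checks=3):
    """Poll `path` until its size matches the previous poll
    `stable_checks` times in a row. Return True on success,
    False if `timeout` seconds pass first.

    Hypothetical helper: on an NFS mount, os.stat() may itself return
    cached attributes, so `timeout` should comfortably exceed the
    attribute cache window (about 30 seconds in the case described above).
    """
    deadline = time.monotonic() + timeout
    last_size = None
    unchanged = 0
    while True:
        try:
            size = os.stat(path).st_size
        except FileNotFoundError:
            size = None  # not visible yet, possibly a stale directory cache
        if size is not None and size == last_size:
            unchanged += 1
            if unchanged >= stable_checks:
                return True
        else:
            unchanged = 0  # file missing or still growing; start over
        last_size = size
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would do something like wait_for_stable_file("/path/to/output.dat", timeout=90) before opening the file. This trades a short delay on every job for less dependence on noac and the extra network traffic it brings.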
John Duddy
Thanks for your comments, John. It's good to know others have run into similar issues. You may be right that on a real test load many of these issues would go away, but at least some of the problems I was seeing were at start-up or job submission time (and thus before the cluster actually ran the job).

We may need to re-organise our network topology. Right now there are probably too many routers/hubs/switches between the Galaxy server and the cluster and associated storage, making the mapped drive less responsive than it could be.

Regards,
Peter