You mention that you moved it to an NFS volume – but it seems you also moved to a grid configuration using PBS?
If that’s the case, what you are seeing might be an issue with NFS attribute caching or write caching, which causes files created on one machine to not appear
on other machines until some time later. The PBS job notifications are not affected by filesystem latency, so Galaxy can be told a job has finished before its output files are visible on the Galaxy host.
You can prove this by experiment if you alter the finish_job method in lib/galaxy/jobs/runners/pbs.py to do a sleep/wait loop, waiting up to 60 seconds for
the files to be readable. If that hack works, latency is your problem.
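As a rough sketch of the kind of wait loop I mean (this is not Galaxy's actual code; the helper name wait_for_readable and its parameters are made up for illustration), something along these lines could be called from finish_job before the output files are read:

```python
import os
import time


def _is_readable(path):
    """Return True if `path` can be opened for reading right now."""
    try:
        with open(path, "rb"):
            return True
    except (IOError, OSError):
        return False


def wait_for_readable(paths, timeout=60.0, interval=1.0):
    """Poll until every file in `paths` is readable, or give up.

    Hypothetical helper, not part of Galaxy. Returns True if all files
    became readable within `timeout` seconds, False otherwise.
    """
    deadline = time.time() + timeout
    while True:
        missing = [p for p in paths if not _is_readable(p)]
        if not missing:
            return True
        if time.time() >= deadline:
            return False
        # Give the NFS client time to pick up the new files.
        time.sleep(interval)
```

In finish_job you would pass whatever paths the runner holds for the job's output and error files, and only report the "Job output not returned" error if this returns False.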
The solution is either to:
- Configure your mounts not to use attribute caching (has performance impacts), or
- Make the hack permanent.
This happened to us on SGE, which is why I know these details ;-}
John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jduddy@illumina.com
From: galaxy-dev-bounces@lists.bx.psu.edu [mailto:galaxy-dev-bounces@lists.bx.psu.edu]
On Behalf Of Luobin Yang
Sent: Monday, October 17, 2011 10:31 AM
To: galaxy-dev@lists.bx.psu.edu
Subject: [galaxy-dev] What's causing this error?
Hi,
Recently I moved my locally installed Galaxy from a local hard drive to an NFS-mounted hard drive. When I run some tools, I got the following error from the log file:
Job output not returned by PBS: the output datasets were deleted while the job was running, the job was manually dequeued or there was a cluster error.
I am pretty sure the job was not manually dequeued. Any idea how this happened and how this can be fixed?
Thanks,
Luobin