Hi,

I recently moved my locally installed Galaxy from a local hard drive to an NFS-mounted hard drive. Now, when I run some tools, I get the following error in the log file:

Job output not returned by PBS: the output datasets were deleted while the job was running, the job was manually dequeued or there was a cluster error.

I am pretty sure the job was not manually dequeued. Any idea how this happened and how it can be fixed?

Thanks,
Luobin
You mention that you moved it to an NFS volume, but it seems you also moved to a grid configuration using PBS?

If that's the case, what you are seeing might be an issue with NFS attribute caching or write caching, which causes files created on one machine not to appear on other machines until some time later. The PBS job notifications are not affected by filesystem latency, so Galaxy can be told that a job has finished before the job's output files are visible to it.

You can test this by altering the finish_job method in lib/galaxy/jobs/runners/pbs.py to do a sleep/wait loop, waiting up to 60 seconds for the files to be readable. If that hack works, latency is your problem.

The solution is either to:
- Configure your mounts not to use attribute caching (this has a performance impact), or
- Make the hack permanent.

This happened to us on SGE, which is why I know these details ;-}

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.

From: galaxy-dev-bounces@lists.bx.psu.edu On Behalf Of Luobin Yang
Sent: Monday, October 17, 2011 10:31 AM
To: galaxy-dev@lists.bx.psu.edu
Subject: [galaxy-dev] What's causing this error?
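For illustration, the sleep/wait loop suggested above for finish_job might look roughly like the following. This is only a sketch, not the actual Galaxy code: the helper name wait_for_readable and its parameters are hypothetical, it assumes Python 3 (for time.monotonic), and which paths to pass in (presumably the job's output and error files) depends on your version of pbs.py.

```python
import time


def wait_for_readable(paths, timeout=60.0, poll_interval=1.0):
    """Poll until every path can be opened for reading.

    Hypothetical helper, not part of Galaxy. Returns True as soon as all
    files are readable, or False if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    pending = list(paths)
    while True:
        still_missing = []
        for path in pending:
            try:
                # Opening the file is the most direct readability check.
                with open(path, "rb"):
                    pass
            except OSError:
                # Not visible (or not readable) yet from this machine.
                still_missing.append(path)
        pending = still_missing
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

Calling something like this at the top of finish_job, before the output files are read, and logging when it returns False should show whether the files simply arrive late. For the other option, the NFS client mount option noac disables attribute caching, at the performance cost mentioned above.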
Well, I didn't move it to a grid configuration using PBS recently; it was already using a grid configuration with PBS before I moved it to the NFS volume, and no such error ever happened :) So it must be related to NFS.

Thanks for your great suggestions! I will try them out and see how they work on my system.

Thanks,
Luobin