Galaxy on Ubuntu 14.04: hangs on metadata cleanup
Dear all,

Has anyone tried running Galaxy on Ubuntu 14.04?

I'm trying a test setup on two virtual machines (worker + master) with a SLURM queue. I'm running into strange problems when jobs finish: the master hangs, completely unresponsive, with CPU at 100% (as reported by virt-manager, not by top). Only DRMAA jobs seem to be affected. After the hang, a reboot shows the job as finished (and green in the history).

It took some debugging to figure out where things go wrong, but it seems to happen when os.remove is called in lib/galaxy/datatypes/metadata.py, in the cleanup_external_metadata method. I can reproduce the problem by calling os.remove(metadatafile) by hand in an interactive Python shell, using pdb to set a breakpoint just before the call. If I comment out the os.remove, it runs on until it hits another delete call in lib/galaxy/jobs/__init__.py, in the cleanup() method of the JobWrapper class:

    self.app.object_store.delete(self.get_job(), base_dir='job_work', entire_dir=True, dir_only=True, extra_dir=str(self.job_id))

I should mention that my Galaxy version is a bit old, since I'm running my own fork with local modifications to datatypes.

This object_store.delete also ends up in shutil.rmtree and os.remove calls. So remove calls to the filesystem seem to hang the whole thing, but only at this point in time. After a reboot, removing the files by hand is no problem, and stepping through with pdb also sometimes fixes it (though if I just press continue, it hangs). I don't know where to go from here with debugging, but has anyone seen anything similar? Right now it feels like it may be caused by timing rather than an actual code problem.

cheers,
— Jorrit Boekel
Proteomics systems developer
BILS / Lehtiö lab
Scilifelab Stockholm, Sweden
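PS: in case it helps anyone reproduce this, the breakpoint I used looks roughly like the sketch below. The variable name is as I remember it, so treat it as approximate rather than a quote of the Galaxy source.

    # lib/galaxy/datatypes/metadata.py, inside cleanup_external_metadata,
    # immediately before the os.remove call:
    import pdb; pdb.set_trace()

    # Then, at the (Pdb) prompt in the terminal running Galaxy, once a DRMAA
    # job finishes and the breakpoint is hit:
    #   (Pdb) metadatafile              # path of the external metadata file
    #   (Pdb) import os
    #   (Pdb) os.remove(metadatafile)   # this is the call that freezes the master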
I should probably mention that the data filesystem is NFS, exported by the master from /mnt/galaxy/data and mounted on the worker. There is no separate fileserver, and the master is the one that hangs.

cheers,
— Jorrit Boekel
Proteomics systems developer
BILS / Lehtiö lab
Scilifelab Stockholm, Sweden
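PS: while poking at this, I reduced the failing calls to something like the standalone check below. It's my own sketch, not Galaxy code; the file-name prefixes are made up, and it only mimics the pattern of removals that locks up the master, so it may well run cleanly on an otherwise idle system.

    #!/usr/bin/env python
    # Standalone check: do plain unlink/rmtree calls under the NFS-exported
    # data directory hang outside of Galaxy?
    import os
    import shutil
    import tempfile

    DATA_DIR = '/mnt/galaxy/data'  # exported by the master, mounted on the worker

    # Create a throwaway file and directory, the same kinds of things that
    # cleanup_external_metadata and object_store.delete end up removing.
    fd, test_file = tempfile.mkstemp(prefix='metadata_test_', dir=DATA_DIR)
    os.close(fd)
    test_dir = tempfile.mkdtemp(prefix='job_work_test_', dir=DATA_DIR)

    print('removing %s ...' % test_file)
    os.remove(test_file)      # the call that hangs in cleanup_external_metadata
    print('removing %s ...' % test_dir)
    shutil.rmtree(test_dir)   # object_store.delete eventually ends up here
    print('done')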
It seems to be an NFS-related issue. When I run a separate VM as an NFS server hosting the Galaxy data (files, job working directories, tmp, FTP), the problems are gone. There's probably an explanation for that, but I'm going to leave it at this.

cheers,
— Jorrit Boekel
Proteomics systems developer
BILS / Lehtiö lab
Scilifelab Stockholm, Sweden
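PS: for completeness, the working layout is roughly the one below. The hostname, export path and options are placeholders for my actual setup, so adjust as needed; the point is only that the NFS server is no longer the Galaxy master itself.

    # /etc/exports on the dedicated NFS VM ("nfsserver" and the path are placeholders):
    /export/galaxy  master(rw,sync,no_subtree_check)  worker(rw,sync,no_subtree_check)

    # /etc/fstab on both the Galaxy master and the worker:
    nfsserver:/export/galaxy  /mnt/galaxy/data  nfs  defaults  0  0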