Hi,

I'm having the following problems with our Galaxy server:

This morning one of our users complained that Galaxy was crashing a lot, either:

- The main page was coming up as an operational error page: Database Locked
- The output history item returned an error "database is locked"
- The output history item would appear to be waiting to run (grey box) but never complete

This error is intermittent.

I took a look at the manage jobs section on the admin page and found that all the jobs had a "new" status except one which had not been updated for over an hour and had a status of queued. I stopped this job from the admin page and after this the remaining jobs completed.

However this doesn't seem to have solved the problem. I've taken a look at the database and as of today only, there are 11 datasets in the dataset table with a "queued" status. Some of these are associated with histories (and marked as deleted in hda table) and jobs (with an error status). The remaining 6 are associated with a history but not marked as deleted and there is no entry in the job_to_output_dataset table to associate these datasets with a job.

I wonder if these datasets are causing the problem and galaxy is continually trying to resolve their queued status resulting in the "locked database" error.

I'd be grateful if anyone can help with this or even shed some light on how the problem might have occurred and the processes galaxy uses to initiate jobs.

Thanks
Shaun Webb

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
SHAUN WEBB wrote:
Hi, I'm having the following problems with our Galaxy server:
This morning one of our users complained that Galaxy was crashing a lot, either:

- The main page was coming up as an operational error page: Database Locked
- The output history item returned an error "database is locked"
- The output history item would appear to be waiting to run (grey box) but never complete

This error is intermittent.
Hi Shaun, It sounds like you're using SQLite, which is generally only useful for development due to the very same locking issues. Once you're going multi-user, it's best to switch to PostgreSQL or MySQL.
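To make the locking behaviour concrete, here is a minimal sketch using only Python's standard sqlite3 module. It is not Galaxy code: the `job` table, the file name, and the function name are invented for illustration. It shows that while one connection holds the write lock on a SQLite file, a second writer that is not willing to wait gets the same "database is locked" error reported above.

```python
import os
import sqlite3
import tempfile


def demonstrate_lock():
    """Return the error message a second writer gets on a locked SQLite file."""
    # Hypothetical scratch database; not a Galaxy database.
    path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")

    # isolation_level=None leaves transaction control to explicit BEGIN statements.
    # timeout=0 makes a blocked connection fail at once instead of retrying.
    writer_a = sqlite3.connect(path, timeout=0, isolation_level=None)
    writer_b = sqlite3.connect(path, timeout=0, isolation_level=None)
    try:
        writer_a.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, state TEXT)")

        # BEGIN IMMEDIATE takes the single write lock for the whole file.
        writer_a.execute("BEGIN IMMEDIATE")
        writer_a.execute("INSERT INTO job (state) VALUES ('queued')")

        try:
            # A second writer cannot get the lock while writer_a holds it.
            writer_b.execute("BEGIN IMMEDIATE")
        except sqlite3.OperationalError as exc:
            return str(exc)
        return None
    finally:
        writer_a.close()
        writer_b.close()


if __name__ == "__main__":
    print(demonstrate_lock())
```

SQLite allows only one writer per database file at a time, so any long-running write blocks every other write. PostgreSQL and MySQL lock at a finer granularity, which is why they cope better with several users and job threads writing concurrently.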
I took a look at the manage jobs section on the admin page and found that all the jobs had a "new" status except one which had not been updated for over an hour and had a status of queued. I stopped this job from the admin page and after this the remaining jobs completed.
However this doesn't seem to have solved the problem. I've taken a look at the database and as of today only, there are 11 datasets in the dataset table with a "queued" status. Some of these are associated with histories (and marked as deleted in hda table) and jobs (with an error status). The remaining 6 are associated with a history but not marked as deleted and there is no entry in the job_to_output_dataset table to associate these datasets with a job.
I wonder if these datasets are causing the problem and galaxy is continually trying to resolve their queued status resulting in the "locked database" error.
It's possible that status checking on these jobs is causing enough database activity for the locking errors.
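The check described above (queued datasets attached to a non-deleted history item but with no job output record) can be sketched as a query. This is only an illustration: the table names come from the message, but the column names and the assumption that job_to_output_dataset.dataset_id points at the hda row are guesses that should be checked against the actual schema. The demo database is a tiny in-memory stand-in, not real data.

```python
import sqlite3

# Assumed columns: dataset.state, hda.dataset_id, hda.history_id, hda.deleted,
# and job_to_output_dataset.dataset_id referring to the hda row.
ORPHAN_QUERY = """
SELECT d.id, hda.id, hda.history_id
FROM dataset d
JOIN history_dataset_association hda ON hda.dataset_id = d.id
LEFT JOIN job_to_output_dataset jtod ON jtod.dataset_id = hda.id
WHERE d.state = 'queued'
  AND hda.deleted = 0
  AND jtod.id IS NULL
ORDER BY d.id
"""


def find_orphans(conn):
    """Return (dataset id, hda id, history id) for queued datasets with no job."""
    return conn.execute(ORPHAN_QUERY).fetchall()


def build_demo_db():
    """Create a small in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dataset (id INTEGER PRIMARY KEY, state TEXT);
        CREATE TABLE history_dataset_association (
            id INTEGER PRIMARY KEY, history_id INTEGER,
            dataset_id INTEGER, deleted INTEGER);
        CREATE TABLE job_to_output_dataset (
            id INTEGER PRIMARY KEY, job_id INTEGER, dataset_id INTEGER);

        -- Dataset 1: queued, no job record (should be reported).
        -- Dataset 2: queued, has a job record. Dataset 3: finished.
        INSERT INTO dataset VALUES (1, 'queued'), (2, 'queued'), (3, 'ok');
        INSERT INTO history_dataset_association VALUES
            (10, 100, 1, 0), (11, 100, 2, 0), (12, 100, 3, 0);
        INSERT INTO job_to_output_dataset VALUES (1, 500, 11);
        """
    )
    return conn


if __name__ == "__main__":
    print(find_orphans(build_demo_db()))
```

The LEFT JOIN with `jtod.id IS NULL` is what isolates the rows that have no matching job output entry.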
I'd be grateful if anyone can help with this or even shed some light on how the problem might have occurred and the processes galaxy uses to initiate jobs.
Once queued, jobs should be tracked by the job runner. There could be a bug somewhere that's causing them to be "lost". Can you search your logs with the job id and look for any tracebacks or other indications of job failure?

--nate
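A log search of the kind suggested above could be sketched like this. The function name and the sample lines are invented for illustration and do not reflect Galaxy's actual log format; the sketch matches the job id as a standalone number and flags the first line of any Python traceback.

```python
import re


def scan_log(lines, job_id):
    """Return (line number, text) for lines mentioning job_id or starting a traceback."""
    # Match the id only as a standalone number, so job 42 does not match 421.
    id_pattern = re.compile(r"(?<!\d)%d(?!\d)" % job_id)
    hits = []
    for number, line in enumerate(lines, start=1):
        if id_pattern.search(line) or line.startswith("Traceback"):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Made-up sample lines; a real run would iterate over an open log file.
    sample = [
        "job 42 dispatched\n",
        "job 421 dispatched\n",
        "Traceback (most recent call last):\n",
        "  File \"example.py\", line 1, in <module>\n",
    ]
    for number, text in scan_log(sample, 42):
        print(number, text)
```

For a real log, pass an open file object as `lines`; the same effect can be had with grep if that is more convenient.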
Participants (2): Nate Coraor, SHAUN WEBB