Possible Bug in SGE Runner when recovering jobs
Hi,

Not sure if this is a bug or my configuration problem, but:

1. universe_wsgi.ini has:

    track_jobs_in_database = True
    enable_job_recovery = True

2. A tool (which uses the SGE runner) is in 'queued' state:

    psql# select id,tool_id,state from job where state = 'queued';
     id |     tool_id     | state
    ----+-----------------+--------
     39 | Show beginning1 | queued
    (1 row)

3. Starting galaxy with "sh run.sh" gives:

    Traceback (most recent call last):
      File "/bh/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/web/buildapp.py", line 61, in app_factory
        app = UniverseApplication( global_conf = global_conf, **kwargs )
      File "/bh/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/app.py", line 64, in __init__
        self.job_manager = jobs.JobManager( self )
      File "/bh/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/jobs/__init__.py", line 35, in __init__
        self.job_queue = JobQueue( app, self.dispatcher )
      File "/bh/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/jobs/__init__.py", line 112, in __init__
        self.__check_jobs_at_startup()
      File "/bh/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/jobs/__init__.py", line 137, in __check_jobs_at_startup
        self.dispatcher.recover( job, job_wrapper )
      File "/bh/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/jobs/__init__.py", line 661, in recover
        self.job_runners[runner_name].recover( job, job_wrapper )
      File "/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/jobs/runners/sge.py", line 358, in recover
        sge_job_state.old_state = DRMAA.Session.QUEUED
    AttributeError: type object 'Session' has no attribute 'QUEUED'

-------------

The immediate cause is that the DRMAA Session class does not have a "QUEUED" state. The closest thing is the "QUEUED_ACTIVE" state, as evident from both sge.py lines 18-28 and DRMAA.py line 363.

On a related note, I'm not quite sure how job recovery works with SGE. Let's say a job is queued with QSUB, then Galaxy is stopped. The database still says "queued", but in the meantime the SGE cluster might run the job, and even complete it. So the job can in fact be 'running' or 'ok' or 'error', but Galaxy thinks it is queued.

What will happen when Galaxy is restarted? Is the job automatically restarted, or is the new state queried from SGE?

Thanks,
Gordon.
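
P.S. To make the two points above concrete, here is a minimal, self-contained sketch. It does not use the real DRMAA module: FakeSession is a hypothetical stand-in whose attribute names follow the states mentioned above and whose values are placeholders. The helper names (old_state_for, reconcile) and both mapping tables are assumptions made for illustration, not Galaxy's code. The first helper shows a lookup that uses QUEUED_ACTIVE instead of the missing QUEUED attribute. The second shows one way a state queried from the scheduler could be translated back into a Galaxy state on restart; whether Galaxy actually does this is exactly the open question.

```python
class FakeSession:
    """Hypothetical stand-in for DRMAA.Session; values are placeholders."""
    QUEUED_ACTIVE = "queued_active"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


# Assumed mapping from the state stored in Galaxy's job table to the
# name of the matching attribute on the session class.
DB_STATE_TO_ATTR = {"queued": "QUEUED_ACTIVE", "running": "RUNNING"}


def old_state_for(session_cls, db_state):
    """Look up the scheduler state to assume for a recovered job."""
    return getattr(session_cls, DB_STATE_TO_ATTR[db_state])


def reconcile(session_cls, scheduler_state):
    """Translate a freshly queried scheduler state into a Galaxy state."""
    table = {
        session_cls.QUEUED_ACTIVE: "queued",
        session_cls.RUNNING: "running",
        session_cls.DONE: "ok",
        session_cls.FAILED: "error",
    }
    return table[scheduler_state]


if __name__ == "__main__":
    # The stand-in, like the class in the traceback, has no QUEUED attribute.
    print(hasattr(FakeSession, "QUEUED"))
    print(old_state_for(FakeSession, "queued"))
    print(reconcile(FakeSession, FakeSession.DONE))
```

With a mapping like this, a job that finished while Galaxy was down would be picked up as 'ok' or 'error' rather than left as 'queued', provided the scheduler still remembers the job.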