Possible Bug in SGE Runner when recovering jobs

14 Aug 2009

Hi,

I'm not sure whether this is a bug or a problem with my configuration, but:

1. universe_wsgi.ini has:
   track_jobs_in_database = True
   enable_job_recovery = True

2. A job for a tool that uses the SGE runner is in the 'queued' state:

psql# select id,tool_id,state from job where state = 'queued';
 id |     tool_id     | state  
----+-----------------+--------
 39 | Show beginning1 | queued
(1 row)

3. Starting Galaxy with "sh run.sh" gives:

Traceback (most recent call last):
  File "/bh/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/web/buildapp.py", line 61, in app_factory
    app = UniverseApplication( global_conf = global_conf, **kwargs )
  File "/bh/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/app.py", line 64, in __init__
    self.job_manager = jobs.JobManager( self )
  File "/bh/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/jobs/__init__.py", line 35, in __init__
    self.job_queue = JobQueue( app, self.dispatcher )
  File "/bh/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/jobs/__init__.py", line 112, in __init__
    self.__check_jobs_at_startup()
  File "/bh/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/jobs/__init__.py", line 137, in __check_jobs_at_startup
    self.dispatcher.recover( job, job_wrapper )
  File "/bh/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/jobs/__init__.py", line 661, in recover
    self.job_runners[runner_name].recover( job, job_wrapper )
  File "/data/hannon/gordon/galaxy/galaxy_bh_dev/lib/galaxy/jobs/runners/sge.py", line 358, in recover
    sge_job_state.old_state = DRMAA.Session.QUEUED
AttributeError: type object 'Session' has no attribute 'QUEUED'
-------------
The immediate cause is that the DRMAA Session class does not have a "QUEUED" state.
The closest match is the "QUEUED_ACTIVE" state, as can be seen in both sge.py (lines 18-28) and DRMAA.py (line 363).
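
To illustrate what I mean, here is a minimal sketch of the failing assignment with QUEUED_ACTIVE substituted. It does not import the real DRMAA module: the Session class below is a stand-in whose values are placeholders, and SGEJobState and recover_queued are hypothetical names I made up for the example, not the actual sge.py code. I have not verified that this is the right fix for the runner.

```python
class Session(object):
    """Stand-in for DRMAA.Session; only the attribute names matter here."""
    QUEUED_ACTIVE = "queued_active"  # placeholder value, not the real constant
    RUNNING = "running"              # placeholder value, not the real constant


class SGEJobState(object):
    """Hypothetical minimal holder for the field that recover() assigns."""
    old_state = None


def recover_queued(sge_job_state, session_cls=Session):
    # Session.QUEUED does not exist, so use QUEUED_ACTIVE instead.
    sge_job_state.old_state = session_cls.QUEUED_ACTIVE
    return sge_job_state


state = recover_queued(SGEJobState())
has_queued = hasattr(Session, "QUEUED")  # False for this stand-in
```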

On a related note, I'm not quite sure how job recovery works with SGE.
Say a job is submitted with qsub and Galaxy is then stopped.
The database still says "queued", but in the meantime the SGE cluster might run the job, and even complete it.
So the job may in fact be 'running', 'ok' or 'error', while Galaxy still thinks it is queued.
What will happen when Galaxy is restarted?
Is the job automatically restarted, or is the new state queried from SGE?
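
To make the second option concrete, this is roughly what I would expect "query the new state from SGE" to look like. It is only a sketch under my own assumptions, not a description of what Galaxy does: the state mapping, the reconcile function and the query_drm_state callback are all hypothetical, and a plain dictionary stands in for the cluster.

```python
# Hypothetical mapping from DRMAA-style state names to Galaxy-style job states.
DRM_TO_GALAXY = {
    "QUEUED_ACTIVE": "queued",
    "RUNNING": "running",
    "DONE": "ok",
    "FAILED": "error",
}


def reconcile(db_state, query_drm_state, job_id):
    """Return the state to resume with after a restart.

    query_drm_state is a caller-supplied function (an assumption of this
    sketch) that returns a DRMAA-style state name for job_id, or raises
    LookupError if the scheduler no longer knows the job.
    """
    try:
        drm_state = query_drm_state(job_id)
    except LookupError:
        # Nothing to learn from the scheduler; keep what the database says.
        return db_state
    # Unknown state names also fall back to the database state.
    return DRM_TO_GALAXY.get(drm_state, db_state)


# A dict stands in for the cluster: job "39" finished while Galaxy was down.
fake_cluster = {"39": "DONE"}
resumed = reconcile("queued", fake_cluster.__getitem__, "39")
```

With this approach the job from step 2 would come back as 'ok' rather than being submitted a second time.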

Thanks,
  Gordon.

Assaf Gordon
