drmaa and JSV

15 Apr 2013

Hi,

Our Galaxy instance runs jobs on an SGE cluster using 2 job handlers. The
SGE cluster uses a Job Submission Verifier (JSV) that rejects any job
submission that specifies a core binding strategy.

When Galaxy starts, the first job we submit works perfectly:

First job submission:

galaxy.jobs.manager DEBUG 2013-04-15 14:29:59,285 (194) Job assigned to handler 'handler0'
galaxy.jobs DEBUG 2013-04-15 14:29:59,934 (194) Working directory for job is: /scratch/nfs/galaxy.crg.es/job_working_directory/000/194
galaxy.jobs.handler DEBUG 2013-04-15 14:29:59,942 dispatching job 194 to drmaa runner
galaxy.jobs.handler INFO 2013-04-15 14:30:00,166 (194) Job dispatched
galaxy.jobs.runners.drmaa DEBUG 2013-04-15 14:30:00,468 (194) submitting file /scratch/nfs/galaxy.crg.es/ogs/galaxy_194.sh
galaxy.jobs.runners.drmaa DEBUG 2013-04-15 14:30:00,468 (194) command is: python /data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist/tools/fastq/fastq_stats.py '/data/www-bi/galaxy.crg.es/files/000/dataset_4.dat' '/data/www-bi/galaxy.crg.es/files/000/dataset_238.dat' 'sanger'
galaxy.jobs.runners.drmaa INFO 2013-04-15 14:30:01,538 (194) queued as 458816
galaxy.jobs.runners.drmaa DEBUG 2013-04-15 14:30:02,115 (194/458816) state change: job is queued and active

# qstat -cb -j 458816
==============================================================
job_number:                 458816
exec_file:                  job_scripts/458816
submission_time:            Mon Apr 15 14:30:01 2013
owner:                      www-bi
uid:                        66401
group:                      www-bi
gid:                        501
sge_o_home:                 /data/www-bi
sge_o_log_name:             www-bi
sge_o_path:                 /data/galaxy/apache/galaxy.crg.es/htdocs/scripts/galaxy-env/bin:/software/galaxy/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/data/www-bi/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist
sge_o_host:                 galaxy
account:                    sge
stderr_path_list:           NONE:galaxy:/scratch/nfs/galaxy.crg.es/job_working_directory/000/194/194.drmerr
reserve:                    y
hard resource_list:         virtual_free=12G,h_rt=21600
mail_list:                  www-bi@galaxy.crg.es
notify:                     FALSE
job_name:                   g194_fastq_stats_jtaly_crg_es
stdout_path_list:           NONE:galaxy:/scratch/nfs/galaxy.crg.es/job_working_directory/000/194/194.drmout
jobshare:                   0
hard_queue_list:            www-el6
env_list:
script_file:                /scratch/nfs/galaxy.crg.es/ogs/galaxy_194.sh
parallel environment:  smp range: 2
verify_suitable_queues:     2
binding:                    set linear:2:0,0
scheduling info:            queue instance "pr-el6@fenn.linux.crg.es" dropped because it is overloaded: np_load_avg=1.703333 (= 1.703333 + 0.50 * 0.000000 with nproc=12) >= 1.7
                            queue instance "short@node-ib0209bi.linux.crg.es" dropped because it is overloaded: np_load_avg=2.837500 (= 2.837500 + 0.50 * 0.000000 with nproc=8) >= 1.3
                            queue instance "long@node-ib0209bi.linux.crg.es" dropped because it is overloaded: np_load_avg=2.837500 (= 2.837500 + 0.50 * 0.000000 with nproc=8) >= 1.3

The core binding has been added by our JSV script, which is the expected behaviour.

But our second submission fails:

galaxy.jobs.runners.drmaa ERROR 2013-04-15 14:30:56,263 Uncaught exception queueing job
Traceback (most recent call last):
  File "/data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist/lib/galaxy/jobs/runners/drmaa.py", line 144, in run_next
    self.queue_job( obj )
  File "/data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist/lib/galaxy/jobs/runners/drmaa.py", line 232, in queue_job
    job_id = self.ds.runJob(jt)
  File "/data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist/eggs/drmaa-0.4b3-py2.6.egg/drmaa/__init__.py", line 331, in runJob
    _h.c(_w.drmaa_run_job, jid, _ct.sizeof(jid), jobTemplate)
  File "/data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist/eggs/drmaa-0.4b3-py2.6.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist/eggs/drmaa-0.4b3-py2.6.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
DeniedByDrmException: code 17: contact us: XXX@XXX.es

If we look at the parameters submitted for this second job, as dumped by the JSV:

# cat /tmp/qsub_err.txt
$VAR1 = {
           'w' => 'e',
           'N' => 'g195_fastq_stats_jtaly_crg_es',
           'binding_amount' => '2',
           'CMDNAME' => '/scratch/nfs/galaxy.crg.es/ogs/galaxy_195.sh',
           'binding_type' => 'set',
           'M' => {
                    'www-bi@galaxy.crg.es' => undef
                  },
           'binding_strategy' => 'linear',
           'l_hard' => {
                         'virtual_free' => '12G',
                         'h_rt' => '6:00:00'
                       },
           'shell' => 'n',
           'pe_min' => '2',
           'USER' => 'www-bi',
           'binding_socket' => '0',
           'e' => {
                    '/scratch/nfs/galaxy.crg.es/job_working_directory/000/195/195.drmerr' => undef
                  },
           'GROUP' => 'www-bi',
           'binding_core' => '0',
           'pe_max' => '2',
           'CMDARGS' => '0',
           'q_hard' => {
                         'www-el6' => undef
                       },
           'pe_name' => 'smp',
           'CLIENT' => 'drmaa',
           'b' => 'y',
           'R' => 'y',
           'VERSION' => '1.0',
           'CONTEXT' => 'client',
           'o' => {
                    '/scratch/nfs/galaxy.crg.es/job_working_directory/000/195/195.drmout' => undef
                  }
         };

The submission for job 195 already contains a core binding strategy (the
binding_* keys) before the JSV has processed it.

The problem is that the second job submission inherits submission
parameters from the first job and, since the JSV script does not allow the
user to specify a core binding strategy, the job is rejected.
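
To make the failure mode concrete, here is a small Python model of the
behaviour described above. It is only an illustration: our real JSV is a
separate script, the function name jsv_verify is made up for this sketch,
and the rule that the binding amount follows pe_min is an assumption.

```python
# Illustrative model only, not our actual JSV script.
BINDING_PREFIX = "binding_"


def jsv_verify(params):
    """Return (decision, params) for one submission.

    Rejects when the submission already carries binding_* keys,
    otherwise accepts it and adds a linear core binding.
    """
    if any(key.startswith(BINDING_PREFIX) for key in params):
        return "REJECT", params
    corrected = dict(params)
    corrected.update({
        "binding_type": "set",
        "binding_strategy": "linear",
        # Assumption: bind as many cores as PE slots were requested.
        "binding_amount": params.get("pe_min", "1"),
        "binding_socket": "0",
        "binding_core": "0",
    })
    return "CORRECT", corrected


first = {"N": "g194_fastq_stats_jtaly_crg_es",
         "pe_name": "smp", "pe_min": "2", "pe_max": "2"}
decision_1, after_first = jsv_verify(first)

# If the corrected parameters leak into the next submission, only the
# job-specific values change and the binding_* keys are still present.
second = dict(after_first, N="g195_fastq_stats_jtaly_crg_es")
decision_2, _ = jsv_verify(second)

print(decision_1, decision_2)  # prints: CORRECT REJECT
```

In this model the verifier rejects its own corrected output, which matches
what we see when job 195 arrives with the binding added for job 194.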

If we wait some time (600 seconds), a new submission works again...

Can anyone help us understand why the submission parameters are inherited
from one job to the next?
Maybe the DRMAA session is not properly closed, or the environment is not
cleaned between submissions?
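
One way to check this outside Galaxy would be to submit two jobs back to
back in a single DRMAA session, using a fresh job template for each. The
sketch below assumes the drmaa Python package exposes Session,
createJobTemplate, runJob, deleteJobTemplate and exit as in the egg shown
in the traceback. The helper names are made up, and we have not run this
on our cluster.

```python
def submit_with_fresh_template(session, command, args, native_spec):
    """Submit one job using a job template created just for it."""
    jt = session.createJobTemplate()
    try:
        jt.remoteCommand = command
        jt.args = list(args)
        jt.nativeSpecification = native_spec
        return session.runJob(jt)
    finally:
        # Always release the template so nothing is carried over by us.
        session.deleteJobTemplate(jt)


def reproduce(command, native_spec, count=2):
    """Submit `count` identical jobs back to back in one DRMAA session."""
    import drmaa  # imported here so the helper above has no hard dependency

    session = drmaa.Session()
    session.initialize()
    try:
        return [submit_with_fresh_template(session, command, [], native_spec)
                for _ in range(count)]
    finally:
        session.exit()
```

For example, reproduce("/bin/true", "-q www-el6 -pe smp 2") would submit
two placeholder jobs with the queue and parallel environment from the
qstat output above. If the second runJob call were denied here as well,
the inheritance would not be specific to Galaxy.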

Thank you for your help

Best

Jean-François

$hg summary
parent: 8795:9fd7fe0c5712
  merge from stable
branch: default
commit: 1 modified, 59 unknown
update: (current)

-- 
#####################################
Jean-François Taly
Bioinformatician

Bioinformatics Core Facility
http://biocore.crg.cat
CRG - Centre de Regulació Genòmica (Room 439)
Parc de Recerca Biomèdica de Barcelona (PRBB)
Doctor Aiguader, 88
08003 Barcelona
Spain

email: jean-francois.taly@crg.eu
phone: +34 93 316 0202
fax: +34 93 316 0099
#####################################
