Hi,

Our Galaxy instance runs jobs on an SGE cluster using two job handlers. The SGE cluster uses a Job Submission Verifier (JSV) that rejects any job submission that specifies a core binding strategy.

When Galaxy starts, the first job we submit works perfectly:

First job submission:

galaxy.jobs.manager DEBUG 2013-04-15 14:29:59,285 (194) Job assigned to handler 'handler0'
galaxy.jobs DEBUG 2013-04-15 14:29:59,934 (194) Working directory for job is: /scratch/nfs/galaxy.crg.es/job_working_directory/000/194
galaxy.jobs.handler DEBUG 2013-04-15 14:29:59,942 dispatching job 194 to drmaa runner
galaxy.jobs.handler INFO 2013-04-15 14:30:00,166 (194) Job dispatched
galaxy.jobs.runners.drmaa DEBUG 2013-04-15 14:30:00,468 (194) submitting file /scratch/nfs/galaxy.crg.es/ogs/galaxy_194.sh
galaxy.jobs.runners.drmaa DEBUG 2013-04-15 14:30:00,468 (194) command is: python /data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist/tools/fastq/fastq_stats.py '/data/www-bi/galaxy.crg.es/files/000/dataset_4.dat' '/data/www-bi/galaxy.crg.es/files/000/dataset_238.dat' 'sanger'
galaxy.jobs.runners.drmaa INFO 2013-04-15 14:30:01,538 (194) queued as 458816
galaxy.jobs.runners.drmaa DEBUG 2013-04-15 14:30:02,115 (194/458816) state change: job is queued and active

# qstat -cb -j 458816
==============================================================
job_number:                 458816
exec_file:                  job_scripts/458816
submission_time:            Mon Apr 15 14:30:01 2013
owner:                      www-bi
uid:                        66401
group:                      www-bi
gid:                        501
sge_o_home:                 /data/www-bi
sge_o_log_name:             www-bi
sge_o_path:                 /data/galaxy/apache/galaxy.crg.es/htdocs/scripts/galaxy-env/bin:/software/galaxy/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/data/www-bi/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist
sge_o_host:                 galaxy
account:                    sge
stderr_path_list:           NONE:galaxy:/scratch/nfs/galaxy.crg.es/job_working_directory/000/194/194.drmerr
reserve:                    y
hard resource_list:         virtual_free=12G,h_rt=21600
mail_list:                  www-bi@galaxy.crg.es
notify:                     FALSE
job_name:                   g194_fastq_stats_jtaly_crg_es
stdout_path_list:           NONE:galaxy:/scratch/nfs/galaxy.crg.es/job_working_directory/000/194/194.drmout
jobshare:                   0
hard_queue_list:            www-el6
env_list:
script_file:                /scratch/nfs/galaxy.crg.es/ogs/galaxy_194.sh
parallel environment:       smp range: 2
verify_suitable_queues:     2
binding:                    set linear:2:0,0
scheduling info:            queue instance "pr-el6@fenn.linux.crg.es" dropped because it is overloaded: np_load_avg=1.703333 (= 1.703333 + 0.50 * 0.000000 with nproc=12) >= 1.7
                            queue instance "short@node-ib0209bi.linux.crg.es" dropped because it is overloaded: np_load_avg=2.837500 (= 2.837500 + 0.50 * 0.000000 with nproc=8) >= 1.3
                            queue instance "long@node-ib0209bi.linux.crg.es" dropped because it is overloaded: np_load_avg=2.837500 (= 2.837500 + 0.50 * 0.000000 with nproc=8) >= 1.3

The core binding was added by our JSV script, which is correct.

But our second submission fails:

galaxy.jobs.runners.drmaa ERROR 2013-04-15 14:30:56,263 Uncaught exception queueing job
Traceback (most recent call last):
  File "/data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist/lib/galaxy/jobs/runners/drmaa.py", line 144, in run_next
    self.queue_job( obj )
  File "/data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist/lib/galaxy/jobs/runners/drmaa.py", line 232, in queue_job
    job_id = self.ds.runJob(jt)
  File "/data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist/eggs/drmaa-0.4b3-py2.6.egg/drmaa/__init__.py", line 331, in runJob
    _h.c(_w.drmaa_run_job, jid, _ct.sizeof(jid), jobTemplate)
  File "/data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist/eggs/drmaa-0.4b3-py2.6.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist/eggs/drmaa-0.4b3-py2.6.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
DeniedByDrmException: code 17: contact us: XXX@XXX.es

If we look at the submitted parameters as seen by the JSV:

# cat /tmp/qsub_err.txt
$VAR1 = {
          'w' => 'e',
          'N' => 'g195_fastq_stats_jtaly_crg_es',
          'binding_amount' => '2',
          'CMDNAME' => '/scratch/nfs/galaxy.crg.es/ogs/galaxy_195.sh',
          'binding_type' => 'set',
          'M' => {
                   'www-bi@galaxy.crg.es' => undef
                 },
          'binding_strategy' => 'linear',
          'l_hard' => {
                        'virtual_free' => '12G',
                        'h_rt' => '6:00:00'
                      },
          'shell' => 'n',
          'pe_min' => '2',
          'USER' => 'www-bi',
          'binding_socket' => '0',
          'e' => {
                   '/scratch/nfs/galaxy.crg.es/job_working_directory/000/195/195.drmerr' => undef
                 },
          'GROUP' => 'www-bi',
          'binding_core' => '0',
          'pe_max' => '2',
          'CMDARGS' => '0',
          'q_hard' => {
                        'www-el6' => undef
                      },
          'pe_name' => 'smp',
          'CLIENT' => 'drmaa',
          'b' => 'y',
          'R' => 'y',
          'VERSION' => '1.0',
          'CONTEXT' => 'client',
          'o' => {
                   '/scratch/nfs/galaxy.crg.es/job_working_directory/000/195/195.drmout' => undef
                 }
        };

The request for job 195 already contains a core binding strategy (binding_type, binding_strategy, binding_amount, binding_socket, binding_core), with the same values that our JSV script added to job 194.

The problem is that the second job submission inherits submission parameters from the first job and, because the JSV script does not allow users to specify a core binding strategy, the job is rejected. If we wait some time (600 seconds), a new submission works again.

Can anyone help us understand why the submission parameters are being inherited from one job to the next? Maybe the DRMAA session is not properly closed, or the environment is not cleaned?

Thank you for your help.

Best,
Jean-François

$ hg summary
parent: 8795:9fd7fe0c5712
 merge from stable
branch: default
commit: 1 modified, 59 unknown
update: (current)

--
#####################################
Jean-François Taly
Bioinformatician
Bioinformatics Core Facility
http://biocore.crg.cat
CRG - Centre de Regulació Genòmica (Room 439)
Parc de Recerca Biomèdica de Barcelona (PRBB)
Doctor Aiguader, 88
08003 Barcelona
Spain
email: jean-francois.taly@crg.eu
phone: +34 93 316 0202
fax: +34 93 316 0099
#####################################
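
A minimal sketch for narrowing this down outside Galaxy, assuming the `drmaa` Python package is importable on a submit host. It sends the same trivial job twice through a single DRMAA session, creating and deleting a fresh job template for each submission, so that a rejection of the second job would point at state kept by the session or the DRM client rather than at Galaxy. The function names, the `/bin/sleep` test command, and the native specification (guessed from the `qstat` output above) are illustrative assumptions, not Galaxy's actual configuration.

```python
def submit_twice(session, command="/bin/sleep", args=("10",), native_spec=""):
    """Submit the same command twice through one DRMAA session.

    A fresh job template is created and deleted for every submission, so
    nothing is deliberately carried over between the two jobs.
    Returns a list of (job_id, error) tuples; exactly one of the two is None.
    """
    results = []
    for _ in range(2):
        jt = session.createJobTemplate()
        try:
            jt.remoteCommand = command
            jt.args = list(args)
            jt.nativeSpecification = native_spec
            try:
                results.append((session.runJob(jt), None))
            except Exception as exc:  # e.g. DeniedByDrmException raised by the JSV
                results.append((None, str(exc)))
        finally:
            # Always release the template, even if the submission was denied.
            session.deleteJobTemplate(jt)
    return results


def run_against_cluster():
    """Run the check against a real cluster (needs drmaa and a submit host)."""
    import drmaa  # imported here so the helper above has no hard dependency

    # Assumption: options reconstructed from the qstat output, not from Galaxy.
    spec = "-q www-el6 -pe smp 2 -l virtual_free=12G,h_rt=21600 -R y"
    session = drmaa.Session()
    session.initialize()
    try:
        for job_id, error in submit_twice(session, native_spec=spec):
            print(job_id, error)
    finally:
        session.exit()
```

Calling `run_against_cluster()` prints one line per submission. If the second line shows the same "code 17" denial, the inherited binding parameters are reproducible without Galaxy; if both jobs are accepted, the way the Galaxy runner handles its session or job templates becomes the more likely suspect.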