On Tue, Jan 15, 2013 at 7:02 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Hello all,
Our local Galaxy server had been running happily under SGE, using one of the last free releases (not sure exactly which - I could ask). Due to concerns about long term maintenance, the SysAdmin has moved us to an SGE compatible setup - Univa Grid Engine (UGE).
However, in at least one respect this is not a drop-in replacement: while other cluster usage appears to be working fine, our Galaxy installation is not, e.g.
...
Debugging this by attempting a manual submission:

$ qsub /mnt/galaxy/galaxy-central/database/pbs/galaxy_331:842.sh
Unable to run job: Colon (':') not allowed in objectname.
Exiting.

Renaming the file to replace the colon with (say) an underscore allows a manual qsub to work fine with UGE. I've edited Galaxy to avoid the colons (patch below) but the submission still fails.
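
A minimal sketch of the renaming idea, assuming the colon is the only offending character (it is the only one the qsub error above confirms). The helper name safe_job_name is hypothetical and is not part of Galaxy:

```python
def safe_job_name(name, replacement="_"):
    """Return name with every colon replaced.

    Hypothetical helper: only ':' is known to be rejected by UGE here,
    so no other characters are touched.
    """
    return name.replace(":", replacement)


# The script name from the failed manual submission above.
print(safe_job_name("galaxy_331:842.sh"))  # galaxy_331_842.sh
```

The same function could be applied to the stdout, stderr and error code filenames, or once to the job name they are all derived from, if that turns out to be the common source of the colon.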
Additionally removing the SGE-specific settings in universe_wsgi.ini did allow the job to be submitted, but I am still having problems. Perhaps I need to fix all the other filenames too (e.g. stdout, stderr, error code), or do that in one go by removing the colon in the job name?
Part of the problem I am facing involves the SGE/UGE specific arguments I have defined in universe_wsgi.ini (which still work fine if I use them with qsub manually). My original settings looked like this:

[galaxy:tool_runners]
ncbi_blastp_wrapper = drmaa://-V -l hostname="n08-04-008-*|n11-04-048-cortana" -pe smp 4/

That worked fine in Galaxy with SGE, and still works fine with UGE using qsub manually. However, the "-pe smp 4" part does not work for queue submission anymore with UGE. Simplifying to:

[galaxy:tool_runners]
ncbi_blastp_wrapper = drmaa://-V -pe smp 4/

fails:

galaxy.jobs.handler INFO 2013-01-16 11:49:39,603 (346) Job dispatched
galaxy.jobs.runners.drmaa DEBUG 2013-01-16 11:49:40,346 (346) submitting file /mnt/galaxy/galaxy-central/database/pbs/galaxy_346.sh
galaxy.jobs.runners.drmaa DEBUG 2013-01-16 11:49:40,347 (346) command is: blastp -version &> /mnt/galaxy/galaxy-central/database/tmp/GALAXY_VERSION_STRING_346; blastp -query "/mnt/galaxy/galaxy-central/database/files/000/dataset_344.dat" -db "/mnt/shared/cluster/blast/galaxy/oomycete_CDS" -task blastp -evalue 0.001 -out /mnt/galaxy/galaxy-central/database/files/000/dataset_394.dat -outfmt 6 -num_threads 8
galaxy.jobs.runners.drmaa DEBUG 2013-01-16 11:49:40,347 (346) spec: -pe smp 4
galaxy.jobs.runners.drmaa ERROR 2013-01-16 11:49:40,351 Uncaught exception queueing job
Traceback (most recent call last):
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 146, in run_next
    self.queue_job( obj )
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 235, in queue_job
    job_id = self.ds.runJob(jt)
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/__init__.py", line 331, in runJob
    _h.c(_w.drmaa_run_job, jid, _ct.sizeof(jid), jobTemplate)
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
DeniedByDrmException: code 17: error: no suitable queues

Clearly something is going wrong in passing the option to UGE. Note this works at the command line:

$ qsub -pe smp 4 /mnt/galaxy/galaxy-central/database/pbs/galaxy_346.sh
Your job 252 ("galaxy_346.sh") has been submitted
$ qstat | grep 252
252 0.60500 galaxy_346 galaxy qw 01/16/2013 11:50:41 4

If I remove this option, job submission works. Given Galaxy gives UGE the 'native spec' as a string, I don't think this is a Galaxy problem. Rather, it could be an incompatibility in UGE versus SGE? I can probably work around this particular issue - there are other ways to request four processors and/or a whole cluster node.

So, to recap, I needed to remove any colons from the job script filenames (crude patch in my previous email), and tweak my SGE/UGE settings in the universe_wsgi.ini file. I would also like to see a clear error message for the user when a DeniedByDrmException is raised during job submission - currently this is not handled gracefully at all.

I've now had some cluster jobs succeed via Galaxy, but it does not seem to be as reliable as under SGE. Perhaps there is some heavy IO on the cluster at the moment which may be confusing things...

Peter