Galaxy with Univa Grid Engine (UGE) instead of SGE?
Hello all,

Our local Galaxy server had been running happily under SGE, using one of the last free releases (not sure exactly which - I could ask). Due to concerns about long-term maintenance, the SysAdmin has moved us to an SGE-compatible setup - Univa Grid Engine (UGE).

However, in at least one respect this is not a drop-in replacement: while other cluster usage appears to be working fine, our Galaxy installation is not, e.g.

galaxy.jobs.runners.drmaa DEBUG 2013-01-15 17:14:33,660 (331:842) submitting file /mnt/galaxy/galaxy-central/database/pbs/galaxy_331:842.sh
galaxy.jobs.runners.drmaa DEBUG 2013-01-15 17:14:33,661 (331:842) command is: /mnt/galaxy/galaxy-central/extract_dataset_parts.sh /mnt/galaxy/galaxy-central/database/job_working_directory/000/331/task_0; blastp -query "/mnt/galaxy/galaxy-central/database/job_working_directory/000/331/task_0/dataset_344.dat" -db "/var/local/blast/ncbi/nr" -task blastp -evalue 0.001 -out /mnt/galaxy/galaxy-central/database/job_working_directory/000/331/task_0/dataset_373.dat -outfmt 5 -num_threads 8
galaxy.jobs.runners.drmaa ERROR 2013-01-15 17:14:33,666 Uncaught exception queueing job
Traceback (most recent call last):
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 146, in run_next
    self.queue_job( obj )
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 234, in queue_job
    job_id = self.ds.runJob(jt)
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/__init__.py", line 331, in runJob
    _h.c(_w.drmaa_run_job, jid, _ct.sizeof(jid), jobTemplate)
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
DeniedByDrmException: code 17: error: no suitable queues

Debugging this by attempting a manual submission,

$ qsub /mnt/galaxy/galaxy-central/database/pbs/galaxy_331:842.sh
Unable to run job: Colon (':') not allowed in objectname.
Exiting.

Renaming the file to replace the colon with (say) an underscore allows a manual qsub to work fine with UGE. I've edited Galaxy to avoid the colons (patch below) but the submission still fails.

Additionally, removing the SGE-specific settings in universe_wsgi.ini did allow the job to be submitted, but I am still having problems. Perhaps I need to fix all the other filenames too (e.g. stdout, stderr, error code), or do that in one go by removing the colon in the job name?

Has anyone else tried Galaxy under UGE, and do you have any advice?

Thanks,

Peter

--

Quick filename hack to avoid colons in job script filenames - might be better to avoid this in the job name itself?

$ hg diff
diff -r 1bfe2768026a lib/galaxy/jobs/runners/drmaa.py
--- a/lib/galaxy/jobs/runners/drmaa.py Mon Jan 14 17:21:25 2013 +0000
+++ b/lib/galaxy/jobs/runners/drmaa.py Tue Jan 15 18:44:31 2013 +0000
@@ -191,7 +191,7 @@
         job_name = ''.join( map( lambda x: x if x in ( string.letters + string.digits + '_' ) else '_', job_name ) )
         jt = self.ds.createJobTemplate()
-        jt.remoteCommand = "%s/galaxy_%s.sh" % (self.app.config.cluster_files_directory, job_wrapper.get_id_tag())
+        jt.remoteCommand = ("%s/galaxy_%s.sh" % (self.app.config.cluster_files_directory, job_wrapper.get_id_tag())).replace(":", "_")
         jt.jobName = job_name
         jt.outputPath = ":%s" % ofile
         jt.errorPath = ":%s" % efile
@@ -229,6 +229,7 @@
         log.debug("(%s) submitting file %s" % ( galaxy_id_tag, jt.remoteCommand ) )
         log.debug("(%s) command is: %s" % ( galaxy_id_tag, command_line ) )
+        log.debug("(%s) spec: %s" % ( galaxy_id_tag, native_spec))
         # runJob will raise if there's a submit problem
         if self.external_runJob_script is None:
             job_id = self.ds.runJob(jt)
@@ -423,7 +424,7 @@
         drm_job_state.ofile = "%s.drmout" % os.path.join(os.getcwd(), job_wrapper.working_directory, job_wrapper.get_id_tag())
         drm_job_state.efile = "%s.drmerr" % os.path.join(os.getcwd(), job_wrapper.working_directory, job_wrapper.get_id_tag())
         drm_job_state.ecfile = "%s.drmec" % os.path.join(os.getcwd(), job_wrapper.working_directory, job_wrapper.get_id_tag())
-        drm_job_state.job_file = "%s/galaxy_%s.sh" % (self.app.config.cluster_files_directory, job.get_id())
+        drm_job_state.job_file = ("%s/galaxy_%s.sh" % (self.app.config.cluster_files_directory, job.get_id())).replace(":", "_")
         drm_job_state.job_id = str( job_id )
         drm_job_state.runner_url = job_wrapper.get_job_runner_url()
         job_wrapper.command_line = job.get_command_line()
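The idea behind the filename hack can be sketched in isolation. This is only an illustration for current Python (the Galaxy code in the diff is Python 2), and it sanitises just the id tag rather than the whole path as the patch does. The names `sanitize_tag` and `job_script_path` are made up for this example and are not part of Galaxy.

```python
import string

# Characters assumed to be safe in both Grid Engine object names and filenames.
SAFE_CHARS = set(string.ascii_letters + string.digits + "_")


def sanitize_tag(tag):
    """Replace every character outside SAFE_CHARS with an underscore."""
    return "".join(ch if ch in SAFE_CHARS else "_" for ch in str(tag))


def job_script_path(cluster_files_directory, id_tag):
    """Build a colon-free job script path (hypothetical helper, not Galaxy's API)."""
    return "%s/galaxy_%s.sh" % (cluster_files_directory, sanitize_tag(id_tag))


if __name__ == "__main__":
    # A task-split id tag such as "331:842" becomes "331_842".
    print(job_script_path("/mnt/galaxy/galaxy-central/database/pbs", "331:842"))
```

Sanitising only the tag leaves any colon in the configured directory untouched, which may or may not be what is wanted on a given cluster.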
On Tue, Jan 15, 2013 at 7:02 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Hello all,
Our local Galaxy server had been running happily under SGE, using one of the last free releases (not sure exactly which - I could ask). Due to concerns about long-term maintenance, the SysAdmin has moved us to an SGE-compatible setup - Univa Grid Engine (UGE).
However, in at least one respect this is not a drop-in replacement: while other cluster usage appears to be working fine, our Galaxy installation is not, e.g.
...
Debugging this by attempting a manual submission,
$ qsub /mnt/galaxy/galaxy-central/database/pbs/galaxy_331:842.sh
Unable to run job: Colon (':') not allowed in objectname.
Exiting.
Renaming the file to replace the colon with (say) an underscore allows a manual qsub to work fine with UGE. I've edited Galaxy to avoid the colons (patch below) but the submission still fails.
Additionally, removing the SGE-specific settings in universe_wsgi.ini did allow the job to be submitted, but I am still having problems. Perhaps I need to fix all the other filenames too (e.g. stdout, stderr, error code), or do that in one go by removing the colon in the job name?
Part of the problem I am facing involves the SGE/UGE-specific arguments I have defined in universe_wsgi.ini (which still work fine if I use them with qsub manually). My original settings looked like this:

[galaxy:tool_runners]
ncbi_blastp_wrapper = drmaa://-V -l hostname="n08-04-008-*|n11-04-048-cortana" -pe smp 4/

That worked fine in Galaxy with SGE, and still works fine with UGE using qsub manually. However, the "-pe smp 4" part does not work for queue submission anymore with UGE. Simplifying to:

[galaxy:tool_runners]
ncbi_blastp_wrapper = drmaa://-V -pe smp 4/

fails:

galaxy.jobs.handler INFO 2013-01-16 11:49:39,603 (346) Job dispatched
galaxy.jobs.runners.drmaa DEBUG 2013-01-16 11:49:40,346 (346) submitting file /mnt/galaxy/galaxy-central/database/pbs/galaxy_346.sh
galaxy.jobs.runners.drmaa DEBUG 2013-01-16 11:49:40,347 (346) command is: blastp -version &> /mnt/galaxy/galaxy-central/database/tmp/GALAXY_VERSION_STRING_346; blastp -query "/mnt/galaxy/galaxy-central/database/files/000/dataset_344.dat" -db "/mnt/shared/cluster/blast/galaxy/oomycete_CDS" -task blastp -evalue 0.001 -out /mnt/galaxy/galaxy-central/database/files/000/dataset_394.dat -outfmt 6 -num_threads 8
galaxy.jobs.runners.drmaa DEBUG 2013-01-16 11:49:40,347 (346) spec: -pe smp 4
galaxy.jobs.runners.drmaa ERROR 2013-01-16 11:49:40,351 Uncaught exception queueing job
Traceback (most recent call last):
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 146, in run_next
    self.queue_job( obj )
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 235, in queue_job
    job_id = self.ds.runJob(jt)
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/__init__.py", line 331, in runJob
    _h.c(_w.drmaa_run_job, jid, _ct.sizeof(jid), jobTemplate)
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
DeniedByDrmException: code 17: error: no suitable queues

Clearly something is going wrong in passing the option to UGE. Note this works at the command line:

$ qsub -pe smp 4 /mnt/galaxy/galaxy-central/database/pbs/galaxy_346.sh
Your job 252 ("galaxy_346.sh") has been submitted
$ qstat | grep 252
252 0.60500 galaxy_346 galaxy qw 01/16/2013 11:50:41 4

If I remove this option, job submission works. Given Galaxy gives UGE the 'native spec' as a string, I don't think this is a Galaxy problem. Rather, it could be an incompatibility in UGE versus SGE? I can probably work around this particular issue - there are other ways to request four processors and/or a whole cluster node.

So, to recap, I needed to remove any colons in the job script filenames (crude patch in my previous email), and tweak my SGE/UGE settings in the universe_wsgi.ini file. I would also like to see a clear error message for the user when a DeniedByDrmException is raised during job submission - currently this is not handled gracefully at all.

I've now had some cluster jobs succeed via Galaxy, but it does not seem to be as reliable as under SGE. Perhaps there is some heavy IO on the cluster at the moment which may be confusing things...

Peter
On Wed, Jan 16, 2013 at 7:28 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Renaming the file to replace the colon with (say) an underscore allows a manual qsub to work fine with UGE. I've edited Galaxy to avoid the colons (patch below) but the submission still fails.
Hi Peter,

After seeing your email, I now wonder whether the problem I described here [1], which never got an answer, is related to your findings while trying UGE.

[1] http://dev.list.galaxyproject.org/Issue-when-enabling-use-tasked-jobs-with-t...

The only major difference I can see between job submissions with and without the tasked option enabled is a colon in the name. See the relevant output from "qstat -f JOBID" below.

Without tasked:

Error_Path = /local/opt/galaxy/galaxy-dist.torque/database/job_working_directory/000/34/34.drmerr
Output_Path = /local/opt/galaxy/galaxy-dist.torque/database/job_working_directory/000/34/34.drmout

The job finishes and Galaxy is able to collect the drmerr and drmout files.

With tasked:

Error_Path = /local/opt/galaxy/galaxy-dist.torque/database/job_working_directory/000/33/task_4/33:30.drmerr
Output_Path = /local/opt/galaxy/galaxy-dist.torque/database/job_working_directory/000/33/task_4/33:30.drmout
sched_hint = Post job file processing error; job 40.head.local on host node01.local/7+node01.local/6+node01.local/5+node01.local/4+node01.local/3+node01.local/2+node01.local/1+node01.brel.local/0
Unable to copy file /var/spool/torque/spool/40.head.local.OU to galaxy@/local/opt/galaxy/galaxy-dist.torque/database/job_working_directory/000/33/task_4/33:30.drmout
*** error from copy
cp: cannot create regular file `galaxy@/local/opt/galaxy/galaxy-dist.torque/database/job_working_directory/000/33/task_4/33:30.drmout': No such file or directory
*** end error output
Output retained on that host in: /var/spool/torque/undelivered/40.head.local.OU
Unable to copy file /var/spool/torque/spool/40.head.local.ER to galaxy@/local/opt/galaxy/galaxy-dist.torque/database/job_working_directory/000/33/task_4/33:30.drmerr
*** error from copy
cp: cannot create regular file `galaxy@/local/opt/galaxy/galaxy-dist.torque/database/job_working_directory/000/33/task_4/33:30.drmerr': No such file or directory
*** end error output
Output retained on that host in: /var/spool/torque/undelivered/40.head.local.ER

The job finishes, but Galaxy is not able to collect the drmerr and drmout files; the job turns green in the history panel but includes partial information about not being able to collect the drmerr and drmout files.

I will see whether switching from a colon to an underscore helps in this situation too, although I'm also worried about the "galaxy@" in the file path. I don't understand why it is there.

I'm using the latest Galaxy Dist, Torque 4.1.4, Maui 3.3.1 and pbs-drmaa 1.0.12. I tried using pbs-python but that failed for me. I also tried libdrmaa from this Torque version with exactly the same results.

Best,

Carlos
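The comparison above can be automated when checking many jobs. The following is a rough sketch that scans "qstat -f" style text for Error_Path and Output_Path entries whose file name contains a colon. The function name `paths_with_colon` is invented for this example, and it assumes each field sits on a single "name = value" line, which may not hold if qstat wraps long values.

```python
def paths_with_colon(qstat_text):
    """Return {field: path} for Error_Path/Output_Path whose file name contains ':'."""
    flagged = {}
    for line in qstat_text.splitlines():
        field, sep, value = line.partition("=")
        field, value = field.strip(), value.strip()
        if sep and field in ("Error_Path", "Output_Path"):
            # Only look at the last path component, so a leading "host:" prefix is ignored.
            file_name = value.rsplit("/", 1)[-1]
            if ":" in file_name:
                flagged[field] = value
    return flagged


if __name__ == "__main__":
    sample = """\
Error_Path = /local/opt/galaxy/galaxy-dist.torque/database/job_working_directory/000/33/task_4/33:30.drmerr
Output_Path = /local/opt/galaxy/galaxy-dist.torque/database/job_working_directory/000/34/34.drmout
"""
    for field, path in paths_with_colon(sample).items():
        print(field, path)
```

This only reports the suspicious names; it does not show that the colon is the cause of the copy failure.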
On Thursday, January 17, 2013, Carlos Borroto wrote:
Hi Peter,
After seeing your email, I now wonder whether the problem I described here [1], which never got an answer, is related to your findings while trying UGE.
[1] http://dev.list.galaxyproject.org/Issue-when-enabling-use-tasked-jobs-with-t...
The only major difference I can see between job submissions with and without the tasked option enabled is a colon in the name.
Some overlap yes, and I do normally have BLAST running with task splitting. That does probably explain the source of the colon. I now suspect the colon is only a problem in SGE / UGE at the command line using qsub - it may work via the Python API.

I have also seen some similar problems with sub-jobs failing yet Galaxy still showing the task as a green success (not sure if that has happened recently or not). Yesterday while our UGE cluster was under load, often one or more of my split task jobs would fail (Galaxy could not collect the output files - presumably network related for accessing the shared storage), but Galaxy was correctly flagging these as red failed jobs.

Cluster problems are hard to debug - especially on a live cluster where there are other users active :(

Peter
On Thu, Jan 17, 2013 at 11:14 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Some overlap yes, and I do normally have BLAST running with task splitting. That does probably explain the source of the colon. I now suspect the colon is only a problem in SGE / UGE at the command line using qsub - it may work via the Python API.
Confirmed - on our system using UGE (and likely our old system with SGE), when task split jobs are assigned job shell scripts with a colon in the filename it does work. However, the presence of the colon prevents manual submission of the same script, which I have found extremely useful in debugging. e.g.

$ qsub sleep:colon.sh
Unable to run job: Colon (':') not allowed in objectname.
Exiting.

Escaping the colon did not help:

$ qsub sleep\:colon.sh
Unable to run job: Colon (':') not allowed in objectname.
Exiting.

(Sadly the Univa Grid Engine (UGE) version of qsub does not appear to have a version switch so I'm not sure which version this is)

For this reason alone, I would prefer Galaxy avoided colons in its job filenames - and I suspect there could be other corner cases with other cluster systems and shared file-systems (as found by Carlos).

Regards,

Peter
Hey Peter, Carlos,

Thanks for tracking this down, I've updated the composite job/task id tag to use an underscore instead in changeset 8658:c901bbb4eca8.

-Dannon

On Jan 23, 2013, at 12:00 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Confirmed - on our system using UGE (and likely our old system with SGE), when task split jobs are assigned job shell scripts with a colon in the filename it does work. However, the presence of the colon prevents manual submission of the same script, which I have found extremely useful in debugging. e.g.
$ qsub sleep:colon.sh
Unable to run job: Colon (':') not allowed in objectname.
Exiting.
Escaping the colon did not help:
$ qsub sleep\:colon.sh
Unable to run job: Colon (':') not allowed in objectname.
Exiting.
(Sadly the Univa Grid Engine UGE version of qsub does not appear to have a version switch so I'm not sure which version this is)
For this reason alone, I would prefer Galaxy avoided colons in its job filenames - and I suspect there could be other corner cases with other cluster systems and shared file-systems (as found by Carlos).
Regards,
Peter
On Thu, Jan 24, 2013 at 4:04 PM, Dannon Baker <dannonbaker@me.com> wrote:
Hey Peter, Carlos,
Thanks for tracking this down, I've updated the composite job/task id tag to use an underscore instead in changeset 8658:c901bbb4eca8.
-Dannon
I hoped the colon/underscore change would be one line but hadn't managed to track down which line ;)

Thanks, that will at least make debugging failed split task jobs easier for me with SGE/UGE - and hopefully will solve the issue Carlos was having with the filenames.

Cheers,

Peter
On Thu, Jan 24, 2013 at 11:04 AM, Dannon Baker <dannonbaker@me.com> wrote:
Hey Peter, Carlos,
Thanks for tracking this down, I've updated the composite job/task id tag to use an underscore instead in changeset 8658:c901bbb4eca8.
Hi Dannon,

Thanks for the quick fix. I applied this fix manually to the latest galaxy-dist and I can confirm the issue I was seeing is gone.

Thanks again,

Carlos
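For readers who want a picture of the kind of change discussed in this thread, here is a small sketch of a composite job/task id tag joined with an underscore rather than a colon. It does not reproduce changeset 8658:c901bbb4eca8; the function name `composite_id_tag` and its parameters are hypothetical.

```python
def composite_id_tag(job_id, task_id=None, separator="_"):
    """Join job and task ids; a job without tasks keeps just its job id."""
    if task_id is None:
        return str(job_id)
    return "%s%s%s" % (job_id, separator, task_id)


if __name__ == "__main__":
    # Old style tag with a colon, as seen in the logs earlier in the thread.
    print(composite_id_tag(331, 842, separator=":"))
    # Underscore variant, which gives qsub-friendly script names.
    print("galaxy_%s.sh" % composite_id_tag(331, 842))
```

Changing the separator at the point where the tag is built means every derived filename (job script, stdout, stderr, exit code) picks up the same colon-free name.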
Participants (3): Carlos Borroto, Dannon Baker, Peter Cock