Hello all,
QUESTION: When submitting jobs to the cluster as the real user,
how should sudo scripts/drmaa_external_runner.py be told which
Python to use, and how would it activate the venv if needed for
the DRMAA dependency?
BACKGROUND:
We're currently trying Galaxy out on a new CentOS 6 VM, with a matching
CentOS 6 cluster, where jobs are submitted to SGE via DRMAA and
run as the submitting user's own Linux account rather than a generic
Galaxy Linux user account.
This is documented on the wiki here:
https://wiki.galaxyproject.org/Admin/Config/Performance/Cluster
This all seemed to be working under Galaxy v15.10 (using eggs),
but we're now targeting the recently released Galaxy v16.01 (using
wheels) instead and have run into problems.
https://github.com/galaxyproject/galaxy/issues/1596
Because Galaxy is deprecating support for Python 2.6 (the default
bundled with CentOS 6), we're now using a local copy of Python 2.7
(compiled from source) on a shared mount. This mismatch seems
to be the root cause of the problem I will now describe.
During job submission to SGE, Galaxy will attempt to run a command like this:

$ sudo scripts/drmaa_external_runner.py 1005 \
    /mnt/shared/galaxy/galaxy-dist/database/sge/132.jt_json
In the terminal output from ./run.sh we'd see:
RuntimeError: External_runjob failed (exit code 1)
Child process reported error:
Traceback (most recent call last):
  File "/mnt/shared/galaxy/galaxy-dist/scripts/drmaa_external_runner.py", line 15, in <module>
    import drmaa
ImportError: No module named drmaa
Although a drmaa wheel was installed within the Python 2.7 virtual
environment under ~/galaxy-dist/.venv, Galaxy makes no attempt
to activate that venv when invoking scripts/drmaa_external_runner.py.
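For what it's worth, activating the venv shouldn't strictly be necessary:
a virtualenv's own interpreter finds the venv's site-packages by itself.
A quick sketch (the demo venv path is arbitrary, and I'm using Python 3's
built-in venv module here; under Python 2.7 you'd create it with
virtualenv instead):

```shell
# A virtualenv's interpreter resolves its own site-packages without
# "source bin/activate" ever running:
python3 -m venv /tmp/demo-venv
/tmp/demo-venv/bin/python -c 'import sys; print(sys.prefix)'

# So in principle (untested, paths from our setup) pointing sudo
# straight at the venv interpreter might pick up the drmaa wheel:
# sudo /mnt/shared/galaxy/galaxy-dist/.venv/bin/python \
#     scripts/drmaa_external_runner.py 1005 \
#     /mnt/shared/galaxy/galaxy-dist/database/sge/132.jt_json
```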
We then installed the drmaa package under our local copy of Python 2.7,
and realised sudo scripts/drmaa_external_runner.py was not even
using this copy of Python. Changing the shebang line was a
crude way to solve that (see below).
This in turn led to finding that $DRMAA_LIBRARY_PATH and $SGE_ROOT
were not set in the sudo environment. Again, you can hack around this
by modifying scripts/drmaa_external_runner.py (see below).
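A less invasive hack might be to forward the variables on the sudo
command line itself via env(1), instead of patching the script. A rough
sketch (wrap_with_sudo is my own illustrative helper, not anything in
Galaxy):

```python
import os


def wrap_with_sudo(cmd, keep=("DRMAA_LIBRARY_PATH", "SGE_ROOT")):
    """Prefix cmd with sudo + env(1), forwarding selected variables.

    sudo's env_reset scrubs the environment by default, so we re-state
    the variables explicitly on the command line.
    """
    prefix = ["sudo", "env"]
    for name in keep:
        value = os.environ.get(name)
        if value is not None:
            prefix.append("%s=%s" % (name, value))
    return prefix + list(cmd)


os.environ["SGE_ROOT"] = "/mnt/sge"  # example value
print(wrap_with_sudo(["scripts/drmaa_external_runner.py", "1005"]))
```

Run via subprocess, that would be functionally equivalent to the
os.environ assignments in the workaround diff, but without editing the
script itself.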
In our case, I suspect the least invasive change would be to install
the drmaa library under the system-provided Python 2.6, and
let sudo scripts/drmaa_external_runner.py execute that way.
We still need to work out why sudo scripts/drmaa_external_runner.py
does not see $DRMAA_LIBRARY_PATH and $SGE_ROOT, but
we have some clues to follow up on:
http://stackoverflow.com/questions/257616/sudo-changes-path-why
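That post points at the likely cause: sudo's env_reset scrubs the
environment, and its secure_path setting replaces $PATH. If so, a
sudoers tweak may be all that's needed - untested on our setup, so
treat this as a sketch:

```
# /etc/sudoers fragment (edit with visudo): let these variables
# survive sudo's default env_reset. Could also be scoped more
# narrowly than a global Defaults line.
Defaults env_keep += "DRMAA_LIBRARY_PATH SGE_ROOT"
```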
Peter
P.S. See also
https://twitter.com/pjacock/status/704335582651162624
--
Here's our workaround diff - lots of hard-coded strings, not portable
at all, but it worked for testing/debugging:
$ git diff scripts/drmaa_external_runner.py
diff --git a/scripts/drmaa_external_runner.py b/scripts/drmaa_external_runner.py
index a1474fe..61d2383 100755
--- a/scripts/drmaa_external_runner.py
+++ b/scripts/drmaa_external_runner.py
@@ -1,5 +1,6 @@
+#!/mnt/shared/galaxy/apps/python/2.7.11/bin/python
+#Was
 #!/usr/bin/env python
-
 """
 Submit a DRMAA job given a user id and a job template file (in JSON format)
 defining any or all of the following: args, remoteCommand, outputPath,
@@ -12,8 +13,15 @@ import os
 import pwd
 import sys
+# Hack
+# print "$DRMAA_LIBRARY_PATH is %s" %
+# os.environ.get('DRMAA_LIBRARY_PATH')
+# print "$SGE_ROOT is %s" % os.environ.get('SGE_ROOT')
+os.environ['DRMAA_LIBRARY_PATH'] = '/mnt/sge/lib/lx-amd64/libdrmaa.so'
+os.environ['SGE_ROOT'] = '/mnt/sge'
+
 import drmaa
 
 DRMAA_jobTemplate_attributes = [ 'args', 'remoteCommand',
                                  'outputPath', 'errorPath', 'nativeSpecification',
                                  'workingDirectory', 'jobName',
                                  'email', 'project' ]