I have closely followed the instructions for setting up a local cluster (http://wiki.galaxyproject.org/Admin/Config/Performance/Cluster). Frankly, the "Galaxy Configuration" section was not clear to me: I am not sure whether the steps outlined there
should be applied to the server’s universe_wsgi.ini or to the nodes’. I may therefore have overlooked some steps, but here is a summary of what I have done on the server:
All my nodes are configured so that the galaxy user can ssh/scp between them without a password. The galaxy user also has sudo rights on the Galaxy server.
Torque has been configured and tested between the nodes, so the cluster itself is working fine.
In the universe_wsgi.ini file on the server node:
--------------------------
start_job_runners = pbs,drmaa
drmaa_external_runjob_~
drmaa_external_killer~
external_chown_~
pbs_application_server = galaxyhost (server)
pbs_stage_path = /tmp/galaxy_stage/
pbs_dataset_server = galaxyhost (server) ## This is the same as pbs_application_server
Also, ln -s /nfsexport/galaxy_stage /usr/local/galaxy/galaxy-dist/database/tmp
outputs_to_working_directory = False (if I change this to True, Galaxy will not start)
---------------------------------------------
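To double-check which of these settings are actually set (and not commented out) in the file, the values can be read back programmatically. This is only a minimal sketch, assuming the options live in the [app:main] section of universe_wsgi.ini; the helper name pbs_settings and the sample text are made up for illustration.

```python
import configparser

# Option names taken from the summary above, spelled as they appear in the ini file.
WATCHED = (
    "start_job_runners",
    "pbs_application_server",
    "pbs_stage_path",
    "pbs_dataset_server",
    "outputs_to_working_directory",
)


def pbs_settings(ini_text, section="app:main"):
    """Return the watched options that are set (uncommented) in ini_text.

    Assumption: the options live in the [app:main] section.
    """
    # interpolation=None leaves '%' characters in other options alone;
    # strict=False tolerates an option that happens to be listed twice.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    if not parser.has_section(section):
        return {}
    return {
        name: parser.get(section, name)
        for name in WATCHED
        if parser.has_option(section, name)
    }


if __name__ == "__main__":
    # Hypothetical excerpt; on the server, read the real file instead:
    #   sample = open("universe_wsgi.ini").read()
    sample = """
[app:main]
start_job_runners = pbs,drmaa
pbs_application_server = galaxyhost
pbs_stage_path = /tmp/galaxy_stage/
#pbs_dataset_server = galaxyhost
outputs_to_working_directory = False
"""
    for name, value in pbs_settings(sample).items():
        print(name, "=", value)
```

Any option missing from the printed list is either commented out or absent from that section, which is an easy thing to overlook in a long ini file.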
After restarting Galaxy on the server node, the job appears to be submitted and its status is “R”. When I run top on the node the job was sent to, I see two processes, ssh and scp, started by the Galaxy server. This tells
me that something is being copied over to the node, but I am not sure what or where to.
After a while, the job status changes to “W”:
qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
68.ngsgalaxy01            ...xy@idtdna.com galaxy                 0 W batch
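The single letter in the S column is the job state. As a reading aid, here is a small sketch that pulls the state out of a default qstat listing like the one above and attaches the meaning given in the Torque qstat man page. The function name job_states is hypothetical, and the parsing assumes the six-column default layout shown here.

```python
# State letters as described in the Torque qstat man page.
STATE_MEANING = {
    "C": "completed after having run",
    "E": "exiting after having run",
    "H": "held",
    "Q": "queued, eligible to run or be routed",
    "R": "running",
    "T": "being moved to a new location",
    "W": "waiting for its execution time to be reached",
    "S": "suspended",
}


def job_states(qstat_text):
    """Map job id -> (state letter, meaning) for a default qstat listing.

    Assumes the column layout: Job id, Name, User, Time Use, S, Queue.
    """
    states = {}
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row and the dashed ruler.
        if len(fields) < 6 or fields[0] == "Job" or set(fields[0]) == {"-"}:
            continue
        job_id, state = fields[0], fields[-2]
        states[job_id] = (state, STATE_MEANING.get(state, "unknown state"))
    return states


if __name__ == "__main__":
    listing = """\
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
68.ngsgalaxy01 ...xy@idtdna.com galaxy 0 W batch
"""
    for job_id, (state, meaning) in job_states(listing).items():
        print(job_id, state, "-", meaning)
```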
Here is what I see in the log when the job is submitted:
>>>>>>>>>>>>>>>>>>
galaxy.jobs DEBUG 2013-03-15 15:34:54,183 (341) Working directory for job is: /usr/local/galaxy/galaxy-dist/database/job_working_directory/000/341
galaxy.jobs.handler DEBUG 2013-03-15 15:34:54,183 dispatching job 341 to pbs runner
galaxy.jobs.handler INFO 2013-03-15 15:34:54,231 (341) Job dispatched
galaxy.tools DEBUG 2013-03-15 15:34:54,309 Building dependency shell command for dependency 'samtools'
galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:34:54,391 (341) submitting file /usr/local/galaxy/galaxy-dist/database/pbs/341.sh
galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:34:54,391 (341) command is: PACKAGE_BASE=/usr/local/galaxy/software/samtools/0.1.16; export PACKAGE_BASE; . /usr/local/galaxy/software/samtools/0.1.16/env.sh; samtools flagstat "/usr/local/galaxy/galaxy-dist/database/files/000/dataset_319.dat" > "/usr/local/galaxy/galaxy-dist/database/files/000/dataset_384.dat"
galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:34:54,394 (341) queued in default queue as 70.ngsgalaxy01.idtdna.com
galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:34:54,966 (341/70.ngsgalaxy01.idtdna.com) PBS job state changed from N to R
>>>>>>>>>>>>>>>>>>
Here is the log once the ssh/scp processes on the node have finished:
>>>>>>>>>>>>>>>>>>>>
galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:37:00,815 (341/70.ngsgalaxy01.idtdna.com) PBS job state changed from R to W
>>>>>>>>>>>>>>>>>>>>
Here is the log when I qdel that job:
>>>>>>>>>>>>>>>>>>>>
galaxy.jobs.runners.pbs WARNING 2013-03-15 15:39:20,016 Exit code was invalid. Using 0.
galaxy.jobs DEBUG 2013-03-15 15:39:20,033 (341) Changing ownership of working directory with: /usr/bin/sudo -E scripts/external_chown_script.py /usr/local/galaxy/galaxy-dist/database/job_working_directory/000/341 galaxy 10020
galaxy.jobs ERROR 2013-03-15 15:39:20,071 (341) Failed to change ownership of /usr/local/galaxy/galaxy-dist/database/job_working_directory/000/341, failing
Traceback (most recent call last):
  File "/usr/local/galaxy/galaxy-dist/lib/galaxy/jobs/__init__.py", line 336, in finish
    self.reclaim_ownership()
  File "/usr/local/galaxy/galaxy-dist/lib/galaxy/jobs/__init__.py", line 909, in reclaim_ownership
    self._change_ownership( self.galaxy_system_pwent[0], str( self.galaxy_system_pwent[3] ) )
  File "/usr/local/galaxy/galaxy-dist/lib/galaxy/jobs/__init__.py", line 895, in _change_ownership
    assert p.returncode == 0
AssertionError
galaxy.datatypes.metadata DEBUG 2013-03-15 15:39:20,160 Cleaning up external metadata files
galaxy.jobs.runners.pbs WARNING 2013-03-15 15:39:20,172 Unable to cleanup: [Errno 2] No such file or directory: '/usr/local/galaxy/galaxy-dist/database/pbs/341.o'
galaxy.jobs.runners.pbs WARNING 2013-03-15 15:39:20,173 Unable to cleanup: [Errno 2] No such file or directory: '/usr/local/galaxy/galaxy-dist/database/pbs/341.e'
galaxy.jobs.runners.pbs WARNING 2013-03-15 15:39:20,173 Unable to cleanup: [Errno 2] No such file or directory: '/usr/local/galaxy/galaxy-dist/database/pbs/341.ec'
10.7.10.201 - - [15/Mar/2013:15:39:22 -0500] "GET /api/histories/5a1cff6882ddb5b2 HTTP/1.0" 200 - "http://10.7.10.31/history" "Mozilla/5.0 (Windows NT 5.2; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0"
10.7.10.201 - - [15/Mar/2013:15:39:22 -0500] "GET /api/histories/5a1cff6882ddb5b2/contents?ids=bbbfa414ae315caf HTTP/1.0" 200 - "http://10.7.10.31/history" "Mozilla/5.0 (Windows NT 5.2; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0"
>>>>>>>>>>>>>>>>>>>>>>
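The traceback only shows that the sudo call returned a non-zero exit code, not why. One way to see the underlying message would be to rerun the same command by hand and capture its output. Below is a minimal sketch, assuming it is run as the galaxy user from the galaxy-dist directory; the helper names chown_command and run_and_report are hypothetical and not part of Galaxy.

```python
import subprocess


def chown_command(job_dir, user, gid, script="scripts/external_chown_script.py"):
    """Build the same argument list that appears in the log above."""
    return ["/usr/bin/sudo", "-E", script, job_dir, user, str(gid)]


def run_and_report(cmd):
    """Run cmd and return (exit code, stdout, stderr) instead of asserting."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    # Values copied from the log line for job 341.
    cmd = chown_command(
        "/usr/local/galaxy/galaxy-dist/database/job_working_directory/000/341",
        "galaxy",
        10020,
    )
    print(" ".join(cmd))
    # Uncomment on the Galaxy server; sudo may prompt for a password or refuse.
    # code, out, err = run_and_report(cmd)
    # print("exit code:", code)
    # print("stderr:", err)
```

Whatever sudo or the script writes to stderr there should say more than the bare AssertionError does.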
Is there anything I am missing or doing wrong?
Regards,