I have closely followed the instructions on how to set up a local cluster (http://wiki.galaxyproject.org/Admin/Config/Performance/Cluster). Frankly, the "Galaxy Configuration" section was not clear to me: I am not sure whether the steps outlined there should be applied to the server's universe_wsgi.ini or to the nodes'. So I may have overlooked some steps there, but here is a summary of what I am doing on the server.

All my nodes have been configured so that the galaxy user can ssh/scp between nodes without a password. The galaxy user also has sudo rights on the Galaxy server. Torque has been configured and tested between the nodes, so the cluster itself is working fine.

In the universe_wsgi.ini file on the server node:
--------------------------
start_job_runners = pbs,drmaa
drmaa_external_runjob_~
drmaa_external_killer~
external_chown_~
pbs_application_server = galaxyhost (server)
pbs_stage_path = /tmp/galaxy_stage/
pbs_dataset_server = galaxyhost (server)   ## the same as pbs_application_server
Also: ln -s /nfsexport/galaxy_stage /usr/local/galaxy/galaxy-dis/database/tmp
outputs_to_working_directory = False   (if I change this to True, Galaxy will not start)
---------------------------------------------

After restarting Galaxy on the server node, the job seems to be submitted and its status is "R". When I run top on the node the job was sent to, I see two processes, ssh and scp, started from the Galaxy server. This tells me something is being copied over to the node, but I am not sure what, or to where. After a while, the job status changes to "W":

qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
68.ngsgalaxy01            ...xy@idtdna.com galaxy                 0 W batch

Here is what I see in the log when the job is sent:
>>>>>>>>>>>>
galaxy.jobs DEBUG 2013-03-15 15:34:54,183 (341) Working directory for job is: /usr/local/galaxy/galaxy-dist/database/job_working_directory/000/341
galaxy.jobs.handler DEBUG 2013-03-15 15:34:54,183 dispatching job 341 to pbs runner
galaxy.jobs.handler INFO 2013-03-15 15:34:54,231 (341) Job dispatched
galaxy.tools DEBUG 2013-03-15 15:34:54,309 Building dependency shell command for dependency 'samtools'
galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:34:54,391 (341) submitting file /usr/local/galaxy/galaxy-dist/database/pbs/341.sh
galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:34:54,391 (341) command is: PACKAGE_BASE=/usr/local/galaxy/software/samtools/0.1.16; export PACKAGE_BASE; . /usr/local/galaxy/software/samtools/0.1.16/env.sh; samtools flagstat "/usr/local/galaxy/galaxy-dist/database/files/000/dataset_319.dat" > "/usr/local/galaxy/galaxy-dist/database/files/000/dataset_384.dat"
galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:34:54,394 (341) queued in default queue as 70.ngsgalaxy01.idtdna.com
galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:34:54,966 (341/70.ngsgalaxy01.idtdna.com) PBS job state changed from N to R
>>>>>>>>>>>>
Here is the log when the ssh/scp on the node has finished:
>>>>>>>>>>>>>>
galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:37:00,815 (341/70.ngsgalaxy01.idtdna.com) PBS job state changed from R to W
>>>>>>>>>>>>>>
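To follow these transitions across a longer Galaxy log, here is a minimal Python 3 sketch. It assumes only the "PBS job state changed from X to Y" line format shown in the logs above; the names state_changes and TORQUE_STATES are hypothetical, and the state descriptions follow the Torque qstat documentation. "N" is not a Torque state code; it appears to be the placeholder Galaxy's PBS runner uses before its first poll (an assumption based on the log).

```python
import re

# Single-letter job states as documented for Torque's qstat.
# "N" is deliberately absent: it is not a Torque state code.
TORQUE_STATES = {
    "C": "completed",
    "E": "exiting after having run",
    "H": "held",
    "Q": "queued",
    "R": "running",
    "S": "suspended",
    "T": "being moved to a new location",
    "W": "waiting for its execution time",
}

# Matches e.g. "(341/70.ngsgalaxy01.idtdna.com) PBS job state changed from R to W"
STATE_RE = re.compile(
    r"\((?P<galaxy_id>\d+)/(?P<pbs_id>[^)]+)\) PBS job state changed "
    r"from (?P<old>\w) to (?P<new>\w)"
)


def state_changes(log_lines):
    """Yield (galaxy_id, pbs_id, old, new, description) per state-change line."""
    for line in log_lines:
        match = STATE_RE.search(line)
        if match:
            new = match.group("new")
            description = TORQUE_STATES.get(new, "not a Torque state code")
            yield (match.group("galaxy_id"), match.group("pbs_id"),
                   match.group("old"), new, description)
```

Fed the two state-change lines above, it yields ("341", "70.ngsgalaxy01.idtdna.com", "R", "W", "waiting for its execution time") for the second transition. Any iterable of lines works, such as an open log file (its name depends on how Galaxy is started).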
Here is the log when I qdel that job:
>>>>>>>>>>>>>>
galaxy.jobs.runners.pbs WARNING 2013-03-15 15:39:20,016 Exit code was invalid. Using 0.
galaxy.jobs DEBUG 2013-03-15 15:39:20,033 (341) Changing ownership of working directory with: /usr/bin/sudo -E scripts/external_chown_script.py /usr/local/galaxy/galaxy-dist/database/job_working_directory/000/341 galaxy 10020
galaxy.jobs ERROR 2013-03-15 15:39:20,071 (341) Failed to change ownership of /usr/local/galaxy/galaxy-dist/database/job_working_directory/000/341, failing
Traceback (most recent call last):
  File "/usr/local/galaxy/galaxy-dist/lib/galaxy/jobs/__init__.py", line 336, in finish
    self.reclaim_ownership()
  File "/usr/local/galaxy/galaxy-dist/lib/galaxy/jobs/__init__.py", line 909, in reclaim_ownership
    self._change_ownership( self.galaxy_system_pwent[0], str( self.galaxy_system_pwent[3] ) )
  File "/usr/local/galaxy/galaxy-dist/lib/galaxy/jobs/__init__.py", line 895, in _change_ownership
    assert p.returncode == 0
AssertionError
galaxy.datatypes.metadata DEBUG 2013-03-15 15:39:20,160 Cleaning up external metadata files
galaxy.jobs.runners.pbs WARNING 2013-03-15 15:39:20,172 Unable to cleanup: [Errno 2] No such file or directory: '/usr/local/galaxy/galaxy-dist/database/pbs/341.o'
galaxy.jobs.runners.pbs WARNING 2013-03-15 15:39:20,173 Unable to cleanup: [Errno 2] No such file or directory: '/usr/local/galaxy/galaxy-dist/database/pbs/341.e'
galaxy.jobs.runners.pbs WARNING 2013-03-15 15:39:20,173 Unable to cleanup: [Errno 2] No such file or directory: '/usr/local/galaxy/galaxy-dist/database/pbs/341.ec'
10.7.10.201 - - [15/Mar/2013:15:39:22 -0500] "GET /api/histories/5a1cff6882ddb5b2 HTTP/1.0" 200 - "http://10.7.10.31/history" "Mozilla/5.0 (Windows NT 5.2; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0"
10.7.10.201 - - [15/Mar/2013:15:39:22 -0500] "GET /api/histories/5a1cff6882ddb5b2/contents?ids=bbbfa414ae315caf HTTP/1.0" 200 - "http://10.7.10.31/history" "Mozilla/5.0 (Windows NT 5.2; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0"
>>>>>>>>>>>>>>>>
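The traceback only records that the ownership command returned a non-zero exit code, not what the script printed. One way to see the actual error is to rerun the logged command by hand. Below is a minimal Python 3.7+ sketch for that. The helper name run_and_report and the CHOWN_ARGV list are hypothetical; the argument values are copied from the log line for job 341. It assumes the command is run as the galaxy user from the galaxy-dist directory, because the script path in the log is relative.

```python
import subprocess


def run_and_report(argv, cwd=None, timeout=60):
    """Run argv and return (returncode, stderr text) instead of only asserting."""
    proc = subprocess.run(
        argv,
        cwd=cwd,
        capture_output=True,  # collect stdout/stderr so the error can be printed
        text=True,
        timeout=timeout,      # raises subprocess.TimeoutExpired rather than hanging,
                              # e.g. if sudo stops at a password prompt
    )
    return proc.returncode, proc.stderr.strip()


# Hypothetical invocation mirroring the logged command for job 341.
CHOWN_ARGV = [
    "/usr/bin/sudo", "-E", "scripts/external_chown_script.py",
    "/usr/local/galaxy/galaxy-dist/database/job_working_directory/000/341",
    "galaxy", "10020",
]
# rc, err = run_and_report(CHOWN_ARGV, cwd="/usr/local/galaxy/galaxy-dist")
# print(rc, err)
```

Typing the same command directly in a shell gives the same information. The sketch just captures the exit code and stderr together, which makes them easy to paste into a reply.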
Is there anything I am not doing, or doing wrong?

Regards,