I have closely followed the instructions for setting up a local cluster (http://wiki.galaxyproject.org/Admin/Config/Performance/Cluster). Frankly, the "Galaxy Configuration" section was not clear to me. I am not sure whether those outlined steps should be applied to the server's universe_wsgi.ini or to the nodes'. So I might have overlooked some steps there, but here is a summary of what I am doing on the server:

 

All my nodes have been configured so the galaxy user can ssh/scp between nodes without a password. The galaxy user also has sudo privileges on the server.
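
For reference, this is roughly how I verified that (the node name is just an example from my setup):

  # from the server, as the galaxy user; both should complete without a password prompt
  ssh node01 hostname
  scp /tmp/testfile node01:/tmp/

  # confirm the galaxy user can use sudo on the server
  sudo -l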

Torque has been configured and tested between the nodes, so the cluster itself is working fine; a trivial submission runs to completion, as shown below.
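
  # submit a trivial job as the galaxy user and watch it finish
  echo hostname | qsub
  qstat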

 

In the universe_wsgi.ini file on the server node:

--------------------------

start_job_runners = pbs,drmaa

 

drmaa_external_runjob_script = ~

drmaa_external_killjob_script = ~

external_chown_script = ~

 

pbs_application_server = galaxyhost   (the server)

pbs_stage_path = /tmp/galaxy_stage/

pbs_dataset_server = galaxyhost   (the server; same as pbs_application_server)

 

Also: ln -s /nfsexport/galaxy_stage /usr/local/galaxy/galaxy-dist/database/tmp

 

 

outputs_to_working_directory = False   (if I change this to True, Galaxy will not start)

---------------------------------------------
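
Putting it together, the relevant block of my universe_wsgi.ini looks roughly like this. The script paths below are the sample defaults from universe_wsgi.ini.sample; I am not certain they are what the wiki intends, so please correct me if these should point elsewhere:

  start_job_runners = pbs,drmaa
  drmaa_external_runjob_script = scripts/drmaa_external_runner.py
  drmaa_external_killjob_script = scripts/drmaa_external_killer.py
  external_chown_script = scripts/external_chown_script.py
  pbs_application_server = galaxyhost
  pbs_stage_path = /tmp/galaxy_stage/
  pbs_dataset_server = galaxyhost
  outputs_to_working_directory = False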

 

After restarting Galaxy on the server node, the job seems to be submitted and its status is "R". When I "top" the processes on the node the job was sent to, I see two processes, ssh and scp, started from the galaxy server. This tells me something is being copied over to the node, but I am not sure what, or to where.
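
To try to see what was being staged, I ran something like this on the compute node while the job was still in "R" (the staging path comes from my pbs_stage_path setting):

  # show the ssh/scp processes and their full arguments
  ps -ef | grep -E 'ssh|scp'

  # check whether anything has landed in the staging area
  ls -l /tmp/galaxy_stage/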

 

After a while, the job status changed to "W".

 

qstat

Job id                    Name             User            Time Use S Queue

------------------------- ---------------- --------------- -------- - -----

68.ngsgalaxy01             ...xy@idtdna.com galaxy                 0 W batch

 

 

Here is what I see in the log when the job is submitted:

>>>>>>>>>>>>>>>>>> 

galaxy.jobs DEBUG 2013-03-15 15:34:54,183 (341) Working directory for job is: /usr/local/galaxy/galaxy-dist/database/job_working_directory/000/341

galaxy.jobs.handler DEBUG 2013-03-15 15:34:54,183 dispatching job 341 to pbs runner

galaxy.jobs.handler INFO 2013-03-15 15:34:54,231 (341) Job dispatched

galaxy.tools DEBUG 2013-03-15 15:34:54,309 Building dependency shell command for dependency 'samtools'

galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:34:54,391 (341) submitting file /usr/local/galaxy/galaxy-dist/database/pbs/341.sh

galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:34:54,391 (341) command is: PACKAGE_BASE=/usr/local/galaxy/software/samtools/0.1.16; export PACKAGE_BASE; . /usr/local/galaxy/software/samtools/0.1.16/env.sh; samtools flagstat "/usr/local/galaxy/galaxy-dist/database/files/000/dataset_319.dat" > "/usr/local/galaxy/galaxy-dist/database/files/000/dataset_384.dat"

galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:34:54,394 (341) queued in default queue as 70.ngsgalaxy01.idtdna.com

galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:34:54,966 (341/70.ngsgalaxy01.idtdna.com) PBS job state changed from N to R

>>>>>>>>>>>>>>>>>> 

 

Here is the log entry from when the ssh/scp processes on the node finished:

 

>>>>>>>>>>>>>>>>>>>> 

galaxy.jobs.runners.pbs DEBUG 2013-03-15 15:37:00,815 (341/70.ngsgalaxy01.idtdna.com) PBS job state changed from R to W

>>>>>>>>>>>>>>>>>>>> 

 

Here is the log from when I qdel that job:

 

>>>>>>>>>>>>>>>>>>>> 

galaxy.jobs.runners.pbs WARNING 2013-03-15 15:39:20,016 Exit code  was invalid. Using 0.

galaxy.jobs DEBUG 2013-03-15 15:39:20,033 (341) Changing ownership of working directory with: /usr/bin/sudo -E scripts/external_chown_script.py /usr/local/galaxy/galaxy-dist/database/job_working_directory/000/341 galaxy 10020

galaxy.jobs ERROR 2013-03-15 15:39:20,071 (341) Failed to change ownership of /usr/local/galaxy/galaxy-dist/database/job_working_directory/000/341, failing

Traceback (most recent call last):

  File "/usr/local/galaxy/galaxy-dist/lib/galaxy/jobs/__init__.py", line 336, in finish

    self.reclaim_ownership()

  File "/usr/local/galaxy/galaxy-dist/lib/galaxy/jobs/__init__.py", line 909, in reclaim_ownership

    self._change_ownership( self.galaxy_system_pwent[0], str( self.galaxy_system_pwent[3] ) )

  File "/usr/local/galaxy/galaxy-dist/lib/galaxy/jobs/__init__.py", line 895, in _change_ownership

    assert p.returncode == 0

AssertionError

galaxy.datatypes.metadata DEBUG 2013-03-15 15:39:20,160 Cleaning up external metadata files

galaxy.jobs.runners.pbs WARNING 2013-03-15 15:39:20,172 Unable to cleanup: [Errno 2] No such file or directory: '/usr/local/galaxy/galaxy-dist/database/pbs/341.o'

galaxy.jobs.runners.pbs WARNING 2013-03-15 15:39:20,173 Unable to cleanup: [Errno 2] No such file or directory: '/usr/local/galaxy/galaxy-dist/database/pbs/341.e'

galaxy.jobs.runners.pbs WARNING 2013-03-15 15:39:20,173 Unable to cleanup: [Errno 2] No such file or directory: '/usr/local/galaxy/galaxy-dist/database/pbs/341.ec'

10.7.10.201 - - [15/Mar/2013:15:39:22 -0500] "GET /api/histories/5a1cff6882ddb5b2 HTTP/1.0" 200 - "http://10.7.10.31/history" "Mozilla/5.0 (Windows NT 5.2; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0"

10.7.10.201 - - [15/Mar/2013:15:39:22 -0500] "GET /api/histories/5a1cff6882ddb5b2/contents?ids=bbbfa414ae315caf HTTP/1.0" 200 - "http://10.7.10.31/history" "Mozilla/5.0 (Windows NT 5.2; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0"

 >>>>>>>>>>>>>>>>>>>>>>
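
In case it is relevant to the chown failure above, the sudoers entry I added for the galaxy user is along these lines. The SETENV tag matches the "sudo -E" call shown in the log, but the exact path is from my install and I may have it wrong:

  galaxy ALL = (root) NOPASSWD: SETENV: /usr/local/galaxy/galaxy-dist/scripts/external_chown_script.py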

 

Is there anything I am missing or doing wrong?

 

 

Regards,