Hi all,

first of all, thanks to the Galaxy Team for this really useful software. I don't really know whether my problem is related to Galaxy or to Torque/Maui, but I didn't find a solution on either the Torque or the Maui user lists, so I hope that some of you with more experience can give me some good advice.

I'm trying to set up Galaxy in a small local virtual environment in order to test it. I started with 2 virtual Ubuntu servers called galaxy1 and galaxy2. On galaxy1 I successfully installed Galaxy, Apache, Torque and Maui. I'm using Postgres as the DBMS; it is installed on another "real" DB server. The virtual server galaxy2 is used as a node. Galaxy works like a charm locally, but problems arise when I try to use Torque. Torque alone works correctly: I can submit a job with qsub and everything works. The 2 virtual servers (galaxy1 and galaxy2) share a directory (through NFS) in which I installed Galaxy following the "unified method" from the documentation.

So, as I said, Galaxy alone works and Torque/Maui alone works, but when I put the two together nothing works. As a test I upload a GFF file (using the local runner). Then I try to filter the GFF using "Filter and sort -> Extract features". When I run this tool, the corresponding job in the Torque queue stays in the Hold state forever.

Here is some output from the diagnostic programs. diagnose -j reports the following:

    Name  State  Par  Proc  QOS  WCLimit  R  Min  User    Group   Account  QueuedTime  Network  Opsys   Arch    Mem  Disk  Procs  Class      Features
    29    Hold   DEF  1     DEF  1:00:00  0  1    galaxy  galaxy  -        00:02:36    [NONE]   [NONE]  [NONE]  >=0  >=0   NC0    [batch:1]  [NONE]

The showq command reports:

    ACTIVE JOBS--------------------
    JOBNAME  USERNAME  STATE  PROC  REMAINING  STARTTIME

    0 Active Jobs    0 of 1 Processors Active (0.00%)

    IDLE JOBS----------------------
    JOBNAME  USERNAME  STATE  PROC  WCLIMIT  QUEUETIME

    0 Idle Jobs

    BLOCKED JOBS----------------
    JOBNAME  USERNAME  STATE  PROC  WCLIMIT  QUEUETIME
    29       galaxy    Hold   1     1:00:00  Wed May 4 03:56:40

checkjob reports:

    checking job 29
    State: Hold
    Creds: user:galaxy group:galaxy class:batch qos:DEFAULT
    WallTime: 00:00:00 of 1:00:00
    SubmitTime: Wed May 4 03:56:40
    (Time Queued Total: 00:03:07 Eligible: 00:00:01)

qstat -f reports:

    Job Id: 33.galaxy1.research.intra.ismaa.it
    Job_Name = 27_Extract_features1_marco.moretto@iasma.it
    Job_Owner = galaxy@galaxy1.research.intra.ismaa.it
    job_state = W
    queue = batch
    server = galaxy1.research.intra.ismaa.it
    ctime = Wed May 4 04:56:36 2011
    Error_Path = galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/pbs/27.e
    exec_host = galaxy2/0
    exec_port = 15003
    Execution_Time = Wed May 4 05:26:41 2011
    mtime = Wed May 4 04:56:37 2011
    Output_Path = galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/pbs/27.o
    qtime = Wed May 4 04:56:36 2011
    Resource_List.neednodes = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 01:00:00
    stagein = /mnt/equallogic1/galaxy/tmp/dataset_18.dat@galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/files/000/dataset_18.dat,
        /mnt/equallogic1/galaxy/tmp/dataset_30.dat@galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/files/000/dataset_30.dat
    stageout = /mnt/equallogic1/galaxy/galaxy-dist/database/files/000/dataset_30.dat@galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/files/000/dataset_30.dat
    substate = 37
    Variable_List = PBS_O_QUEUE=batch,
        PBS_O_HOST=galaxy1.research.intra.ismaa.it
    euser = galaxy
    egroup = galaxy
    hashname = 33.galaxy1.research.intra.ismaa.it
    queue_rank = 33
    queue_type = E

    StartDate: -00:03:06 Wed May 4 03:56:41
    Total Tasks: 1
    Req[0] TaskCount: 1 Partition: DEFAULT
    Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
    Opsys: [NONE] Arch: [NONE] Features: [NONE]
    IWD: [NONE] Executable: [NONE]
    Bypass: 0 StartCount: 1
    PartitionMask: [ALL]
    PE: 1.00 StartPriority: 1
    cannot select job 29 for partition DEFAULT (non-idle state 'Hold')

And finally, tracejob reports:

    /var/spool/torque/server_priv/accounting/20110504: Permission denied
    /var/spool/torque/mom_logs/20110504: No such file or directory
    /var/spool/torque/sched_logs/20110504: No such file or directory

    Job: 33.galaxy1.research.intra.ismaa.it

    05/04/2011 04:56:36 S enqueuing into batch, state 1 hop 1
    05/04/2011 04:56:36 S Job Queued at request of galaxy@galaxy1.research.intra.ismaa.it, owner = galaxy@galaxy1.research.intra.ismaa.it, job name = 27_Extract_features1_marco.moretto@iasma.it, queue = batch
    05/04/2011 04:56:37 S Job Run at request of galaxy@galaxy1.research.intra.ismaa.it
    05/04/2011 04:56:41 S Email 's' to galaxy@galaxy1.research.intra.ismaa.it failed: Child process 'sendmail -f adm galaxy@galaxy1.research.intra.ismaa.it' returned 127 (errno 10:No child processes)

The only thing that is clear to me is that after submission the scheduler puts the job in a Hold state, but I cannot understand why. I also tried running the Galaxy-generated sh script with qsub. This is the relevant line from the Galaxy log:

    galaxy.jobs.runners.pbs DEBUG 2011-05-04 04:56:36,748 (27) submitting file /mnt/equallogic1/galaxy/galaxy-dist/database/pbs/27.sh

I copied the 27.sh script and ran it with the command:

    qsub 27.sh

and the job ran correctly. So what is not clear to me is whether the problem is related to Torque/Maui or to the way the job is submitted from Galaxy.

Sorry for the very long e-mail and thank you very much for any help.

--- Marco
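When comparing a job submitted by Galaxy with one submitted by hand, it can help to pull a few attributes (for example job_state, stagein and stageout) out of the qstat -f text rather than reading the whole dump. The following is only a minimal sketch: the function name parse_qstat_full is hypothetical, and it assumes the usual qstat -f layout in which wrapped values continue on lines that start with a tab.

```python
def parse_qstat_full(text):
    """Parse a single job record from `qstat -f` into a dict.

    Assumption: attribute lines look like '    name = value' and long
    values are wrapped onto continuation lines that start with a tab.
    """
    attrs = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Job Id:"):
            attrs["Job Id"] = line.split(":", 1)[1].strip()
            last_key = None
        elif " = " in line and not raw.startswith("\t"):
            key, value = line.split(" = ", 1)
            attrs[key] = value
            last_key = key
        elif last_key is not None:
            # Continuation of a wrapped value: glue it back on.
            attrs[last_key] += line
    return attrs


def staging_attributes(attrs):
    """Return only the stagein/stageout attributes, if any are present."""
    return {k: attrs[k] for k in ("stagein", "stageout") if k in attrs}


# Shortened sample built from the output quoted above.
SAMPLE = (
    "Job Id: 33.galaxy1.research.intra.ismaa.it\n"
    "    job_state = W\n"
    "    queue = batch\n"
    "    stagein = /mnt/equallogic1/galaxy/tmp/dataset_18.dat@galaxy1:/mnt/equallog\n"
    "\tic1/galaxy/galaxy-dist/database/files/000/dataset_18.dat\n"
    "    substate = 37\n"
)

if __name__ == "__main__":
    job = parse_qstat_full(SAMPLE)
    print(job["job_state"])
    print(staging_attributes(job))
```

Running the same function on the record of a job submitted with plain qsub 27.sh would show whether that job carries staging attributes as well.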
Hi Marco,

Thanks for all of the details, they make a big difference when troubleshooting. Sorry for the delay in response.

Can you ensure that you don't have any of the pbs_* options set in universe_wsgi.ini? I noticed that there are stagein/stageouts set on the job.

You may also need to set $usecp in mom_priv/config on your execution hosts to prevent pbs_mom from trying to rcp/scp the error and output files back to galaxy1. $usecp instructs pbs_mom to consider /path on the execution host to be the same filesystem as on the submission host, e.g.:

    $usecp *:/mnt/equallogic1 /mnt/equallogic1

--nate
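To make the effect of the $usecp line concrete, here is a rough sketch of the mapping it describes: a "host:/path" destination such as the job's Output_Path is matched against the rule and, if it matches, rewritten to a local path that can be copied with plain cp. This only illustrates the idea and is not pbs_mom's actual code; the function names are hypothetical, and using fnmatch for the host wildcard is an assumption.

```python
from fnmatch import fnmatch


def parse_usecp(line):
    """Split '$usecp host_pattern:/src_prefix /dst_prefix' into its parts."""
    directive, mapping, dst_prefix = line.split()
    if directive != "$usecp":
        raise ValueError("not a $usecp line: %r" % line)
    host_pattern, src_prefix = mapping.split(":", 1)
    return host_pattern, src_prefix, dst_prefix


def local_path_for(destination, usecp_line):
    """Map 'host:/path' to a local path if the rule applies, else None.

    Assumption: host wildcards behave like shell globs (fnmatch), and a
    rule applies when the path lies under the rule's source prefix.
    """
    host, path = destination.split(":", 1)
    host_pattern, src_prefix, dst_prefix = parse_usecp(usecp_line)
    src = src_prefix.rstrip("/")
    if fnmatch(host, host_pattern) and (path == src or path.startswith(src + "/")):
        return dst_prefix.rstrip("/") + path[len(src):]
    return None  # no match: the file would be copied remotely (rcp/scp)


RULE = "$usecp *:/mnt/equallogic1 /mnt/equallogic1"
OUTPUT_PATH = "galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/pbs/27.o"

if __name__ == "__main__":
    print(local_path_for(OUTPUT_PATH, RULE))
```

With the rule from the reply, the Output_Path and Error_Path shown by qstat -f both fall under /mnt/equallogic1, so under these assumptions they would resolve to paths on the shared NFS mount.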
Participants (2): Marco Moretto, Nate Coraor