Job Control in Galaxy
Hi,

We have configured our Galaxy system to use Sun Grid Engine in shared mode. We noticed that on submission of a job, running qstat does not show the job's status. It looks like a wrapper shell script defining the environment and the command(s) is submitted, and then as far as SGE is concerned the job is done (however, the job is still running on the application server).

I don't understand how you can manage submitted jobs when using this system. Isn't the point of using a queuing system like SGE that you can manage the load on the servers using the SGE configuration tool (qconf) to control the queues? Or am I misunderstanding something?

Thanks for any advice,
Steve
On 29/04/2010 12:09, Steve Taylor wrote:
Hi,
We have configured our Galaxy system to use Sun Grid Engine in shared mode. We noticed that on submission of a job, running qstat does not show the job's status. It looks like a wrapper shell script defining the environment and the command(s) is submitted, and then as far as SGE is concerned the job is done (however, the job is still running on the application server).
I don't understand how you can manage submitted jobs when using this system. Isn't the point of using a queuing system like SGE that you can manage the load on the servers using the SGE configuration tool (qconf) to control the queues? Or am I misunderstanding something?
Ok. Looks like I *am* misunderstanding something. Scrub that last email! :-)

I was running some SAM-BAM conversions but I was getting confused, since my jobs may have actually finished (and so are not visible in the queue) but the Galaxy interface is still showing them as running. So it looks like the problem may be with our samtools config rather than a more general one. More digging required...

Steve
Steve Taylor wrote:
Ok. Looks like I *am* misunderstanding something. Scrub that last email! :-)
I was running some SAM-BAM conversions but I was getting confused, since my jobs may have actually finished (and so are not visible in the queue) but the Galaxy interface is still showing them as running. So it looks like the problem may be with our samtools config rather than a more general one. More digging required...
Depending on how big those files are, setting metadata may be what's causing the lag between the actual command finishing and Galaxy marking the job as finished. You may want to set:

set_metadata_externally = True

in the config to force this to happen out on the cluster.

--nate
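If it helps, this is roughly where the option sits in universe_wsgi.ini; the section name and placement here are just an example from a stock config, so adjust to your local file:

[app:main]
# Set metadata on the cluster node via the external metadata script instead of
# inside the Galaxy server process (example placement only).
set_metadata_externally = True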
Steve
Hi Nate,
Ok. Looks like I *am* misunderstanding something. Scrub that last email! :-)
I was running some SAM-BAM conversions but I was getting confused, since my jobs may have actually finished (and so are not visible in the queue) but the Galaxy interface is still showing them as running. So it looks like the problem may be with our samtools config rather than a more general one. More digging required...
Depending on how big those files are, setting metadata may be what's causing the lag between the actual command finishing and Galaxy marking the job as finished. You may want to set:
set_metadata_externally = True
Ok. Thanks. Will try that.
In the config to force this to happen out on the cluster.
Trying another samtool... in the Galaxy History the job is still showing as running (yellow background), but it is finished according to SGE and there appears to be some output produced.

galaxy.jobs.runners.sge DEBUG 2010-04-29 15:32:26,051 (136) submitting file /wwwdata/galaxy-dist/database/pbs/galaxy_136.sh
galaxy.jobs.runners.sge DEBUG 2010-04-29 15:32:26,052 (136) command is: python /wwwdata/galaxy-dist/tools/samtools/sam2interval.py --input_sam_file=/wwwdata/galaxy-dist/database/files/000/dataset_153.dat -p > /wwwdata/galaxy-dist/database/files/000/dataset_210.dat

Any ideas?

Steve
Steve Taylor wrote:
Trying another samtool... in the Galaxy History the job is still showing as running (yellow background), but it is finished according to SGE and there appears to be some output produced.
galaxy.jobs.runners.sge DEBUG 2010-04-29 15:32:26,051 (136) submitting file /wwwdata/galaxy-dist/database/pbs/galaxy_136.sh
galaxy.jobs.runners.sge DEBUG 2010-04-29 15:32:26,052 (136) command is: python /wwwdata/galaxy-dist/tools/samtools/sam2interval.py --input_sam_file=/wwwdata/galaxy-dist/database/files/000/dataset_153.dat -p > /wwwdata/galaxy-dist/database/files/000/dataset_210.dat
That's only the initial step - you should see quite a few more with the SGE job id and state changes, and a "job finished" message once complete.

--nate
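A quick way to follow a single job is to grep the server log for its Galaxy job number; the log path below is only an example, use wherever your instance writes its output:

# follow Galaxy job 136 through submission, SGE state changes and completion
tail -f /path/to/your/galaxy.log | grep '(136'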
Any ideas?
Steve
On 29/04/2010 15:43, Nate Coraor wrote:
Steve Taylor wrote:
Trying another samtool... in the Galaxy History the job is still showing as running (yellow background), but it is finished according to SGE and there appears to be some output produced.
galaxy.jobs.runners.sge DEBUG 2010-04-29 15:32:26,051 (136) submitting file /wwwdata/galaxy-dist/database/pbs/galaxy_136.sh
galaxy.jobs.runners.sge DEBUG 2010-04-29 15:32:26,052 (136) command is: python /wwwdata/galaxy-dist/tools/samtools/sam2interval.py --input_sam_file=/wwwdata/galaxy-dist/database/files/000/dataset_153.dat -p > /wwwdata/galaxy-dist/database/files/000/dataset_210.dat
That's only the initial step - you should see quite a few more with the SGE job id and state changes, and a "job finished" message once complete.
Yes it says:

galaxy.jobs.runners.sge DEBUG 2010-04-29 15:33:33,133 (136/1221) state change: job finished normally

but in Galaxy it is still showing it as running.

Steve
On 29/04/2010 15:54, Steve Taylor wrote:
On 29/04/2010 15:43, Nate Coraor wrote:
Steve Taylor wrote:
Trying another samtool... in the Galaxy History the job is still showing as running (yellow background), but it is finished according to SGE and there appears to be some output produced.
galaxy.jobs.runners.sge DEBUG 2010-04-29 15:32:26,051 (136) submitting file /wwwdata/galaxy-dist/database/pbs/galaxy_136.sh
galaxy.jobs.runners.sge DEBUG 2010-04-29 15:32:26,052 (136) command is: python /wwwdata/galaxy-dist/tools/samtools/sam2interval.py --input_sam_file=/wwwdata/galaxy-dist/database/files/000/dataset_153.dat -p > /wwwdata/galaxy-dist/database/files/000/dataset_210.dat
That's only the initial step - you should see quite a few more with the SGE job id and state changes, and a "job finished" message once complete.
Yes it says:
galaxy.jobs.runners.sge DEBUG 2010-04-29 15:33:33,133 (136/1221) state change: job finished normally
but in Galaxy it is still showing it as running.
Ok. It has finally completed. Maybe I am being too impatient :-). I'll run a SAM-BAM test overnight to see if it really is just a big lag!

Steve
Steve Taylor wrote:
Yes it says:
galaxy.jobs.runners.sge DEBUG 2010-04-29 15:33:33,133 (136/1221) state change: job finished normally
but in Galaxy it is still showing it as running.
After this you should also see 'job 136 ended'. This means that Galaxy has finished all post-processing.

If this never happens, please try enabling this option:

use_heartbeat = True

This will create two files:

heartbeat.log
heartbeat.log.nonsleeping

Watching the latter will show what the threads are up to when this is happening. I suspect it's setting metadata, but this should tell us for sure.

--nate
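A rough way to keep an eye on it while a job is stuck (the files appear in whatever directory Galaxy was started from, so run this from there):

# watch the thread dump as the stuck job sits in the 'running' state
tail -f heartbeat.log.nonsleeping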
Steve
"use_heartbeat = True" is set. No heartbeat log files are generated after restarting Galaxy service and running a queue job. Zong-Pei On Thu, 29 Apr 2010, Nate Coraor wrote: Date: Thu, 29 Apr 2010 16:03:59 +0100 From: Nate Coraor <nate@bx.psu.edu> To: Stephen Taylor <stephen.taylor@imm.ox.ac.uk> Cc: "galaxy-dev@bx.psu.edu" <galaxy-dev@bx.psu.edu>, "zph@herald.ox.ac.uk" <zph@herald.ox.ac.uk> Subject: Re: [galaxy-dev] Job Control in Galaxy Steve Taylor wrote:
Yes it says:
galaxy.jobs.runners.sge DEBUG 2010-04-29 15:33:33,133 (136/1221) state change: job finished normally
but in Galaxy it is still showing it as running.
After this you should also see 'job 136 ended'. This means that Galaxy has finished all post-processing.

If this never happens, please try enabling this option:

use_heartbeat = True

This will create two files:

heartbeat.log
heartbeat.log.nonsleeping

Watching the latter will show what the threads are up to when this is happening. I suspect it's setting metadata, but this should tell us for sure.

--nate
Steve
zong-pei.han@imm.ox.ac.uk wrote:
"use_heartbeat = True" is set. No heartbeat log files are generated after restarting Galaxy service and running a queue job.
I'm not sure how this can occur. They should be created in Galaxy's root directory since this is the working directory when Galaxy runs.

--nate
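If Galaxy is started from an init script or some other directory, they will end up in that directory instead. A minimal sketch, assuming you use the stock run.sh start script (adjust the path to your install):

cd /path/to/galaxy    # make the Galaxy root the working directory
sh run.sh             # heartbeat.log and heartbeat.log.nonsleeping should then appear here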
Zong-Pei
On Thu, 29 Apr 2010, Nate Coraor wrote:
Steve Taylor wrote:
Yes it says:
galaxy.jobs.runners.sge DEBUG 2010-04-29 15:33:33,133 (136/1221) state change: job finished normally
but in Galaxy it is still showing it as running.
After this you should also see 'job 136 ended'. This means that Galaxy has finished all post-processing.
If this never happens, please try enabling this option:
use_heartbeat = True
This will create two files:
heartbeat.log heartbeat.log.nonsleeping
Watching the latter will show what the threads are up to when this is happening. I suspect it's setting metadata, but this should tell us for sure.
--nate
Steve
Hi Nate,

Our local Galaxy root directory is /wwwdata/galaxy, and I had to hardwire the filename

/wwwdata/galaxy/log/heartbeat.log

in /wwwdata/galaxy/lib/galaxy/util/heartbeat.py for the log files to be created.

With "use_heartbeat = True", Galaxy is now no longer getting job running or finished status - it just sits there saying "Job is waiting to run". In fact the queued jobs are finished! I have put the log files below for you to look at:

http://sara.molbiol.ox.ac.uk/userweb/zph/galaxy.log
http://sara.molbiol.ox.ac.uk/userweb/zph/heartbeat.log
http://sara.molbiol.ox.ac.uk/userweb/zph/heartbeat.log.nonsleeping

Regards,
Zong-Pei

On Thu, 29 Apr 2010, Nate Coraor wrote:

zong-pei.han@imm.ox.ac.uk wrote:
"use_heartbeat = True" is set. No heartbeat log files are generated after restarting Galaxy service and running a queue job.
I'm not sure how this can occur. They should be created in Galaxy's root directory since this is the working directory when Galaxy runs.

--nate

[cut]
zong-pei.han@imm.ox.ac.uk wrote:
Hi Nate,
Our local Galaxy root directory is /wwwdata/galaxy, and I had to hardwire the filename
/wwwdata/galaxy/log/heartbeat.log
in /wwwdata/galaxy/lib/galaxy/util/heartbeat.py for log files to be created.
With "use_heartbeat = True", now Galaxy no longer getting job running or finished status - just sits there saying "Job is waiting to run". In fact the queued jobs are finished! I have put log files below for you to look at:
http://sara.molbiol.ox.ac.uk/userweb/zph/galaxy.log
http://sara.molbiol.ox.ac.uk/userweb/zph/heartbeat.log
http://sara.molbiol.ox.ac.uk/userweb/zph/heartbeat.log.nonsleeping
Galaxy is setting metadata on one of the datasets, which is causing other job operations to move very slowly. I suspect it's job 141 - could this job have been created before you enabled set_metadata_externally?

--nate
Regards, Zong-Pei
On Thu, 29 Apr 2010, Nate Coraor wrote:
zong-pei.han@imm.ox.ac.uk wrote:
"use_heartbeat = True" is set. No heartbeat log files are generated after restarting Galaxy service and running a queue job.
I'm not sure how this can occur. They should be created in Galaxy's root directory since this is the working directory when Galaxy runs.
--nate
[cut]
Steve Taylor wrote:
Hi,
We have configured our Galaxy system to use Sun Grid Engine in shared mode. We noticed that on submission of a job, running qstat does not show the job's status. It looks like a wrapper shell script defining the environment and the command(s) is submitted, and then as far as SGE is concerned the job is done (however, the job is still running on the application server).
I don't understand how you can manage submitted jobs when using this system. Isn't the point of using a queuing system like SGE that you can manage the load on the servers using the SGE configuration tool (qconf) to control the queues? Or am I misunderstanding something?
Hi Steve,

The job should indeed run on the node and be visible to qstat. Can you provide more details, and paste the output of all of the steps you're describing? If the job is submitted to the SGE queue, it should not run locally on the application server. What does this look like when it happens?

Thanks,
--nate
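For example, something like the following while the job still shows as running in Galaxy (the user name is only a guess - use whichever account Galaxy submits jobs as, and the job id from Galaxy's log):

qstat -u galaxy           # all pending/running jobs for the submitting user
qstat -j <sge_job_id>     # full details for one job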
Thanks for any advice,
Steve
participants (3)
- Nate Coraor
- Steve Taylor
- zong-pei.han@imm.ox.ac.uk