On 8/8/12 4:06 PM, "Nate Coraor" <nate@bx.psu.edu> wrote:
On Aug 8, 2012, at 2:30 PM, Karger, Amir wrote:
Meanwhile, we're able to restart, and we get happy log messages from the job runner and the two web "servers" (two servers running on different ports of a Tomcat host). And I can do an upload, which runs locally. But when I try to run a blast, which is supposed to submit to the cluster (and ran just fine on our old install), it hangs and never starts. I would think the database is working OK, since it shows me new history items when I upload and so on. The web Galaxy log shows that I went to the tool page, and then has many requests to root/history_item_updates, but nothing else. The job handler Galaxy log has nothing since the PID messages from when the server last started up.
A quick search of the archives didn't find anything obvious. (I don't have any obvious words to search for.) Any thoughts about where I should start looking to track this down?
Hi Amir,
If you aren't setting job_manager and job_handlers in your config, each server will consider itself both the manager and a handler. If a server isn't actually configured to run jobs, this can result in jobs never starting. I'd suggest explicitly defining a manager and handlers.
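For reference, a minimal sketch of what that configuration might look like (the server names manager0, handler0, and handler1 are placeholders; they would need to match the [server:...] sections you actually run, and the exact option names should be checked against your Galaxy version's sample config):

```ini
# Hypothetical example -- server names must match your [server:...] sections
job_manager = manager0
job_handlers = handler0,handler1
```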
--nate
Sigh. We have both job_manager and job_handlers set to the same server.

It seems like our runner app may be getting into some kind of sleeping state. I was unable to upload a file, which had worked before. However, when I restarted the runner, it picked up the upload job and successfully uploaded it, AND it picked up the previously queued tab2fasta job, which I believe it completed successfully too. (There's an error due to a missing file type, which I guess makes stderr non-empty and makes Galaxy think the job was unsuccessful. But I can confirm that the job did in fact run on our cluster.) Running paster.py ... --status claims that the process is still running.

So what would make the runner go to "sleep" like that, and how do I stop it from happening?

Thanks,
-Amir
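Not an answer to Amir's question, but one way a process can look alive to --status yet never pick up work: if a handler loop blocks forever on an in-process wakeup signal, it sleeps past jobs that arrive by another route (e.g. written straight to the database by a different server). The sketch below is purely illustrative, not Galaxy's actual code; it shows how waiting with a timeout and re-polling avoids the permanent sleep.

```python
import queue
import threading
import time

# Illustrative sketch (assumed behavior, not Galaxy's real job handler):
# jobs_in_db stands in for a shared jobs table; wakeup is an in-memory
# nudge that other *processes* cannot see.
jobs_in_db = []
wakeup = queue.Queue()

def handler_loop(stop_event, dispatched):
    while not stop_event.is_set():
        try:
            # Wake on an in-process nudge OR after a short timeout.
            # A plain blocking get() here would hang forever whenever a
            # job arrives via the database instead of this queue.
            wakeup.get(timeout=0.1)
        except queue.Empty:
            pass  # timeout expired: fall through and poll anyway
        while jobs_in_db:
            dispatched.append(jobs_in_db.pop(0))

stop = threading.Event()
dispatched = []
t = threading.Thread(target=handler_loop, args=(stop, dispatched))
t.start()

jobs_in_db.append("tab2fasta")  # enqueued "externally": no nudge is sent
time.sleep(0.5)                 # give the loop a few poll cycles
stop.set()
t.join()
print(dispatched)               # the timed poll still found the job
```

With a plain blocking wakeup.get() the thread would stay asleep indefinitely after the last in-process nudge, which is roughly the symptom described: the process is running, but nothing dispatches until a restart forces a fresh poll.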