Re: [galaxy-dev] DRMAA runner weirdness

11 Jan 2013

Hello,
Can you please post the link to this patch? I do not see it in the mail 
thread, and I too have noticed some issues with DRMAA job running since 
updating to the Oct. 23rd distribution. I don't know yet if it is related, 
but I'd like to try the patch to see. I have two local instances of Galaxy 
(prod and dev). On my dev instance (which is fully up to date), when I run 
the same job multiple times, sometimes it finishes and sometimes it dies; 
this is independent of which node it runs on. My prod instance is still at 
the Oct. 03 distribution and does not experience this problem, so I am 
afraid to update our production instance. 

Thanks in advance,
Liisa

From:   Kyle Ellrott <kellrott@soe.ucsc.edu>
To:     Nate Coraor <nate@bx.psu.edu>
Cc:     "galaxy-dev@lists.bx.psu.edu" <galaxy-dev@lists.bx.psu.edu>
Date:   10/01/2013 07:44 PM
Subject:        Re: [galaxy-dev] DRMAA runner weirdness
Sent by:        galaxy-dev-bounces@lists.bx.psu.edu

I did a merge of galaxy-central that included the patch you posted 
today. The scheduling problem seems to have gone away, although I'm still 
getting back 'Job output not returned from cluster' for errors. This seems 
odd, as the system previously would output stderr correctly.

Kyle

On Thu, Jan 10, 2013 at 8:30 AM, Nate Coraor <nate@bx.psu.edu> wrote:
On Jan 9, 2013, at 12:18 AM, Kyle Ellrott wrote:
...
I'm running a test Galaxy system on a cluster (merged galaxy-dist on 
January 4th), and I've noticed some odd behavior from the DRMAA job 
runner.
I'm running a multithread system, one web server, one job_manager, and 
three job_handlers. DRMAA is the default job runner (the command for 
tophat2 is drmaa://-V -l mem_total=7G -pe smp 2/), with SGE 6.2u5 being 
the engine underneath.
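
[Editor's sketch] In a runner URL like the one quoted above, the text between "drmaa://" and the trailing "/" is the native option string handed to the DRM (here, SGE). The following is a minimal Python illustration of that split. The function name native_spec_from_url is hypothetical, and this is not claimed to be how Galaxy itself parses the URL.

```python
def native_spec_from_url(url):
    """Return the native DRM options embedded in a drmaa:// URL, or None."""
    prefix = "drmaa://"
    if not url.startswith(prefix):
        raise ValueError("not a drmaa:// URL: %r" % url)
    # Drop the scheme, then the single trailing slash that ends the options.
    spec = url[len(prefix):]
    if spec.endswith("/"):
        spec = spec[:-1]
    # An empty option string means "no native options".
    return spec or None


if __name__ == "__main__":
    print(native_spec_from_url("drmaa://-V -l mem_total=7G -pe smp 2/"))
```

Seeing the extracted string on its own can help when comparing it against what qsub would accept on the command line.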
My test involves trying to run three different Tophat2 jobs. The first 
two seem to start up (and get put on the SGE queue), but the third stays 
grey, with the job manager listing it in state 'new' with command line 
'None'. It doesn't seem to leave this state. Both of the jobs that 
actually got onto the queue die (reasons unknown, but much too early, 
probably some tophat/bowtie problem), but one job is listed in error state 
with stderr as 'Job output not returned from cluster', while the other job 
(which is no longer in the SGE queue) is still listed as running.

Hi Kyle,

It sounds like there are a bunch of issues here.  Do you have any limits set 
as to the number of concurrent jobs allowed?  If not, you may need to add 
a bit of debugging information to the manager or handler code to figure 
out why the 'new' job is not being dispatched for execution.

For the 'error' job, more information about output collection should be 
available from the Galaxy server log.  If you have general SGE problems 
this may not be Galaxy's fault.  You do need to make sure that the 
stdout/stderr files are able to be properly copied back to the Galaxy 
server upon job completion.
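
[Editor's sketch] One way to check the point above is to look, from the Galaxy server, at the stdout/stderr files the cluster job was supposed to write. The Python helper below is a hypothetical illustration, not part of Galaxy. The file paths must be supplied by the reader, since the naming scheme is not given in this thread.

```python
import os


def check_output_files(paths):
    """Map each path to a short status: missing, not readable, empty, or ok."""
    status = {}
    for path in paths:
        if not os.path.exists(path):
            status[path] = "missing"
        elif not os.access(path, os.R_OK):
            status[path] = "not readable"
        elif os.path.getsize(path) == 0:
            # An empty stderr file is normal for a job that succeeded.
            status[path] = "empty"
        else:
            status[path] = "ok"
    return status
```

A "missing" result for both files after the job has left the queue would suggest the files never made it back to a directory the Galaxy server can see, for example because that directory is not shared with the compute nodes.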

For the 'running' job, make sure you've got 'set_metadata_externally = 
True' in your Galaxy config.
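
[Editor's sketch] To confirm that the setting above is actually in effect, the ini-style config text can be parsed with Python's standard configparser module. The section name "app:main" and the helper name metadata_set_externally are assumptions made for illustration; adjust the section to match your own config file.

```python
import configparser


def metadata_set_externally(ini_text, section="app:main"):
    """Return True if set_metadata_externally is enabled in the given ini text."""
    # RawConfigParser avoids '%' interpolation surprises in other options.
    parser = configparser.RawConfigParser()
    parser.read_string(ini_text)
    # Falls back to False when the section or the option is absent.
    return parser.getboolean(section, "set_metadata_externally", fallback=False)


if __name__ == "__main__":
    sample = "[app:main]\nset_metadata_externally = True\n"
    print(metadata_set_externally(sample))
```

To check a real file, read its contents with open(path).read() and pass the resulting string to the function.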

--nate
...
Any ideas?
Kyle
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
http://lists.bx.psu.edu/

Re: [galaxy-dev] DRMAA runner weirdness

Liisa Koski