In our case, someone had installed and started a second development
instance of Galaxy but pointed it at the same database as the first
development instance. So the ids were mixed up, causing some jobs to
crash. Yuck!
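(For reference, the guard against this is giving each instance its own
database_connection in universe_wsgi.ini. A minimal sketch, assuming a
PostgreSQL backend; the connection strings are illustrative:)

    # universe_wsgi.ini for the first dev instance
    database_connection = postgres://galaxy@localhost/galaxy_dev1

    # universe_wsgi.ini for the second dev instance
    database_connection = postgres://galaxy@localhost/galaxy_dev2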
Thanks,
Liisa
From: Nate Coraor <nate@bx.psu.edu>
To: Liisa Koski <liisa.koski@basf.com>
Cc: kellrott@soe.ucsc.edu, "galaxy-dev@lists.bx.psu.edu" <galaxy-dev@lists.bx.psu.edu>, galaxy-dev-bounces@lists.bx.psu.edu
Date: 14/01/2013 10:48 AM
Subject: Re: [galaxy-dev] DRMAA runner weirdness
On Jan 11, 2013, at 9:32 AM, Liisa Koski wrote:
> Hello,
> Can you please post the link to this patch? I do not see it in the
> mail thread, and I too have noticed some issues with the DRMAA job runner
> since updating to the Oct. 23rd distribution. I don't know yet whether it
> is related, but I'd like to try the patch to see. I have two local
> instances of Galaxy (prod and dev). On my dev instance (which is fully up
> to date), when I run the same job multiple times, sometimes it finishes
> and sometimes it dies, independent of which node it runs on. My prod
> instance is still at the Oct. 03 distribution and does not experience
> this problem, so I am afraid to update our production instance.
>
> Thanks in advance,
> Liisa
Hi Liisa,
Here's the one that Kyle is referring to:
https://bitbucket.org/galaxy/galaxy-central/commits/c015b82b3944f967e2c859d5552c00e3e38a2da0
However, this patch should only fix the problem of the server segfaulting
when deleting certain jobs (ones that have not yet been dispatched to the
cluster).
--nate
>
> From: Kyle Ellrott <kellrott@soe.ucsc.edu>
> To: Nate Coraor <nate@bx.psu.edu>
> Cc: "galaxy-dev@lists.bx.psu.edu"
<galaxy-dev@lists.bx.psu.edu>
> Date: 10/01/2013 07:44 PM
> Subject: Re: [galaxy-dev] DRMAA runner
weirdness
> Sent by: galaxy-dev-bounces@lists.bx.psu.edu
>
> I did a merge of galaxy-central that included the patch you posted
> today. The scheduling problem seems to have gone away, although I'm still
> getting back 'Job output not returned from cluster' for errors. This
> seems odd, as the system previously returned stderr correctly.
>
> Kyle
>
>
> On Thu, Jan 10, 2013 at 8:30 AM, Nate Coraor <nate@bx.psu.edu> wrote:
> On Jan 9, 2013, at 12:18 AM, Kyle Ellrott wrote:
>
> > I'm running a test Galaxy system on a cluster (merged galaxy-dist
> > on January 4th), and I've noticed some odd behavior from the DRMAA job
> > runner.
> > I'm running a multithreaded setup: one web server, one job_manager,
> > and three job_handlers. DRMAA is the default job runner (the command for
> > tophat2 is drmaa://-V -l mem_total=7G -pe smp 2/), with SGE 6.2u5 as the
> > engine underneath.
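(For context, that setup would look roughly like the following in
universe_wsgi.ini. A sketch only; the section and option names match that
era's sample config, and the tophat2 URL is the one quoted above:)

    [app:main]
    # start the DRMAA runner and use it as the default for cluster jobs
    start_job_runners = drmaa
    default_cluster_job_runner = drmaa:///

    [galaxy:tool_runners]
    # per-tool override passing SGE-native options through the runner URL
    tophat2 = drmaa://-V -l mem_total=7G -pe smp 2/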
> >
> > My test involves trying to run three different Tophat2 jobs.
> > The first two seem to start up (and get put on the SGE queue), but the
> > third stays grey, with the job manager listing it in state 'new' with
> > command line 'None'. It doesn't seem to leave this state. Both of the
> > jobs that actually got onto the queue die (reasons unknown, but much too
> > early, probably some tophat/bowtie problem), but one job is listed in
> > error state with stderr as 'Job output not returned from cluster', while
> > the other job (which is no longer in the SGE queue) is still listed as
> > running.
>
> Hi Kyle,
>
> It sounds like there are a bunch of issues here. Do you have any
> limits set on the number of concurrent jobs allowed? If not, you
> may need to add a bit of debugging information to the manager or handler
> code to figure out why the 'new' job is not being dispatched for execution.
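(Concurrent-job limits, if any, would be set in universe_wsgi.ini, e.g.
registered_user_job_limit. As for the debugging Nate suggests, here is a
minimal sketch; the helper name and call site are assumptions, since the
exact dispatch code varies by revision:)

    # hypothetical helper to drop into lib/galaxy/jobs/handler.py
    import logging

    log = logging.getLogger(__name__)

    def log_job_state(job):
        # call from the handler's monitor loop to trace jobs stuck in 'new'
        log.debug("job %s: state=%s command_line=%r",
                  job.id, job.state, job.command_line)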
>
> For the 'error' job, more information about output collection should
> be available from the Galaxy server log. If you have general SGE
> problems, this may not be Galaxy's fault. You do need to make sure
> that the stdout/stderr files can be copied back properly to the
> Galaxy server upon job completion.
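(In that era's config, the runner wrote job stdout/stderr under the
cluster files directory, which must live on storage shared between the
nodes and the Galaxy server. A sketch, using the sample-config default:)

    # universe_wsgi.ini
    # must be readable and writable from both the Galaxy server and the
    # cluster nodes, or output collection will fail
    cluster_files_directory = database/pbs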
>
> For the 'running' job, make sure you've got 'set_metadata_externally
> = True' in your Galaxy config.
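(That option lives in the main app section of universe_wsgi.ini; a
one-line sketch:)

    [app:main]
    set_metadata_externally = True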
>
> --nate
>
> >
> > Any ideas?
> >
> >
> > Kyle
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client. To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>
> http://lists.bx.psu.edu/