Re: [galaxy-dev] ./run.sh segfault

20 Jul 2015

Hi All,

I'm still having this issue despite several attempts to resolve it.
I've booted it on an 80GB VM; there are no users on it and only 1 or
2 tools installed from the tool shed. I have loaded around 150 fasta.gz
files into a couple of data libraries, which are on an NFS share. When
Galaxy starts it has a 57GB RAM footprint. If I leave it and do
nothing, around 5 minutes after I start Galaxy something kicks in and
starts consuming all the RAM, and then it segfaults.

root@galaxy:~# top
top - 10:01:34 up 20 min,  2 users,  load average: 0.84, 0.49, 0.43
Tasks: 180 total,   1 running, 179 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.4 us,  0.2 sy,  0.0 ni, 87.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.3 st
KiB Mem:  81295232 total, 58937820 used, 22357408 free,    13508 buffers
KiB Swap:  8640508 total,    69940 used,  8570568 free.    86132 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2867 galaxy    20   0 57.856g 0.054t  11460 S 101.6 71.4   1:38.15 python

This is what I get in syslog when it crashes like this.

Jul 20 09:50:25 galaxy kernel: [  569.351158] show_signal_msg: 18 callbacks suppressed
Jul 20 09:50:25 galaxy kernel: [  569.351168] python[1883]: segfault at 24 ip 0000000000558077 sp 00007fc5cb9e6400 error 6 in python2.7[400000+2bc000]
Jul 20 09:50:25 galaxy kernel: [  569.444890] Core dump to |/usr/share/apport/apport 1409 11 0 1409 pipe failed
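
As an aid to reading the segfault line above: on x86, the "error" number in that kernel message is the page-fault error code, which is a bit field. Below is a minimal sketch of decoding its three low bits. The function name decode_segfault_error is made up for illustration and is not part of Galaxy or any library.

```python
# Decode the low bits of an x86 page-fault error code, as printed in
# kernel "segfault at ADDR ... error N" messages.
# decode_segfault_error is a hypothetical helper for illustration only.

def decode_segfault_error(code):
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 0x1 else "page not present",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 0x2 else "read",
        # bit 2: 0 = kernel mode, 1 = user mode
        "mode": "user" if code & 0x4 else "kernel",
    }

info = decode_segfault_error(6)
fault_address = int("24", 16)  # the kernel prints the faulting address in hex
print(info, hex(fault_address))
```

For "error 6" this gives a user-mode write to an unmapped page. The address 0x24 is a small offset from zero, which is what a write through a NULL pointer tends to look like. That would fit an allocation failing once memory runs out, although on its own it does not prove it.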

If there isn't sufficient memory in the first place (i.e. less than
57GB), I get something more like this:

Jul 16 20:36:41 galaxy kernel: [  117.123921] Out of memory: Kill process 1390 (python) score 986 or sacrifice child
Jul 16 20:36:41 galaxy kernel: [  117.124087] Killed process 1390 (python) total-vm:43496348kB, anon-rss:32611892kB, file-rss:1800kB
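
To put the OOM-killer figures above into more familiar units, here is a small sketch that pulls the kB fields out of the "Killed process" line and converts them to GiB. The helper name oom_sizes_gib and the regular expression are assumptions for illustration. The sample line is copied from the syslog excerpt, with the wrapped line rejoined.

```python
import re

# Sample line copied from the syslog excerpt above (wrapped line rejoined).
LINE = ("Killed process 1390 (python) total-vm:43496348kB, "
        "anon-rss:32611892kB, file-rss:1800kB")

def oom_sizes_gib(line):
    """Return {field name: size in GiB} for each 'name:NNNkB' field.

    Hypothetical helper; it treats the kernel's kB as 1024 bytes.
    """
    fields = re.findall(r"([\w-]+):(\d+)kB", line)
    return {name: int(kb) / (1024 * 1024) for name, kb in fields}

for name, gib in oom_sizes_gib(LINE).items():
    print(f"{name}: {gib:.2f} GiB")
```

The anon-rss figure works out to roughly 31.1 GiB, so on a 32GB VM the python process had taken essentially all of physical memory by the time it was killed.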

I can't see anything in the paster.log.

I'm at a bit of a loss as to where to look for what is causing it. Any
help would be greatly appreciated.

Many thanks,

Martin

On 07/16/2015 08:48 PM, Martin Vickers [mjv08] wrote:
...
Hi Nate,
Thanks for the reply. In syslog I'm getting:
Jul 16 20:36:41 galaxy kernel: [  117.123921] Out of memory: Kill process 1390 (python) score 986 or sacrifice child
...
Jul 16 20:36:41 galaxy kernel: [  117.124087] Killed process 1390 (python) total-vm:43496348kB, anon-rss:32611892kB, file-rss:1800kB
It's a 32GB VM. I could increase it but I wouldn't expect 32GB to be
too little. I've attached the full syslog.
Dr. Martin Vickers
Data Manager/HPC Systems Administrator
Institute of Biological, Environmental and Rural Sciences
IBERS New Building
Aberystwyth University
w: http://www.martin-vickers.co.uk/
e: mjv08@aber.ac.uk
t: 01970 62 2807
-------------------------
*From:* Nate Coraor <nate@bx.psu.edu>
*Sent:* 16 July 2015 04:36 PM
*To:* Martin Vickers [mjv08]
*Cc:* galaxy-dev@lists.galaxyproject.org
*Subject:* Re: [galaxy-dev] ./run.sh segfault
Hi Martin,
Is there anything in the syslog?
--nate
On Thu, Jul 16, 2015 at 11:26 AM, Martin Vickers <mjv08@aber.ac.uk> wrote:
Hi All,
I have a weird issue that's just cropped up. After a new install of
Galaxy (checked out on Monday from GitHub) on an Ubuntu VM, using
Postgres rather than SQLite as well as a few other production
recommendations, I started playing around with the Data Libraries
functionality. I linked a bunch of fastq.gz files into Galaxy (around
150 in total) and everything was working fine. I went home and the
next day, it was down.
I tried to start it up as usual (using an init.d script); it worked
for less than a minute and then disappeared again. So I tried running
it as the galaxy user using ./run.sh and I got a segfault:
Starting server in PID 23173.
serving on http://144.124.110.39:8080
Segmentation fault
Tried again with strace:
Starting server in PID 23552.
serving on http://144.124.110.39:8080
[{WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL}], 0, NULL) = 23552
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=23552,
si_status=SIGKILL, si_utime=1590, si_stime=1930} ---
rt_sigreturn()                          = 23552
write(2, "Killed\n", 7Killed
)                 = 7
read(10, "", 8192)                      = 0
exit_group(137)                         = ?
+++ exited with 137 +++
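
As a reading aid for the strace output above: a shell-style exit status above 128 conventionally means "128 plus the number of the signal that killed the process". Below is a minimal sketch of that decoding, assuming a Linux host. The function name describe_exit_status is hypothetical.

```python
import signal

def describe_exit_status(status):
    """Interpret a shell-style exit status such as the 137 seen above.

    Hypothetical helper; assumes the 128 + signal-number convention.
    """
    if status > 128:
        try:
            sig = signal.Signals(status - 128)
        except ValueError:
            # Not a known signal number on this platform.
            return f"exit status {status} (no matching signal)"
        return f"killed by {sig.name}"
    return f"exited normally with code {status}"

print(describe_exit_status(137))
```

Here 137 is 128 + 9, and signal 9 is SIGKILL. That matches the WTERMSIG(s) == SIGKILL in the strace output, and SIGKILL is the signal the kernel's OOM killer sends.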
I can't see anything odd in the log file and I've turned debugging on
in galaxy.ini. I'm at a bit of a loss. Does anyone know what might be
causing it?
Cheers,
___________________________________________________________
    Please keep all replies on the list by using "reply all"
    in your mail client.  To manage your subscriptions to this
    and other Galaxy lists, please use the interface at:
      https://lists.galaxyproject.org/
To search Galaxy mailing lists use the unified search at:
      http://galaxyproject.org/search/mailinglists/
___________________________________________________________