Galaxy run.sh process crashing
Hi,

We have been running Galaxy for the last year without issue on our website, where it runs short scripts (< 10 seconds). However, we just upgraded to the latest Galaxy source and set up the tools necessary for NGS mapping and assembly. Now our Galaxy instance keeps crashing while jobs are running, and it seems to happen only when jobs have been running for a long(ish) amount of time.

I've already converted the database to use PostgreSQL. The paster.log doesn't have any informative error messages. Any suggestions on how to fix or further troubleshoot this issue would be appreciated.

Thank you,
-Hans
Hi Hans,

What happens when you run such a long(ish) job on the command line (executed as the same user the Galaxy server is running as)?

Hans-Rudolf
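As a rough sketch, that test might look like the following (the job script path is taken from the process listing later in this thread; the exact job ID and path will differ per job):

    # re-run the job script that Galaxy generated, as the galaxy user
    sudo -u galaxy /bin/sh /Users/galaxy/galaxy-dist/database/job_working_directory/002/2180/galaxy_2180.sh
    # an exit status above 128 usually means the tool was killed by a signal
    echo $?

If the script completes cleanly outside of Galaxy, the tool itself is probably fine and the problem is more likely in the Galaxy server process.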
Hi,

Are you running your jobs via a scheduler? Maybe your scheduler is killing your jobs after X minutes?

Ciao,
Bjoern
Hi Bjorn and Hans,

We are running Galaxy on our local webserver, so there is no job scheduler. Instead, we are using the LocalJobRunner configuration in job_conf.xml:

    <?xml version="1.0"?>
    <job_conf>
        <plugins>
            <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner" workers="8"/>
            <plugin id="multilocal" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner" workers="2"/>
        </plugins>
        <destinations default="local">
            <destination id="local" runner="local"/>
            <destination id="multicore6" runner="multilocal">
                <param id="local_slots">6</param>
            </destination>
        </destinations>
        <tools>
            <tool id="bowtie2" destination="multicore6" />
            <tool id="spades" destination="multicore6" />
            <tool id="bbmap_1" destination="multicore6" />
            <tool id="iuc_pear" destination="multicore6" />
            <tool id="abyss-pe" destination="multicore6" />
            <tool id="fastq_groomer_parallel" destination="multicore6" />
        </tools>
        <handlers>
            <handler id="main"/>
        </handlers>
    </job_conf>

Also, we normally run Galaxy in daemon mode, but recently, to help debug this issue, we have been running it in interactive mode in a screen session.

@Hans - The processes run under the galaxy user and everything seems to run fine. I am trying to get a concrete example, but run.sh usually crashes during our trimming/assembly steps for NGS data. Sometimes these workflows run to completion, but sometimes they crash the run.sh process. When run.sh crashes, the individual tool programs keep running as the galaxy user, and we cannot restart Galaxy until we manually kill those running processes. Here is what the running processes look like on the system:

    galaxy 21162 99.4 0.0 2432784 616 s014 R 2:08PM 1425:10.57 seqtk seq -q 0 -X 255 -l 0 -Q 33 -s 11 -f 1.0 -L 0 -1 /Users/galaxy/data_galaxy/test_BACs/all-merged.interleaved.fq
    galaxy 21159 99.4 0.0 2432784 616 s014 R 2:08PM 1425:16.72 seqtk seq -q 0 -X 255 -l 0 -Q 33 -s 11 -f 1.0 -L 0 -2 /Users/galaxy/data_galaxy/test_BACs/all-merged.interleaved.fq
    galaxy 21118 92.5 0.5 3015188 367852 s014 R+ 2:07PM 791:21.54 python ./scripts/paster.py serve universe_wsgi.ini
    galaxy 21160 0.0 0.0 2433640 1044 s014 S 2:08PM 0:00.01 /bin/sh /Users/galaxy/galaxy-dist/database/job_working_directory/002/2180/galaxy_2180.sh
    galaxy 21157 0.0 0.0 2433640 1044 s014 S 2:08PM 0:00.01 /bin/sh /Users/galaxy/galaxy-dist/database/job_working_directory/002/2179/galaxy_2179.sh
    galaxy 21113 0.0 0.0 2433640 1000 s014 S+ 2:07PM 0:00.00 sh run.sh

Thank you for the help,
-Hans
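A rough sketch of how those leftover processes could be found and cleaned up before restarting, assuming nothing else important runs under the galaxy account (pgrep/pkill are available on OS X as well as Linux):

    # list everything still running as the galaxy user
    ps aux | grep '^galaxy '
    # kill the leftover job wrapper scripts (they have the job working directory in their command line) ...
    pkill -u galaxy -f job_working_directory
    # ... and the tool processes they started (seqtk in the listing above)
    pkill -u galaxy seqtk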
Hello Devs,

Now that the Galaxy process has crashed, I'll send you more information regarding the logs.

    galaxy.jobs.runners.local DEBUG 2015-08-04 17:13:20,303 (2337) executing job script: /Users/galaxy/galaxy-dist/database/job_working_directory/002/2337/galaxy_2337.sh
    galaxy.jobs DEBUG 2015-08-04 17:13:20,487 (2337) Persisting job destination (destination id: multicore6)
    galaxy.jobs DEBUG 2015-08-04 17:13:25,181 (2468) Working directory for job is: /Users/galaxy/galaxy-dist/database/job_working_directory/002/2468
    galaxy.jobs.handler DEBUG 2015-08-04 17:13:25,190 (2468) Dispatching to local runner
    galaxy.jobs DEBUG 2015-08-04 17:13:25,466 (2468) Persisting job destination (destination id: local)
    galaxy.jobs.runners DEBUG 2015-08-04 17:13:25,485 Job [2468] queued (295.181 ms)
    galaxy.jobs.handler INFO 2015-08-04 17:13:25,490 (2468) Job dispatched
    galaxy.jobs DEBUG 2015-08-04 17:13:25,538 (2469) Working directory for job is: /Users/galaxy/galaxy-dist/database/job_working_directory/002/2469
    galaxy.jobs.handler DEBUG 2015-08-04 17:13:25,545 (2469) Dispatching to local runner
    galaxy.jobs DEBUG 2015-08-04 17:13:25,777 (2469) Persisting job destination (destination id: local)
    galaxy.jobs.runners DEBUG 2015-08-04 17:13:25,798 Job [2469] queued (252.823 ms)
    galaxy.jobs.handler INFO 2015-08-04 17:13:25,803 (2469) Job dispatched
    galaxy.jobs.runners.local DEBUG 2015-08-04 17:14:09,922 execution finished: /Users/galaxy/galaxy-dist/database/job_working_directory/002/2337/galaxy_2337.sh
    galaxy.datatypes.metadata DEBUG 2015-08-04 17:14:10,476 loading metadata from file for: HistoryDatasetAssociation 5117
    galaxy.jobs.runners.local DEBUG 2015-08-04 17:14:10,589 execution finished: /Users/galaxy/galaxy-dist/database/job_working_directory/002/2318/galaxy_2318.sh
    galaxy.datatypes.metadata DEBUG 2015-08-04 17:14:10,919 loading metadata from file for: HistoryDatasetAssociation 5116
    galaxy.datatypes.metadata DEBUG 2015-08-04 17:14:11,500 loading metadata from file for: HistoryDatasetAssociation 5115
    galaxy.datatypes.metadata DEBUG 2015-08-04 17:14:11,519 loading metadata from file for: HistoryDatasetAssociation 5084
    galaxy.datatypes.metadata DEBUG 2015-08-04 17:14:12,270 loading metadata from file for: HistoryDatasetAssociation 5083
    galaxy.datatypes.metadata DEBUG 2015-08-04 17:14:12,804 loading metadata from file for: HistoryDatasetAssociation 5082
    galaxy.jobs INFO 2015-08-04 17:14:13,406 Collecting job metrics for <galaxy.model.Job object at 0x117e5b4d0>
    galaxy.jobs DEBUG 2015-08-04 17:14:13,537 job 2337 ended (finish() executed in (3614.250 ms))
    galaxy.datatypes.metadata DEBUG 2015-08-04 17:14:13,580 Cleaning up external metadata files
    galaxy.jobs INFO 2015-08-04 17:14:14,001 Collecting job metrics for <galaxy.model.Job object at 0x11b03b410>
    galaxy.jobs DEBUG 2015-08-04 17:14:14,068 job 2318 ended (finish() executed in (3449.205 ms))
    galaxy.datatypes.metadata DEBUG 2015-08-04 17:14:14,112 Cleaning up external metadata files
    galaxy.jobs DEBUG 2015-08-04 17:14:19,625 (2319) Working directory for job is: /Users/galaxy/galaxy-dist/database/job_working_directory/002/2319
    galaxy.jobs.handler DEBUG 2015-08-04 17:14:19,630 (2319) Dispatching to local runner
    galaxy.jobs DEBUG 2015-08-04 17:14:20,000 (2319) Persisting job destination (destination id: local)
    galaxy.jobs.runners DEBUG 2015-08-04 17:14:20,018 Job [2319] queued (387.377 ms)
    galaxy.jobs.handler INFO 2015-08-04 17:14:20,021 (2319) Job dispatched
    galaxy.jobs DEBUG 2015-08-04 17:14:20,061 (2320) Working directory for job is: /Users/galaxy/galaxy-dist/database/job_working_directory/002/2320
    galaxy.jobs.handler DEBUG 2015-08-04 17:14:20,067 (2320) Dispatching to local runner
    galaxy.jobs DEBUG 2015-08-04 17:14:20,254 (2320) Persisting job destination (destination id: local)
    galaxy.jobs.runners DEBUG 2015-08-04 17:14:20,268 Job [2320] queued (201.349 ms)
    galaxy.jobs.handler INFO 2015-08-04 17:14:20,272 (2320) Job dispatched
    galaxy.jobs DEBUG 2015-08-04 17:14:20,696 (2338) Working directory for job is: /Users/galaxy/galaxy-dist/database/job_working_directory/002/2338
    galaxy.jobs.handler DEBUG 2015-08-04 17:14:20,701 (2338) Dispatching to local runner
    galaxy.jobs DEBUG 2015-08-04 17:14:20,891 (2338) Persisting job destination (destination id: local)
    galaxy.jobs.runners DEBUG 2015-08-04 17:14:20,941 Job [2338] queued (239.991 ms)
    galaxy.jobs.handler INFO 2015-08-04 17:14:20,945 (2338) Job dispatched
    galaxy.jobs DEBUG 2015-08-04 17:14:20,982 (2339) Working directory for job is: /Users/galaxy/galaxy-dist/database/job_working_directory/002/2339
    galaxy.jobs.handler DEBUG 2015-08-04 17:14:20,989 (2339) Dispatching to local runner
    galaxy.jobs DEBUG 2015-08-04 17:14:21,456 (2339) Persisting job destination (destination id: local)
    galaxy.jobs.runners DEBUG 2015-08-04 17:14:21,468 Job [2339] queued (478.726 ms)
    galaxy.jobs.handler INFO 2015-08-04 17:14:21,472 (2339) Job dispatched

ERROR LINE:

    run.sh: line 81: 21118 Abort trap: 6 python ./scripts/paster.py serve $GALAXY_CONFIG_FILE $@

When trying to restart run.sh, this is what we see in the log:

    Starting server in PID 57249.
    Traceback (most recent call last):
      File "./scripts/paster.py", line 37, in <module>
        serve.run()
      File "/Users/galaxy/galaxy-dist/lib/galaxy/util/pastescript/serve.py", line 1049, in run
        invoke(command, command_name, options, args[1:])
      File "/Users/galaxy/galaxy-dist/lib/galaxy/util/pastescript/serve.py", line 1055, in invoke
        exit_code = runner.run(args)
      File "/Users/galaxy/galaxy-dist/lib/galaxy/util/pastescript/serve.py", line 220, in run
        result = self.command()
      File "/Users/galaxy/galaxy-dist/lib/galaxy/util/pastescript/serve.py", line 670, in command
        serve()
      File "/Users/galaxy/galaxy-dist/lib/galaxy/util/pastescript/serve.py", line 654, in serve
        server(app)
      File "/Users/galaxy/galaxy-dist/lib/galaxy/util/pastescript/loadwsgi.py", line 292, in server_wrapper
        **context.local_conf)
      File "/Users/galaxy/galaxy-dist/lib/galaxy/util/pastescript/loadwsgi.py", line 97, in fix_call
        val = callable(*args, **kw)
      File "/Users/galaxy/galaxy-dist/eggs/Paste-1.7.5.1-py2.7.egg/paste/httpserver.py", line 1342, in server_runner
        serve(wsgi_app, **kwargs)
      File "/Users/galaxy/galaxy-dist/eggs/Paste-1.7.5.1-py2.7.egg/paste/httpserver.py", line 1291, in serve
        request_queue_size=request_queue_size)
      File "/Users/galaxy/galaxy-dist/eggs/Paste-1.7.5.1-py2.7.egg/paste/httpserver.py", line 1134, in __init__
        request_queue_size=request_queue_size)
      File "/Users/galaxy/galaxy-dist/eggs/Paste-1.7.5.1-py2.7.egg/paste/httpserver.py", line 1113, in __init__
        request_queue_size=request_queue_size)
      File "/Users/galaxy/galaxy-dist/eggs/Paste-1.7.5.1-py2.7.egg/paste/httpserver.py", line 360, in __init__
        HTTPServer.__init__(self, server_address, RequestHandlerClass)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/SocketServer.py", line 408, in __init__
        self.server_bind()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/BaseHTTPServer.py", line 108, in server_bind
        SocketServer.TCPServer.server_bind(self)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/SocketServer.py", line 419, in server_bind
        self.socket.bind(self.server_address)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 224, in meth
        return getattr(self._sock,name)(*args)
    socket.error: [Errno 48] Address already in use
    galaxy.jobs.handler INFO 2015-08-04 17:19:32,844 sending stop signal to worker thread
    galaxy.jobs.handler INFO 2015-08-04 17:19:32,845 job handler queue stopped
    galaxy.jobs.runners INFO 2015-08-04 17:19:32,845 LocalRunner: Sending stop signal to 2 worker threads
    galaxy.jobs.runners INFO 2015-08-04 17:19:32,857 LocalRunner: Sending stop signal to 8 worker threads
    galaxy.jobs.handler INFO 2015-08-04 17:19:32,899 sending stop signal to worker thread
    galaxy.jobs.handler INFO 2015-08-04 17:19:32,912 job handler stop queue stopped

After we kill the running Galaxy tool processes, we are able to successfully start run.sh again. We migrated from a previous version, so could this be caused by outdated eggs or .xml files?

Thank you,
-Hans
participants (3)
- Björn Grüning
- Hans Vasquez-Gross
- Hans-Rudolf Hotz