[MXS-2217] maxscale crash with signal 11 Created: 2018-12-10  Updated: 2020-08-25  Resolved: 2019-01-17

Status: Closed
Project: MariaDB MaxScale
Component/s: Core
Affects Version/s: 2.3.2
Fix Version/s: 2.3.3

Type: Bug Priority: Critical
Reporter: Rick Pizzi Assignee: markus makela
Resolution: Fixed Votes: 2
Labels: None

Sprint: MXS-SPRINT-72, MXS-SPRINT-73

 Description   

MaxScale crashes with signal 11 every 3-5 minutes.

MariaDB MaxScale  /var/log/maxscale/maxscale.log  Mon Dec 10 09:04:16 2018
----------------------------------------------------------------------------
2018-12-10 09:04:16   notice : syslog logging is enabled.
2018-12-10 09:04:16   notice : maxlog logging is enabled.
2018-12-10 09:04:16   notice : Using up to 5.91GiB of memory for query classifier cache
2018-12-10 09:04:16   notice : Working directory: /var/log/maxscale
2018-12-10 09:04:16   notice : The collection of SQLite memory allocation statistics turned off.
2018-12-10 09:04:16   notice : Threading mode of SQLite set to Multi-thread.
2018-12-10 09:04:16   notice : MariaDB MaxScale 2.3.2 started (Commit: 1126c687a4570f60ee26a163520198a3263ccbbd)
2018-12-10 09:04:16   notice : MaxScale is running in process 5649
2018-12-10 09:04:16   notice : Configuration file: /etc/maxscale.cnf
2018-12-10 09:04:16   notice : Log directory: /var/log/maxscale
2018-12-10 09:04:16   notice : Data directory: /var/lib/maxscale
2018-12-10 09:04:16   notice : Module directory: /usr/lib64/maxscale
2018-12-10 09:04:16   notice : Service cache: /var/cache/maxscale
2018-12-10 09:04:16   notice : No query classifier specified, using default 'qc_sqlite'.
2018-12-10 09:04:16   notice : Loaded module qc_sqlite: V1.0.0 from /usr/lib64/maxscale/libqc_sqlite.so
2018-12-10 09:04:16   notice : Query classification results are cached and reused. Memory used per thread: 2.95GiB
2018-12-10 09:04:16   notice : The systemd watchdog is Enabled. Internal timeout = 30s
2018-12-10 09:04:16   notice : Loading /etc/maxscale.cnf.
2018-12-10 09:04:16   notice : Loaded module MariaDBBackend: V2.0.0 from /usr/lib64/maxscale/libmariadbbackend.so
2018-12-10 09:04:16   notice : Loaded module MariaDBClient: V1.1.0 from /usr/lib64/maxscale/libmariadbclient.so
2018-12-10 09:04:16   notice : Initializing statement-based read/write split router module.
2018-12-10 09:04:16   notice : Loaded module readwritesplit: V1.1.0 from /usr/lib64/maxscale/libreadwritesplit.so
2018-12-10 09:04:16   notice : Initialise the MariaDB Monitor module.
2018-12-10 09:04:16   notice : Loaded module mariadbmon: V1.5.0 from /usr/lib64/maxscale/libmariadbmon.so
2018-12-10 09:04:16   notice : Loaded module MySQLBackendAuth: V1.0.0 from /usr/lib64/maxscale/libmysqlbackendauth.so
2018-12-10 09:04:16   notice : Loaded module MySQLAuth: V1.1.0 from /usr/lib64/maxscale/libmysqlauth.so
2018-12-10 09:04:16   notice : Housekeeper thread started.
2018-12-10 09:04:16   notice : Using encrypted passwords. Encryption key: '/var/lib/maxscale/.secrets'.
2018-12-10 09:04:16   notice : Loaded server states from journal file: /var/lib/maxscale/DatabaseMonitor/monitor.dat
2018-12-10 09:04:16   notice : Starting a total of 1 services...
2018-12-10 09:04:16   notice : [SplitterService] Loaded 67 MySQL users for listener SplitterListener.
2018-12-10 09:04:16   notice : Listening for connections at [::]:3306 with protocol MySQL
2018-12-10 09:04:16   notice : Service 'SplitterService' started (1/1)
2018-12-10 09:04:16   notice : Started REST API on [127.0.0.1]:8989
2018-12-10 09:04:16   notice : MaxScale started with 2 worker threads, each with a stack size of 8388608 bytes.
2018-12-10 09:04:55   alert  : Fatal: MaxScale 2.3.2 received fatal signal 11. Attempting backtrace.
2018-12-10 09:04:55   alert  : Commit ID: 1126c687a4570f60ee26a163520198a3263ccbbd System name: Linux Release string: NAME="CentOS Linux"
2018-12-10 09:04:55   alert  :   /usr/bin/maxscale(_ZN7maxbase15dump_stacktraceESt8functionIFvPKcS2_EE+0x2b) [0x40cbeb]: /home/vagrant/MaxScale/maxutils/maxbase/src/stacktrace.cc:130
2018-12-10 09:04:55   alert  :   /usr/bin/maxscale(_ZN7maxbase15dump_stacktraceEPFvPKcS1_E+0x4e) [0x40cf4e]: /usr/include/c++/4.8.2/functional:2029
2018-12-10 09:04:55   alert  :   /usr/bin/maxscale() [0x4095c9]: ??:0
2018-12-10 09:04:55   alert  :   /lib64/libpthread.so.0(+0xf6d0) [0x7f922de0a6d0]: sigaction.c:?
2018-12-10 09:04:55   alert  :   /usr/lib64/maxscale/libmysqlcommon.so.2.0.0(mysql_protocol_done+0x12) [0x7f9226d17c72]: /home/vagrant/MaxScale/server/modules/protocol/MySQL/mysql_common.cc:79
2018-12-10 09:04:55   alert  :   /usr/lib64/maxscale/libmariadbclient.so(+0x3371) [0x7f9226b0a371]: /home/vagrant/MaxScale/server/modules/protocol/MySQL/mariadbclient/mysql_client.cc:1497
2018-12-10 09:04:55   alert  :   /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(dcb_final_close+0x488) [0x7f922e535e08]: /home/vagrant/MaxScale/server/core/dcb.cc:1237
2018-12-10 09:04:55   alert  :   /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN8maxscale13RoutingWorker14delete_zombiesEv+0x34) [0x7f922e56e7a4]: /usr/include/c++/4.8.2/bits/stl_vector.h:734
2018-12-10 09:04:55   alert  :   /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN8maxscale13RoutingWorker10epoll_tickEv+0x26) [0x7f922e56f016]: /home/vagrant/MaxScale/server/core/routingworker.cc:641
2018-12-10 09:04:55   alert  :   /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0x326) [0x7f922e590676]: /home/vagrant/MaxScale/maxutils/maxbase/src/worker.cc:762
2018-12-10 09:04:55   alert  :   /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x51) [0x7f922e590701]: /home/vagrant/MaxScale/maxutils/maxbase/src/worker.cc:545
2018-12-10 09:04:55   alert  :   /usr/bin/maxscale(main+0x2019) [0x4087f9]: /home/vagrant/MaxScale/server/core/gateway.cc:2260
2018-12-10 09:04:55   alert  :   /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f922bd07445]: ??:?
2018-12-10 09:04:55   alert  :   /usr/bin/maxscale() [0x409022]: ??:0

Config:

[maxscale]
threads=auto
 
[DatabaseMonitor]
type=monitor
servers=mariadb1,mariadb2,mariadb3
user=maxscale
password=redacted
monitor_interval=100
detect_standalone_master=true
module=mariadbmon
replication_user=repli
replication_password=redacted
journal_max_age=28800
script_timeout=90
 
[SplitterService]
type=service
router=readwritesplit
servers=mariadb1,mariadb2,mariadb3
user=maxscale
password=redacted
use_sql_variables_in=master
#detect_replication_lag=1
#max_slave_replication_lag=1
 
[SplitterListener]
type=listener
service=SplitterService
protocol=MariaDBClient
port=3306
 
[mariadb1]
type=server
address=mariadb-prod-1.customer.com
port=3306
protocol=MariaDBBackend
 
[mariadb2]
type=server
address=mariadb-prod-2.customer.com
port=3306
protocol=MariaDBBackend
 
[mariadb3]
type=server
address=mariadb-prod-3.customer.com
port=3306
protocol=MariaDBBackend



 Comments   
Comment by Rick Pizzi [ 2018-12-11 ]

No, it doesn't seem to have triggered.

Comment by Todd Stoffel (Inactive) [ 2018-12-21 ]

@rpizzi, this might be related to the customer having only 2 CPUs. Under heavy workload the maxscale daemon would crash and restart. I had them upgrade their EC2 instance to 8 CPUs, and the issue seems to be resolved for now. However, this will need a long-term fix in the code.

Comment by Wagner Bianchi (Inactive) [ 2018-12-21 ]

toddstoffel, if we are talking about the same customer/environment, the current version is no longer 2.3.2, as I downgraded it back to 2.2.18-1. I worked on the same environment as Pizzi, and after getting all these crashes the next step was to downgrade. So I think we should also consider that when assessing stability.

Comment by Todd Stoffel (Inactive) [ 2018-12-22 ]

bianchi, they had to change that right back because the avro router does not work properly in 2.2.18.

Comment by Johan Wikman [ 2019-01-17 ]

Although we have not been able to reproduce this directly (we have, with some explicit modifications), we have now fixed a race condition that we are quite confident is the cause.

While client connections can be accepted by any thread, the thread actually used for handling the client traffic is allocated in a round-robin fashion, which requires some inter-thread communication. In some situations it was possible that by the time the actual thread started dealing with the connection, the connection had already been closed and an associated data structure deleted.

The fix now in place makes that scenario impossible.

Generated at Thu Feb 08 04:12:32 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.