[MXS-956] Maxscale crash: Removing DCB 0x7fbf94016760 but was in state DCB_STATE_DISCONNECTED which is not legal for a call to dcb_close Created: 2016-10-25  Updated: 2016-12-14  Resolved: 2016-12-14

Status: Closed
Project: MariaDB MaxScale
Component/s: readwritesplit
Affects Version/s: 2.0.1, 2.0.2
Fix Version/s: 2.0.3

Type: Bug Priority: Blocker
Reporter: Marlin Cremers Assignee: markus makela
Resolution: Fixed Votes: 1
Labels: galera
Environment:

root@maxscale1:~# uname -a
Linux maxscale1 4.4.19-1-pve #1 SMP Wed Sep 14 14:33:50 CEST 2016 x86_64 x86_64 x86_64 GNU/Linux
root@maxscale1:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial
root@maxscale1:~# maxscale -v
MaxScale 2.0.1


Issue Links:
Duplicate
is duplicated by MXS-964 Fatal: MaxScale 2.0.1 received fatal ... Closed
Sprint: 2016-21, 2016-22, 2016-23

 Description   

Maxscale crashed with the following error messages in the log:

2016-10-25 05:52:31   error  : 140461134051072 [dcb_close] Error : Removing DCB 0x7fbf94016760 but was in state DCB_STATE_DISCONNECTED which is not legal for a call to dcb_close. 
2016-10-25 05:52:31   error  : 140461125658368 [dcb_close] Error : Removing DCB 0x7fbf94012ca0 but was in state DCB_STATE_DISCONNECTED which is not legal for a call to dcb_close.

Core dump:

2016-10-31 21:13:48   error  : 140245437769472 [dcb_close] Error : Removing DCB 0x13627f0 but was in state DCB_STATE_DISCONNECTED which is not legal for a call to dcb_close. 
2016-10-31 21:13:48   error  : Fatal: MaxScale 2.0.1 received fatal signal 6. Attempting backtrace.
2016-10-31 21:13:48   error  : Commit ID: fa2a66719554d13a00db5c81c5c9ffd5b3a2ce14 System name: Linux Release string: Ubuntu 16.04.1 LTS
2016-10-31 21:13:48   error  :   /usr/bin/maxscale() [0x403ca7] 
2016-10-31 21:13:48   error  :   /lib/x86_64-linux-gnu/libpthread.so.0(+0x113e0) [0x7f8d7ce183e0] 
2016-10-31 21:13:48   error  :   /lib/x86_64-linux-gnu/libpthread.so.0(raise+0x29) [0x7f8d7ce182b9] 
2016-10-31 21:13:48   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(dcb_close+0x108) [0x7f8d7d2c2060] 
2016-10-31 21:13:48   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so(+0x32e7) [0x7f8d75e472e7] 
2016-10-31 21:13:48   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libMySQLClient.so(+0x36b4) [0x7f8d74f6a6b4] 
2016-10-31 21:13:48   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(+0x32ce0) [0x7f8d7d2bfce0] 
2016-10-31 21:13:48   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(+0x32951) [0x7f8d7d2bf951] 
2016-10-31 21:13:48   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(dcb_process_zombies+0x224) [0x7f8d7d2bf75a] 
2016-10-31 21:13:48   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(poll_waitevents+0x73e) [0x7f8d7d2d6484] 
2016-10-31 21:13:48   error  :   /usr/bin/maxscale(worker_thread_main+0x2a) [0x404d8f] 
2016-10-31 21:13:48   error  :   /lib/x86_64-linux-gnu/libpthread.so.0(+0x770a) [0x7f8d7ce0e70a] 
2016-10-31 21:13:48   error  :   /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f8d7c70082d] 

[maxscale]
threads=4
 
[Splitter Service]
type=service
router=readwritesplit
servers=galera1-alb, galera2-alb, galera3-alb
router_options=master_failure_mode=error_on_write
user=maxscale-alb
passwd=*snip*
 
[Splitter Listener]
type=listener
service=Splitter Service
protocol=MySQLClient
port=3306
address=0.0.0.0
 
[galera1-alb]
type=server
address=10.14.0.6
port=3306
protocol=MySQLBackend
priority=1
 
[galera2-alb]
type=server
address=10.14.0.7
port=3306
protocol=MySQLBackend
priority=2
 
[galera3-alb]
type=server
address=10.14.0.8
port=3306
protocol=MySQLBackend
priority=3
 
[Galera Monitor]
type=monitor
module=galeramon
#disable_master_role_setting=true
monitor_interval=500
servers=galera1-alb, galera2-alb, galera3-alb
user=maxscale-alb
passwd=*snip*
 
[CLI]
type=service
router=cli
 
[CLI Listener]
type=listener
service=CLI
protocol=maxscaled
address=localhost
port=6603

root@maxscale1:~# maxadmin -pmariadb show sessions
Session 1 (0x719310)
	State:               Listener Session
	Service:             Splitter Service (0x7056c0)
	Client DCB:          0x7186b0
	Connected:           Tue Oct 25 07:55:33 2016
Session 2 (0x721000)
	State:               Listener Session
	Service:             CLI (0x702910)
	Client DCB:          0x7193b0
	Connected:           Tue Oct 25 07:55:33 2016
Session 25 (0x7f7b58012140)
	State:               Session ready for routing
	Service:             Splitter Service (0x7056c0)
	Client DCB:          0x7f7b60015d10
	Client Address:      powerdns_alb@10.14.0.110
	Connected:           Tue Oct 25 07:55:52 2016
	Idle:                34 seconds
Session 26 (0x7f7b600149f0)
	State:               Session ready for routing
	Service:             Splitter Service (0x7056c0)
	Client DCB:          0x7f7b60014310
	Client Address:      powerdns_alb@10.14.0.110
	Connected:           Tue Oct 25 07:55:52 2016
	Idle:                33 seconds
Session 540 (0x7f7b58013450)
	State:               Session ready for routing
	Service:             CLI (0x702910)
	Client DCB:          0x7f7b58012360
	Client Address:      127.0.0.1
	Connected:           Tue Oct 25 08:41:16 2016
	Idle:                0 seconds
Session 23 (0x7f7b60015f50)
	State:               Session ready for routing
	Service:             Splitter Service (0x7056c0)
	Client DCB:          0x7f7b600157e0
	Client Address:      powerdns_alb@10.14.0.110
	Connected:           Tue Oct 25 07:55:52 2016
	Idle:                2721 seconds
Session 24 (0x7f7b600141a0)
	State:               Session ready for routing
	Service:             Splitter Service (0x7056c0)
	Client DCB:          0x7f7b60016170
	Client Address:      powerdns_alb@10.14.0.110
	Connected:           Tue Oct 25 07:55:52 2016
	Idle:                4 seconds

root@maxscale1:~# maxadmin -pmariadb show servers
Server 0x705120 (galera1-alb)
	Server:                              10.14.0.6
	Status:                              Master, Synced, Running
	Protocol:                            MySQLBackend
	Port:                                3306
	Server Version:                      10.1.18-MariaDB-1~xenial
	Node Id:                             0
	Master Id:                           -1
	Slave Ids:                           
	Repl Depth:                          0
	Server Parameters:
	                                       priority	1
	Number of connections:               531
	Current no. of conns:                3
	Current no. of operations:           0
Server 0x704be0 (galera2-alb)
	Server:                              10.14.0.7
	Status:                              Slave, Synced, Running
	Protocol:                            MySQLBackend
	Port:                                3306
	Server Version:                      10.1.18-MariaDB-1~xenial
	Node Id:                             1
	Master Id:                           -1
	Slave Ids:                           
	Repl Depth:                          0
	Server Parameters:
	                                       priority	2
	Number of connections:               545
	Current no. of conns:                3
	Current no. of operations:           0
Server 0x704700 (galera3-alb)
	Server:                              10.14.0.8
	Status:                              Slave, Synced, Running
	Protocol:                            MySQLBackend
	Port:                                3306
	Server Version:                      10.1.18-MariaDB-1~xenial
	Node Id:                             2
	Master Id:                           -1
	Slave Ids:                           
	Repl Depth:                          0
	Server Parameters:
	                                       priority	3
	Number of connections:               547
	Current no. of conns:                4
	Current no. of operations:           0
  



 Comments   
Comment by Marlin Cremers [ 2016-11-02 ]

I'm getting the following stack trace:

2016-10-31 21:13:48   error  : 140245437769472 [dcb_close] Error : Removing DCB 0x13627f0 but was in state DCB_STATE_DISCONNECTED which is not legal for a call to dcb_close. 
2016-10-31 21:13:48   error  : Fatal: MaxScale 2.0.1 received fatal signal 6. Attempting backtrace.
2016-10-31 21:13:48   error  : Commit ID: fa2a66719554d13a00db5c81c5c9ffd5b3a2ce14 System name: Linux Release string: Ubuntu 16.04.1 LTS
2016-10-31 21:13:48   error  :   /usr/bin/maxscale() [0x403ca7] 
2016-10-31 21:13:48   error  :   /lib/x86_64-linux-gnu/libpthread.so.0(+0x113e0) [0x7f8d7ce183e0] 
2016-10-31 21:13:48   error  :   /lib/x86_64-linux-gnu/libpthread.so.0(raise+0x29) [0x7f8d7ce182b9] 
2016-10-31 21:13:48   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(dcb_close+0x108) [0x7f8d7d2c2060] 
2016-10-31 21:13:48   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so(+0x32e7) [0x7f8d75e472e7] 
2016-10-31 21:13:48   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libMySQLClient.so(+0x36b4) [0x7f8d74f6a6b4] 
2016-10-31 21:13:48   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(+0x32ce0) [0x7f8d7d2bfce0] 
2016-10-31 21:13:48   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(+0x32951) [0x7f8d7d2bf951] 
2016-10-31 21:13:48   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(dcb_process_zombies+0x224) [0x7f8d7d2bf75a] 
2016-10-31 21:13:48   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(poll_waitevents+0x73e) [0x7f8d7d2d6484] 
2016-10-31 21:13:48   error  :   /usr/bin/maxscale(worker_thread_main+0x2a) [0x404d8f] 
2016-10-31 21:13:48   error  :   /lib/x86_64-linux-gnu/libpthread.so.0(+0x770a) [0x7f8d7ce0e70a] 
2016-10-31 21:13:48   error  :   /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f8d7c70082d] 

Comment by markus makela [ 2016-11-04 ]

We've build a package for Ubuntu Xenial for testing purposes. The package can be found here: http://max-tst-01.mariadb.com/ci-repository/2.0-release-nov4/mariadb-maxscale/

The package was built from commit 87e94f6bc6e09d274bba70d62083b1349688ff33.

Comment by markus makela [ 2016-11-04 ]

This package did not fix the problem and the same crash occurred with the same log output. This could point out that it isn't a simple case of wrong backend server reference state but something else.

Comment by markus makela [ 2016-11-05 ]

mcremers
One possible cause for this might be that there's something wrong in the reconnection logic. It would be worth testing if adding disable_sescmd_history=true to the router options would prevents the crash.

Also testing without master_failure_mode=error_on_write would be good to see if the read-only functionality is the root cause of the crash.

Comment by markus makela [ 2016-11-11 ]

We've fixed a few bugs in the reconnection logic. Here's an Ubuntu Xenial package for testing built from commit 7ef8b187b541410810fc090814de48a107d729b7:
http://max-tst-01.mariadb.com/ci-repository/2.0-release-nov11/mariadb-maxscale/

Although we haven't been able to reproduce the crash, we believe that the reason for the double closing of a DCB could be related to how a server reference got misplaced after the list of references was sorted to pick the best candidates. We believe that this has been fixed and that router_options=master_failure_mode=error_on_write should work again.

mcremers, if it is possible, please test with the original configuration and the new package.

Comment by Marlin Cremers [ 2016-11-13 ]

MariaDB Corporation MaxScale	/var/log/maxscale/maxscale1.log Fri Nov 11 11:01:24 2016
-----------------------------------------------------------------------
2016-11-11 11:01:24   notice : Working directory: /var/log/maxscale
2016-11-11 11:01:24   notice : MariaDB MaxScale 2.0.1 started
2016-11-11 11:01:24   notice : MaxScale is running in process 32648
2016-11-11 11:01:24   notice : Configuration file: /etc/maxscale.cnf
2016-11-11 11:01:24   notice : Log directory: /var/log/maxscale
2016-11-11 11:01:24   notice : Data directory: /var/lib/maxscale
2016-11-11 11:01:24   notice : Module directory: /usr/lib/x86_64-linux-gnu/maxscale
2016-11-11 11:01:24   notice : Service cache: /var/cache/maxscale
2016-11-11 11:01:24   notice : Initialise CLI router module V1.0.0.
2016-11-11 11:01:24   notice : Loaded module cli: V1.0.0 from /usr/lib/x86_64-linux-gnu/maxscale/libcli.so
2016-11-11 11:01:24   notice : Initializing statemend-based read/write split router module.
2016-11-11 11:01:24   notice : Loaded module readwritesplit: V1.1.0 from /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so
2016-11-11 11:01:24   notice : Initialise the MySQL Galera Monitor module V2.0.0.
2016-11-11 11:01:24   notice : Loaded module galeramon: V2.0.0 from /usr/lib/x86_64-linux-gnu/maxscale/libgaleramon.so
2016-11-11 11:01:24   notice : No query classifier specified, using default 'qc_sqlite'.
2016-11-11 11:01:24   notice : Loaded module qc_sqlite: V1.0.0 from /usr/lib/x86_64-linux-gnu/maxscale/libqc_sqlite.so
2016-11-11 11:01:24   notice : Using encrypted passwords. Encryption key: '/var/lib/maxscale/.secrets'.
2016-11-11 11:01:24   notice : Loaded 13 MySQL Users for service [Splitter Service].
2016-11-11 11:01:24   notice : Loaded module MySQLClient: V1.1.0 from /usr/lib/x86_64-linux-gnu/maxscale/libMySQLClient.so
2016-11-11 11:01:24   notice : Listening connections at 0.0.0.0:3306 with protocol MySQL
2016-11-11 11:01:24   notice : Loaded module maxscaled: V2.0.0 from /usr/lib/x86_64-linux-gnu/maxscale/libmaxscaled.so
2016-11-11 11:01:24   notice : Listening connections at localhost:6603 with protocol MaxScale Admin
2016-11-11 11:01:24   notice : Started MaxScale log flusher.
2016-11-11 11:01:24   notice : MaxScale started with 4 server threads.
2016-11-11 11:01:24   notice : Server changed state: galera1-alb[10.14.0.6:3306]: new_slave. [Running] -> [Slave, Synced, Running]
2016-11-11 11:01:24   notice : Server changed state: galera2-alb[10.14.0.7:3306]: new_master. [Running] -> [Master, Synced, Running]
2016-11-11 11:01:24   notice : Server changed state: galera3-alb[10.14.0.8:3306]: new_slave. [Running] -> [Slave, Synced, Running]
2016-11-11 11:01:31   notice : Loaded module MySQLAuth: V1.0.0 from /usr/lib/x86_64-linux-gnu/maxscale/libMySQLAuth.so
2016-11-11 11:01:31   notice : Loaded module MySQLBackend: V2.0.0 from /usr/lib/x86_64-linux-gnu/maxscale/libMySQLBackend.so
2016-11-12 09:13:48   notice : Server changed state: galera2-alb[10.14.0.7:3306]: lost_master. [Master, Synced, Running] -> [Running]
2016-11-12 09:13:48   notice : Server changed state: galera3-alb[10.14.0.8:3306]: new_master. [Slave, Synced, Running] -> [Master, Synced, Running]
2016-11-12 09:13:48   error  : Server 10.14.0.7:3306 lost the master status. Readwritesplit service can't locate the master. Client sessions will be closed.
2016-11-12 09:13:49   notice : Server changed state: galera2-alb[10.14.0.7:3306]: new_master. [Running] -> [Master, Synced, Running]
2016-11-12 09:13:49   notice : Server changed state: galera3-alb[10.14.0.8:3306]: new_slave. [Master, Synced, Running] -> [Slave, Synced, Running]
2016-11-12 17:45:58   notice : Server changed state: galera1-alb[10.14.0.6:3306]: slave_down. [Slave, Synced, Running] -> [Down]
2016-11-12 17:46:15   notice : Server changed state: galera1-alb[10.14.0.6:3306]: slave_up. [Down] -> [Slave, Synced, Running]
2016-11-13 19:33:00   error  : [dcb_close] Error : Removing DCB 0x7f8bd8014d90 but it is in state DCB_STATE_DISCONNECTED which is not legal for a call to dcb_close. The DCB is connected to: galera1-alb
2016-11-13 19:33:00   error  : Fatal: MaxScale 2.0.1 received fatal signal 6. Attempting backtrace.
2016-11-13 19:33:00   error  : Commit ID: 7ef8b187b541410810fc090814de48a107d729b7 System name: Linux Release string: Ubuntu 16.04.1 LTS
2016-11-13 19:33:00   error  :   /usr/bin/maxscale() [0x403ca7] 
2016-11-13 19:33:00   error  :   /lib/x86_64-linux-gnu/libpthread.so.0(+0x113e0) [0x7f8bf2f243e0] 
2016-11-13 19:33:00   error  :   /lib/x86_64-linux-gnu/libpthread.so.0(raise+0x29) [0x7f8bf2f242b9] 
2016-11-13 19:33:00   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(dcb_close+0x39) [0x7f8bf33ce2b0] 
2016-11-13 19:33:00   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so(+0x32ed) [0x7f8bebf532ed] 
2016-11-13 19:33:00   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libMySQLClient.so(+0x36ee) [0x7f8beb0766ee] 
2016-11-13 19:33:00   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(+0x32eae) [0x7f8bf33cbeae] 
2016-11-13 19:33:00   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(+0x32b1f) [0x7f8bf33cbb1f] 
2016-11-13 19:33:00   error  : [dcb_close] Error : Removing DCB 0x7f8bd8016100 but it is in state DCB_STATE_DISCONNECTED which is not legal for a call to dcb_close. The DCB is connected to: 	@

The daemon just crashed with the latest build and the original config.

Comment by markus makela [ 2016-11-13 ]

The log message seems to tell us that it was a slave DCB, i.e. a connection, that was closed twice.

With this new information, it seems that the problem could lie within the closing of DCBs in the processing of session commands and the recovery of slave connections. To test this hypothesis, please add back the disable_sescmd_history=true to the router_options parameter. If the crash does not occur, it is likely that this is the reason.

Meanwhile, we will focus on investigating the new areas of code that this additional information has provided us and take a new look at the slave connection code.

Comment by markus makela [ 2016-11-15 ]

We've identified a few problem cases with how the connections are handled. The reconnection of slave servers could be triggered even if no slave was disconnected. This could happen if the master connection was interrupted unexpectedly and the master would still be in Master state. These are now fixed by always treating the disconnection of the master the same way.

We've done a special testing build with extra log messages added where the crash can happen that can be found here: http://max-tst-01.mariadb.com/ci-repository/2.0-markusjm-nov15/mariadb-maxscale/

This was build from the 2.0-markusjm branch and commit 9556ee0f01bf12b1baeb93ad9483b87c37320c1b.

mcremers Please test this with the original configuration posted. We should get very detailed information if the crash still occurs.

Comment by markus makela [ 2016-11-21 ]

We discussed this on #maxscale on FreeNode and it seems there have been no problems with the latest package.

I'll close this bug and it can be reopened if it still shows up with 2.0.2.

Comment by Marlin Cremers [ 2016-11-27 ]

Just got this in the log.

2016-11-27 13:32:45   notice : DCB was closed on line 4533.
2016-11-27 13:32:45   notice : DCB was closed on line 4533.
2016-11-27 13:32:45   notice : DCB was closed on line 4533.
2016-11-27 13:32:45   notice : DCB was closed on line 4533.
2016-11-27 13:32:45   error  : [dcb_close] Error : Removing DCB 0x7fe2fc012e10 but it is in state DCB_STATE_DISCONNECTED which is not legal for a call to dcb_close. The DCB is connected to: ��

Comment by markus makela [ 2016-11-27 ]

Seems we didn't fix the true cause of the problem. Can you provide the stacktrace for the crash?

Comment by markus makela [ 2016-11-29 ]

The log messages would suggest that a DCB not related to the session (already closed) gets processed which triggers the reconnection logic. The reconnection seems to be the thing which causes these errors.

I've built a new package from commit ee3c42cff781bec3bbbf1898f5b248aaee92fefa with more detailed error messages about where the DCB was closed and where the attempt is being made. It also adds extra checks before the DCB is closed and warns if an reconnection occurs when it shouldn't happen.

The packages can be found here: http://max-tst-01.mariadb.com/ci-repository/2.0-markusjm-nov29/mariadb-maxscale/

Comment by markus makela [ 2016-12-01 ]

Based on a chat on IRC, MaxScale still crashes. I've added some extra logging about whether the DCB which is being closed has a corresponding backend server reference. I've also added a somewhat of a temporary fix which doesn't close the DCB in closeSession if it isn't in a valid state.

If possible, please test with this new package: http://max-tst-01.mariadb.com/ci-repository/2.0-markusjm-dec1/mariadb-maxscale/

The packages are build from commit 1272cccf537aad8b824e1df978e0454e5b3b6c40.

Comment by markus makela [ 2016-12-07 ]

A discussion on IRC pointed out that the temporary fix does indeed prevent the crash. It again pointed to some strange behavior in the error handling but it also uncovered a bug that could've allowed masters with inconsistent state to be used. Simplifying the connection closing logic in the error handler should give us a better guarantee that the backend server references and the actual connections stay in sync.

Comment by markus makela [ 2016-12-12 ]

We've fixed a minor bug with the error_on_write mode and some of the error handling. The code is also a bit simpler and all connections and the related backend references are closed at the same time.

The latest packages for the 2.0.3 release candidate can be found here: http://max-tst-01.mariadb.com/ci-repository/2.0-release-dec12/mariadb-maxscale/

The packages were built from commit 15a8675fca53da3417b8c0155e43d91e1173f208.

Comment by markus makela [ 2016-12-14 ]

With the fix mentioned earlier, the crash is prevented and this is OK for 2.0.3.

Generated at Thu Feb 08 04:03:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.