[MXS-559] Crash due to debug assertion in readwritesplit Created: 2016-01-26  Updated: 2016-02-01  Resolved: 2016-02-01

Status: Closed
Project: MariaDB MaxScale
Component/s: readwritesplit
Affects Version/s: 1.3.0
Fix Version/s: 1.3.0

Type: Bug Priority: Blocker
Reporter: markus makela Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MXS-548 Maxscale 1.2.1 crash on Ubuntu 14.04.... Closed
relates to MXS-564 Loading database dump through readwri... Closed

 Description   

Originally reported by engel75:

It looks like the "jan21" version is still crashing.

The crash occurred while using IP 10.0.248.202, on which the RW split router for the 5 MariaDB 5.5.47 nodes is listening, to load a database dump into the cluster:

mysql -h10.0.248.202  -uadmin -pxxxxxxxxxxxxxxxxxxxxxxxx ops_meta_db < ops_meta_db.sql

2016-01-25 16:21:34.150   error  : Backend hangup error handling.
2016-01-25 16:21:34.150   error  : Backend hangup error handling.
2016-01-25 19:04:29.502   error  : Backend hangup error handling.
2016-01-25 19:04:29.503   error  : Backend hangup error handling.
2016-01-25 19:34:53.271   error  : Backend hangup error handling.
2016-01-25 19:34:53.271   error  : Backend hangup error handling.
2016-01-25 19:34:53.273   error  : Backend hangup error handling.
2016-01-25 19:34:53.273   error  : Backend hangup error handling.
2016-01-25 20:06:34.445   error  : Backend hangup error handling.
2016-01-25 20:06:34.449   error  : Backend hangup error handling.
2016-01-25 20:06:34.452   error  : Backend hangup error handling.
2016-01-25 20:06:45.636   error  : Backend hangup error handling.
2016-01-25 20:07:26.176   error  : Backend hangup error handling.
2016-01-25 20:11:53.358   error  : debug assert /home/vagrant/workspace/server/modules/routing/readwritesplit/readwritesplit.c:1689 Dcb under- or overflow 
2016-01-25 20:11:53.358   error  : Fatal: MaxScale 1.3.0 received fatal signal 6. Attempting backtrace.
2016-01-25 20:11:53.358   error  : Commit ID: 99f39cb213511b95552c49052e77a12e46090129 System name: Linux Release string: Ubuntu 14.04.3 LTS Embedded library version: 5.5.42-MariaDB
2016-01-25 20:11:53.359   error  :   /usr/bin/maxscale() [0x54cb9e] 
2016-01-25 20:11:53.359   error  :   /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7fd6263c7340] 
2016-01-25 20:11:53.359   error  :   /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7fd624ffecc9] 
2016-01-25 20:11:53.359   error  :   /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7fd6250020d8] 
2016-01-25 20:11:53.359   error  :   /lib/x86_64-linux-gnu/libc.so.6(+0x2fb86) [0x7fd624ff7b86] 
2016-01-25 20:11:53.359   error  :   /lib/x86_64-linux-gnu/libc.so.6(+0x2fc32) [0x7fd624ff7c32] 
2016-01-25 20:11:53.359   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so(+0x4c67) [0x7fd61d420c67] 
2016-01-25 20:11:53.359   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so(+0x62a0) [0x7fd61d4222a0] 
2016-01-25 20:11:53.359   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so(+0x5a80) [0x7fd61d421a80] 
2016-01-25 20:11:53.360   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libMySQLClient.so(+0x751c) [0x7fd608e2251c] 
2016-01-25 20:11:53.360   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libMySQLClient.so(+0x51fa) [0x7fd608e201fa] 
2016-01-25 20:11:53.360   error  :   /usr/bin/maxscale() [0x565b59] 
2016-01-25 20:11:53.360   error  :   /usr/bin/maxscale(poll_waitevents+0x67e) [0x565137] 
2016-01-25 20:11:53.360   error  :   /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182) [0x7fd6263bf182] 
2016-01-25 20:11:53.360   error  :   /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fd6250c247d] 

/var/log/syslog from one of the Galera DB nodes:

Jan 25 16:21:34 wsccms-db05f-prod mysqld: 160125 16:21:34 [Warning] Aborted connection 694195 to db: 'ops_meta_db' user: 'admin' host: '10.0.249.211' (Unknown error)
Jan 25 19:04:29 wsccms-db05f-prod mysqld: 160125 19:04:29 [Warning] Aborted connection 704612 to db: 'ops_meta_db' user: 'admin' host: '10.0.249.211' (Unknown error)
Jan 25 19:34:53 wsccms-db05f-prod mysqld: 160125 19:34:53 [Warning] Aborted connection 706553 to db: 'ops_meta_db' user: 'admin' host: '10.0.249.211' (Unknown error)
Jan 25 20:06:34 wsccms-db05f-prod mysqld: 160125 20:06:34 [Warning] Aborted connection 708905 to db: 'ops_meta_db' user: 'admin' host: '10.0.249.211' (Unknown error)
Jan 25 20:11:53 wsccms-db05f-prod mysqld: 160125 20:11:53 [Warning] Aborted connection 331417 to db: 'unconnected' user: 'maxscalemon' host: '10.0.249.211' (Unknown error)

[MaxScale]
threads=4
syslog=1
maxlog=1
ms_timestamp=1
log_to_shm=0
log_warning=1
log_notice=1
log_info=0
log_debug=0
 
[CLI_listener]
type=listener
service=CLI
protocol=maxscaled
address=localhost
port=6603
 
[CLI]
type=service
router=cli
 
[debug]
type=service
router=debugcli
router_options=user
 
[debug_listener]
type=listener
service=debug
protocol=telnetd
port=4442
 
[galera_5_monitor]
type=monitor
module=galeramon
servers=dbf1,dbf2,dbf3,dbf4,dbf5
user=maxscalemon
passwd=euxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
monitor_interval=2000
disable_master_failback=1
available_when_donor=1
disable_master_role_setting=0
 
[galera_10_monitor]
type=monitor
module=galeramon
servers=dbg1,dbg2,dbg3,dbg4,dbg5
user=maxscalemon
passwd=exxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
monitor_interval=2000
disable_master_failback=1
available_when_donor=1
disable_master_role_setting=0
 
 
[galera_rr_service]
type=service
router=readconnroute
#connection_timeout=300
router_options=synced
servers=dbf1,dbf2,dbf3,dbf4,dbf5
user=maxscaleroute
passwd=Koxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
enable_root_user=0
version_string=5.5.47-MariaDB-RR
#optimize_wildcard=true
log_auth_warnings=true
 
[galera_rr10_service]
type=service
router=readconnroute
#connection_timeout=300
router_options=synced
servers=dbg1,dbg2,dbg3,dbg4,dbg5
user=maxscaleroute
passwd=Koxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
enable_root_user=0
version_string=10.1-MariaDB-RR
#optimize_wildcard=true
log_auth_warnings=true
 
 
[galera_rw_service]
type=service
router=readwritesplit
#connection_timeout=300
router_options=slave_selection_criteria=LEAST_CURRENT_OPERATIONS,disable_sescmd_history=true
max_slave_connections=100%
servers=dbf1,dbf2,dbf3,dbf4,dbf5
user=maxscaleroute
passwd=Kohxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
enable_root_user=0
version_string=5.5.47-MariaDB-RWsplit
#optimize_wildcard=true
log_auth_warnings=true
 
[galera_rw10_service]
type=service
router=readwritesplit
#connection_timeout=300
router_options=slave_selection_criteria=LEAST_CURRENT_OPERATIONS,disable_sescmd_history=true
max_slave_connections=100%
servers=dbg1,dbg2,dbg3,dbg4,dbg5
user=maxscaleroute
passwd=Koxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
enable_root_user=0
version_string=10.1-MariaDB-RWsplit
#optimize_wildcard=true
log_auth_warnings=true
 
 
[galera_rr_listener]
type=listener
service=galera_rr_service
protocol=MySQLClient
address=10.0.248.201
port=3306
 
[galera_rw_listener]
type=listener
service=galera_rw_service
protocol=MySQLClient
address=10.0.248.202
port=3306
 
[galera_rr10_listener]
type=listener
service=galera_rr10_service
protocol=MySQLClient
address=10.0.248.205
port=3306
 
[galera_rw10_listener]
type=listener
service=galera_rw10_service
protocol=MySQLClient
address=10.0.248.206
port=3306
 
[dbf1]
type=server
address=10.0.249.11
port=3306
protocol=MySQLBackend
 
[dbf2]
type=server
address=10.0.249.12
port=3306
protocol=MySQLBackend
 
[dbf3]
type=server
address=10.0.249.13
port=3306
protocol=MySQLBackend
 
[dbf4]
type=server
address=10.0.249.14
port=3306
protocol=MySQLBackend
 
[dbf5]
type=server
address=10.0.249.15
port=3306
protocol=MySQLBackend
 
[dbg1]
type=server
address=10.0.249.16
port=3306
protocol=MySQLBackend
 
[dbg2]
type=server
address=10.0.249.17
port=3306
protocol=MySQLBackend
 
[dbg3]
type=server
address=10.0.249.18
port=3306
protocol=MySQLBackend
 
[dbg4]
type=server
address=10.0.249.19
port=3306
protocol=MySQLBackend
 
[dbg5]
type=server
address=10.0.249.20
port=3306
protocol=MySQLBackend



 Comments   
Comment by Florian Engelmann [ 2016-01-26 ]

The .sql dump is 4.7 GB. After adding

SET SESSION SQL_LOG_BIN=0;

to the beginning of the file, the dump import failed again:

ERROR 2003 (HY000) at line 5452: Lost connection to backend server.

but MaxScale did not crash this time. So I enabled "log_info" (while "log_to_shm" is still disabled) to see which SQL statement breaks the import, but this time it did NOT fail.

The "log_bin" filesystem is 20 GB and still has 11 GB free.

I'll import the dump once again to see what happens this time.
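For reference, the SET SESSION SQL_LOG_BIN=0 workaround above can also be applied on the fly, without editing the 4.7 GB file. A sketch, reusing the host, user, and file names from the commands above ($PASSWORD is a placeholder):

```shell
# Prepend the session setting on the fly instead of editing the dump file;
# the brace group streams the statement and the dump into one mysql client.
{ echo 'SET SESSION SQL_LOG_BIN=0;'; cat ops_meta_db.sql; } \
    | mysql -h10.0.248.202 -uadmin -p"$PASSWORD" ops_meta_db
```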

Comment by markus makela [ 2016-01-26 ]

It crashes on line 1691 in readwritesplit due to a debug check failure on the master connection's DCB. Using the master's DCB isn't actually required: it can be replaced with the client's DCB, since both of them use the same authentication data, as was found out in MXS-548. The client session is also protected by a separate lock, which prevents any problems with state changes while a query is running.

Comment by Florian Engelmann [ 2016-01-26 ]

OK, sounds like an easy fix. Could we please get a new package, including the fix and with debugging enabled, once you are ready?
I don't want to stress you at all; I am just VERY willing to test like hell to get a stable version 1.3.0.

Comment by Florian Engelmann [ 2016-01-26 ]

No crash this time, but the import failed again.

ERROR 2003 (HY000) at line 5441: Lost connection to backend server.
 
real    14m29.897s
user    2m11.685s
sys     0m8.347s

2016-01-26 13:51:34.344   [27]  info   : Route query to master 	10.0.249.12:3306 <
2016-01-26 13:51:34.659   [27]  info   : > Autocommit: [enabled], trx is [not open], cmd: COM_QUERY, type: QUERY_TYPE_WRITE, stmt: INSERT INTO `spider_logs` VALUES (25560922,'xxxxxxxxxxxxxx.xxx','xxx.255.253.xx','xxxdex','2014-04-11','2014-04-11 09:22:52'),(25560923,'xxxxxxxx.xxx','xxx.158.149.xxx','xxxdex','2014-04-11','2014-04-11 09:22:55'),(25560924,'xxxxx.xxx','xxx.158.149.xxx','xxxdex','2014-04-11','2014-04-11 09:24:30'),(25560925,'xxxx.xxx','xxx.158.149.xxx','xxxdex','2014-04-11','2014-04-11 09:24:32'),(25560926,'xxx.xxx','xxx.158.149.xxx','xxxdex','2014-04-11','2014-04-11 09:22:57'),(25560927,'xxxxx.xxx','xxx.158.149.xxx','xxxdex','2014-04-11','2014-04-11 09:23:02'),(25560928,'xxxx.xxx','xxx.249.73.xxx','Google','2014-04-11','2014-04-11 09:24:34'),(25560929,'xxxxxx.xxx','xxx.158.149.xxx','xxxdex','2014-04-11','2014-04-11 09:24:36'),(25560930,'xxxx.xxx','93.158.149.31','xxxdex','2014-04-11','2014-04-11 09:23:04'),(25560931,'xxxx.xxx','xxx.158.149.xxx','xxxdex','2014-04-11','2014-04-11 09:23:06'),(25560932,'xxxxxx.xxx','93.158.1 
2016-01-26 13:51:34.659   [27]  info   : Route query to master 	10.0.249.12:3306 <
2016-01-26 13:51:34.867   error  : Backend hangup error handling.
2016-01-26 13:51:34.867   error  : Backend hangup error handling.
2016-01-26 13:51:35.936   [27]  info   : Stopped galera_rw_service client session [27]

Comment by markus makela [ 2016-01-26 ]

Here's a build with the fix added for Ubuntu Trusty: http://maxscale-jenkins.mariadb.com/ci-repository/release-1.3.0-jan26/

Comment by Florian Engelmann [ 2016-01-26 ]

great! Testing this one now... thank you!

Comment by Florian Engelmann [ 2016-01-26 ]

same problem with the new jan26 version:

2016-01-26 16:17:26.710   [7]  info   : Route query to master 	10.0.249.12:3306 <
2016-01-26 16:17:26.926   error  : Backend hangup error handling.
2016-01-26 16:17:26.926   error  : Backend hangup error handling.
2016-01-26 16:17:28.181   [7]  info   : Stopped galera_rw_service client session [7]

time mysql -h10.0.248.202  -uadmin -pxxxxxxxxxxxxxxxxxxxxxxxx  ops_meta_db < ops_meta_db_2.sql 
ERROR 2003 (HY000) at line 5509: Lost connection to backend server.
 
real    14m12.207s
user    2m12.788s
sys     0m8.654s

Jan 26 16:17:26 wsccms-db05f-prod mysqld: 160126 16:17:26 [Warning] Aborted connection 786006 to db: 'ops_meta_db' user: 'admin' host: '10.0.249.211' (Unknown error)

Testing the same command via HAProxy now to see if there is any problem related to our MariaDB Galera configuration.

Comment by Florian Engelmann [ 2016-01-26 ]

Success via HAProxy:

time mysql -h10.0.248.204  -uadmin -pxxxxxxxxxx  ops_meta_db < ops_meta_db_2.sql 
 
real    15m49.076s
user    2m34.957s
sys     0m17.460s

The problem is definitely MaxScale-related.

Comment by markus makela [ 2016-01-27 ]

The error that is logged comes from the error handling for backend server connections. It suggests that MaxScale lost its connection to the server, or thought it did because the timeout values are too low. If the database is slow during the dump, it might be a good idea to try higher timeouts for the monitors; see the Monitor documentation.

If it actually was caused by a timeout, there should be more messages than that single error. I'll continue investigating this and see if I can reproduce it.

Comment by markus makela [ 2016-01-27 ]

One thing that could also help is to do the dump through the readconnroute module if possible. It uses connection-based load balancing and is faster than readwritesplit when all writes go to the master server. If you can confirm that the dump succeeds with the readconnroute router, we can confirm that the problem is in the readwritesplit module.

For more details about the readconnroute module, please read the documentation here: https://github.com/mariadb-corporation/MaxScale/blob/release-1.3.0/Documentation/Routers/ReadConnRoute.md

Comment by markus makela [ 2016-01-27 ]

I was able to reproduce something similar to this:

2016-01-27 13:16:34   error  : Monitor timed out when connecting to server 192.168.0.200:3306 : "Lost connection to MySQL server at 'reading authorization packet', system error: 110 "Connection timed out""
2016-01-27 13:16:34   notice : Server 192.168.0.200:3306 lost the master status.
2016-01-27 13:16:34   notice : Server changed state: server1[192.168.0.200:3306]: master_down
2016-01-27 13:16:34   error  : No Master can be determined. Last known was 192.168.0.200:3306
2016-01-27 13:16:34   error  : Backend hangup error handling.
2016-01-27 13:16:34   error  : server 192.168.0.200:3306 lost the master status. Readwritesplit service can't locate the master. Client sessions will be closed.
2016-01-27 13:16:34   error  : Backend hangup -> closing session.
2016-01-27 13:16:45   notice : Server changed state: server1[192.168.0.200:3306]: master_up
2016-01-27 13:16:45   notice : A Master Server is now available: 192.168.0.200:3306

Here the monitor couldn't connect to the master because of the load from the dump restoration that was happening, but in this case the failure was explicitly logged in the error log.

Comment by markus makela [ 2016-01-28 ]

engel75, could you try the dump again without the info or debug log levels on? That would filter out some of the extra noise and make errors easier to spot. If you can also increase the timeouts for the monitor to about 10-15 seconds by adding backend_connect_timeout=10 and backend_read_timeout=10 to the monitor definition, we can rule out the monitor timing out and closing the connection.

Comment by Florian Engelmann [ 2016-01-28 ]

OK, the config now looks like:

[galera_5_monitor]
type=monitor
module=galeramon
servers=dbf1,dbf2,dbf3,dbf4,dbf5
user=maxscalemon
passwd=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
monitor_interval=2000
disable_master_failback=1
available_when_donor=1
disable_master_role_setting=0
backend_connect_timeout=10
backend_read_timeout=10

Same result:

 time mysql -h10.0.248.202  -uadmin -pxxxxxxxxxxxxxxxxxx  ops_meta_db < ops_meta_db_2.sql 
ERROR 2003 (HY000) at line 5317: Lost connection to backend server.

2016-01-28 09:55:47.849   error  : Backend hangup error handling.

Jan 28 09:55:47 wsccms-db05f-prod mysqld: 160128  9:55:47 [Warning] Aborted connection 945714 to db: 'ops_meta_db' user: 'admin' host: '10.0.249.211' (Unknown error)

MaxScale did not crash, so there is no core dump.

Comment by Florian Engelmann [ 2016-01-28 ]

By the way, importing the dump via the readconn router finished successfully:

time mysql -h10.0.248.201  -uadmin -pxxxxxxxxxxxxxxxx  ops_meta_db < ops_meta_db_2.sql 
 
real    16m44.778s
user    2m35.446s
sys     0m10.656s

I'll test this once again to be sure...

Comment by Florian Engelmann [ 2016-01-28 ]

Seems like the readconn router does not cause any trouble. My second import finished successfully too.

Comment by markus makela [ 2016-01-28 ]

OK, that's good to hear. One thing that could possibly help is to split the dump into separate statements and execute them one at a time, or to split the dump in two. That way we could find out whether it is always the same statement that fails or whether the failure is "random".

For the time being, I suggest using readconnroute to load dumps into a cluster, simply because it does a lot less processing and is thus a lot faster. Although readwritesplit is not the fastest way to do it, it does seem to be a great way to catch bugs in MaxScale.
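The split-and-bisect idea above can be sketched as follows. File names and credentials are placeholders taken from the earlier commands; note that a line-based split assumes no statement spans a chunk boundary, which holds for mysqldump's single-line extended INSERTs:

```shell
# Split the dump into 5000-line chunks named ops_chunk_aa, ops_chunk_ab, ...
split -l 5000 ops_meta_db_2.sql ops_chunk_

# Import the chunks one at a time, stopping at the first failure so the
# offending statements are narrowed down to a single chunk.
for f in ops_chunk_*; do
    echo "importing $f"
    mysql -h10.0.248.202 -uadmin -p"$PASSWORD" ops_meta_db < "$f" \
        || { echo "import failed in $f"; break; }
done
```

Repeating the run after halving the failing chunk quickly shows whether the same statement always triggers the failure.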

Comment by markus makela [ 2016-02-01 ]

The crash due to debug assertion has been fixed but the dump of the database still fails. I'll create a new bug report for that issue: MXS-564
