[MXS-3274] MySQL Server has gone away with connection_keepalive enabled Created: 2020-10-31  Updated: 2020-11-09  Resolved: 2020-11-09

Status: Closed
Project: MariaDB MaxScale
Component/s: readwritesplit
Affects Version/s: 2.4.12
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Yury Kirsanov Assignee: markus makela
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

Ubuntu 18.04.5 LTS, MariaDB Galera cluster



 Description   

I have configured the connection_keepalive=1 parameter under the ReadWriteSplit router. Our MariaDB servers are configured with wait_timeout=28800 and interactive_timeout=28800, but we are still experiencing 'MySQL server has gone away' errors during operation of our cluster. According to the documentation, MaxScale should keep connections open by sending periodic keepalive packets. I've checked our MariaDB cluster and can confirm that no nodes failed to cause this issue; the disconnect was instead caused by a long-open connection that wasn't sending any queries. The application was sending queries like 'SELECT 1', but it looks like they were routed to slaves, allowing the master connection to time out.



 Comments   
Comment by Yury Kirsanov [ 2020-11-03 ]

Actually, this issue seems to be quite important and not related to wait_timeout. Approximately every minute I see this in the logs:

2020-11-03 14:48:22   error  : (43377) [readwritesplit] Cannot choose server 'node3' as the master because it is not in use and a new connection to it cannot be created. Connection status:
name: [node1] status: [Slave, Synced, Running] state: [IN_USE] last opened at: [Thu Oct 29 01:14:42 2020] last closed at: [not closed] last close reason: [] num sescmd: [0]
name: [node2] status: [Slave, Synced, Running] state: [NOT_IN_USE] last opened at: [Sat Oct 24 08:52:55 2020] last closed at: [Mon Oct 26 05:11:53 2020] last close reason: [Slave connection failed: #HY000: Lost connection to backend server: connection closed by peer] num sescmd: [0]
name: [node3] status: [Master, Synced, Running] state: [NOT_IN_USE] last opened at: [Fri Oct  2 05:24:31 2020] last closed at: [Sat Oct  3 07:39:20 2020] last close reason: [Master connection failed: #HY000: Lost connection to backend server: connection closed by peer] num sescmd: [0]

But if I run 'list servers' I can see that all servers are up, a master is selected, and according to the logs they never go down. What could be the issue in this case, and how can it be fixed? Thanks.

Comment by Yury Kirsanov [ 2020-11-03 ]

Configuration of this MaxScale server:

[maxscale]
threads=4
 
[GaleraMonitor]
type=monitor
module=galeramon
servers=node1,node2,node3
user=xxxx
password=xxxxxx
monitor_interval=1000
disable_master_failback=true
available_when_donor=true
 
[RW]
type=service
router=readwritesplit
servers=node1,node2,node3
user=xxxxxx
password=xxxxxxx
max_slave_connections=100%
connection_keepalive=30
master_failure_mode=fail_on_write
master_accept_reads=true
disable_sescmd_history=true
 
[CLI]
type=service
router=cli
 
[RWlistener]
type=listener
service=RW
protocol=mariadbclient
address=10.22.x.x
port=3306
 
[CLIListener]
type=listener
service=CLI
protocol=maxscaled
address=127.0.0.1
port=6603
 
[node1]
type=server
address=10.22.x.1
port=3306
protocol=mariadbbackend
priority=3
persistmaxtime=3600
persistpoolmax=100
 
[node2]
type=server
address=10.22.x.2
port=3306
protocol=mariadbbackend
priority=2
persistmaxtime=3600
persistpoolmax=100
 
[node3]
type=server
address=10.22.x.3
port=3306
protocol=mariadbbackend
priority=1
persistmaxtime=3600
persistpoolmax=100

Comment by markus makela [ 2020-11-03 ]

Looks like the connections stay open for a relatively long time. Due to the way connection_keepalive works in 2.4, connections can be timed out by the backend if they are completely unused for the duration of wait_timeout. In 2.5, connection keepalive was improved to keep completely idle connections alive.

I'd recommend replacing disable_sescmd_history=true with prune_sescmd_history=true and adding master_reconnection=true. This will allow a new connection to be created to the master in case the connection is closed.
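For reference, applying those two suggestions to the [RW] service section posted above would look something like this (a sketch based on the posted configuration, not an authoritative config; credentials elided as in the original):

```
[RW]
type=service
router=readwritesplit
servers=node1,node2,node3
user=xxxxxx
password=xxxxxxx
max_slave_connections=100%
connection_keepalive=30
master_failure_mode=fail_on_write
master_accept_reads=true
# replaces disable_sescmd_history=true: keep the session command
# history, but prune it so it stays within max_sescmd_history
prune_sescmd_history=true
# allow a new master connection to be created if the old one closes
master_reconnection=true
```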

Comment by Yury Kirsanov [ 2020-11-03 ]

Thanks, I'll give it a go tonight and report here! What about the max_sescmd_history parameter? Should I raise it from the default of 50, taking into account that we have at least 500 sessions to each MariaDB node of the cluster?

Comment by markus makela [ 2020-11-03 ]

Unless you change over 50 session variables with each connection, the default value is usually enough. Memory-wise it doesn't take much, so increasing it probably wouldn't be that noticeable either.

Comment by Yury Kirsanov [ 2020-11-03 ]

Thanks a lot, I will report as soon as possible if these settings helped!

Comment by Yury Kirsanov [ 2020-11-05 ]

Just a quick update: I set these parameters last night, and since that change I don't see any errors about MaxScale not being able to choose a new master. We will monitor this for another 24 hours, as our wait_timeout is set to 86400 on the MariaDB nodes. I will post another update tomorrow so the ticket can be closed if everything is good. Thanks for your help!

Comment by markus makela [ 2020-11-05 ]

OK, that's good to hear. We'll wait on your confirmation before closing this.

Comment by Yury Kirsanov [ 2020-11-05 ]

Unfortunately, after 24 hours (wait_timeout=86400) I started to see disconnect errors again, but only to the slave nodes; the master connection seems to be fine so far:

2020-11-05 21:30:48   error  : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 21:30:51   error  : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 21:30:53   error  : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 21:30:55   error  : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 21:30:57   error  : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 21:31:00   error  : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 21:31:03   error  : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 21:32:17   error  : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 21:32:19   error  : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 21:32:21   error  : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 22:07:55   error  : Failed to execute query on server 'node1' ([10.22.x.1]:3306): Lost connection to MySQL server during query
2020-11-05 22:13:40   error  : Failed to execute query on server 'node1' ([10.22.x.1]:3306): Lost connection to MySQL server during query
2020-11-05 22:13:43   error  : Failed to execute query on server 'node1' ([10.22.x.1]:3306): Lost connection to MySQL server during query
2020-11-05 22:13:45   error  : Failed to execute query on server 'node1' ([10.22.x.1]:3306): Lost connection to MySQL server during query
2020-11-05 22:13:48   error  : Failed to execute query on server 'node1' ([10.22.x.1]:3306): Lost connection to MySQL server during query
2020-11-05 22:13:50   error  : Failed to execute query on server 'node1' ([10.22.x.1]:3306): Lost connection to MySQL server during query

The error looks different, but it still happens even though keepalives are enabled.

MaxScale> list servers
Servers.
-------------------+-----------------+-------+-------------+--------------------
Server             | Address         | Port  | Connections | Status              
-------------------+-----------------+-------+-------------+--------------------
node1              | 10.22.x.1       |  3306 |         509 | Slave, Synced, Running
node2              | 10.22.x.2       |  3306 |         509 | Slave, Synced, Running
node3              | 10.22.x.3       |  3306 |         509 | Master, Synced, Running

MaxScale> show service RW
	Service:                             RW
	Router:                              readwritesplit
	State:                               Started
 
	use_sql_variables_in:      all
	slave_selection_criteria:  LEAST_CURRENT_OPERATIONS
	master_failure_mode:       fail_on_write
	max_slave_replication_lag: 0
	retry_failed_reads:        true
	strict_multi_stmt:         false
	strict_sp_calls:           false
	prune_sescmd_history:    true
	disable_sescmd_history:    false
	max_sescmd_history:        50
	master_accept_reads:       true
	connection_keepalive:       30
	causal_reads:       false
	causal_reads_timeout:       10
	master_reconnection:       true
	delayed_retry:        false
	delayed_retry_timeout:       10
 
	Number of router sessions:              1150478
	Current no. of router sessions:         509
	Number of queries forwarded:            18935459
	Number of queries forwarded to master:  4364322 (23.05%)
	Number of queries forwarded to slave:   14571137 (76.95%)
	Number of queries forwarded to all:     2310889 (12.20%)
	Number of read-write transactions:      0
	Number of read-only transactions:       559708
	Number of replayed transactions:        0
        Server      Total       Read      Write  Sess Avg:      dur      active    selects
         node2    5864500    5864500          0            1.07444s       0.47%          3
         node3    7667038    3302716    4364322            1.07444s       0.14%          4
         node1   12336588   12336588          0            1.07446s       1.19%          9
	Started:                             Thu Nov  5 01:27:30 2020
	Root user access:                    Disabled
	Backend databases:
		[10.22.x.1]:3306    Protocol: mariadbbackend    Name: node1
		[10.22.x.2]:3306    Protocol: mariadbbackend    Name: node2
		[10.22.x.3]:3306    Protocol: mariadbbackend    Name: node3
	Total connections:                   1150553
	Currently connected:                 509

On actual MariaDB nodes I can see many warnings like these:

2020-11-05 22:13:56 5932591 [Warning] Aborted connection 5932591 to db: 'unconnected' user: 'xxx' host: '10.22.x.x' (Got an error reading communication packets)                                                                     
2020-11-05 22:13:56 5932179 [Warning] Aborted connection 5932179 to db: 'unconnected' user: 'xxx' host: '10.22.x.x' (Got an error reading communication packets)                  

The IP address here is the MaxScale server's IP, and the user is the one configured under the monitor section. We definitely have no issues in the local network, so I'm not sure why this is happening. Any ideas? Thanks!

Comment by Yury Kirsanov [ 2020-11-09 ]

I can still see these errors in the MaxScale log, but because we have master_reconnection=true set, they don't affect our application any more. So please feel free to close this ticket; I hope to switch over to 2.5.6 as soon as it's out, since connection_keepalive works differently there. Also, I've tuned my sysctl settings to:

net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 300

This might also have helped a bit, as I can see fewer errors in the MaxScale log, only a few during the day. Thanks for your help!
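As an aside, these three kernel settings have per-socket equivalents, so a client or proxy that enables SO_KEEPALIVE can also set them itself rather than relying on the system-wide sysctl values. A minimal illustrative Python sketch (Linux-specific socket options; not part of MaxScale):

```python
import socket

# Create a TCP socket and enable keepalive probes on it.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Per-socket equivalents of the sysctl values above (Linux-only constants):
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 300)  # tcp_keepalive_time
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)  # tcp_keepalive_intvl
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 9)     # tcp_keepalive_probes

# Read back the idle time to confirm the option took effect.
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))  # 300
s.close()
```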

Generated at Thu Feb 08 04:20:13 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.