Actually, this issue seems to be quite important and not related to wait_timeout. Approximately every minute I see this in the logs:

2020-11-03 14:48:22 error : (43377) [readwritesplit] Cannot choose server 'node3' as the master because it is not in use and a new connection to it cannot be created. Connection status:

name: [node1] status: [Slave, Synced, Running] state: [IN_USE] last opened at: [Thu Oct 29 01:14:42 2020] last closed at: [not closed] last close reason: [] num sescmd: [0]

name: [node2] status: [Slave, Synced, Running] state: [NOT_IN_USE] last opened at: [Sat Oct 24 08:52:55 2020] last closed at: [Mon Oct 26 05:11:53 2020] last close reason: [Slave connection failed: #HY000: Lost connection to backend server: connection closed by peer] num sescmd: [0]

name: [node3] status: [Master, Synced, Running] state: [NOT_IN_USE] last opened at: [Fri Oct 2 05:24:31 2020] last closed at: [Sat Oct 3 07:39:20 2020] last close reason: [Master connection failed: #HY000: Lost connection to backend server: connection closed by peer] num sescmd: [0]

But if I run 'list servers' I can see that all servers are up, a master is selected, and according to the logs they never go down. What could be the issue in this case, and how can I fix it? Thanks.

Configuration of this MaxScale server:

[maxscale]
threads=4

[GaleraMonitor]
type=monitor
module=galeramon
servers=node1,node2,node3
user=xxxx
password=xxxxxx
monitor_interval=1000
disable_master_failback=true
available_when_donor=true

[RW]
type=service
router=readwritesplit
servers=node1,node2,node3
user=xxxxxx
password=xxxxxxx
max_slave_connections=100%
connection_keepalive=30
master_failure_mode=fail_on_write
master_accept_reads=true
disable_sescmd_history=true

[CLI]
type=service
router=cli

[RWlistener]
type=listener
service=RW
protocol=mariadbclient
address=10.22.x.x
port=3306

[CLIListener]
type=listener
service=CLI
protocol=maxscaled
address=127.0.0.1
port=6603

[node1]
type=server
address=10.22.x.1
port=3306
protocol=mariadbbackend
priority=3
persistmaxtime=3600
persistpoolmax=100

[node2]
type=server
address=10.22.x.2
port=3306
protocol=mariadbbackend
priority=2
persistmaxtime=3600
persistpoolmax=100

[node3]
type=server
address=10.22.x.3
port=3306
protocol=mariadbbackend
priority=1
persistmaxtime=3600
persistpoolmax=100

It looks like the connections stay open for a relatively long time. Due to the way connection_keepalive works in 2.4, connections can be timed out by the backend if they remain completely unused for the duration of wait_timeout. In 2.5, connection keepalive was improved to keep completely idle connections alive as well.
I'd recommend replacing disable_sescmd_history=true with prune_sescmd_history=true and adding master_reconnection=true. This allows a new connection to the master to be created in case the old one is closed.
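Applied to the [RW] service section posted earlier, the suggested change would look roughly like this (a sketch; the credentials and other parameters are unchanged from the original configuration):

```ini
[RW]
type=service
router=readwritesplit
servers=node1,node2,node3
user=xxxxxx
password=xxxxxxx
max_slave_connections=100%
connection_keepalive=30
master_failure_mode=fail_on_write
master_accept_reads=true
# Replaces disable_sescmd_history=true: keep the session command history
# but prune it once it exceeds max_sescmd_history, so that new backend
# connections can replay it.
prune_sescmd_history=true
# Allow readwritesplit to open a new master connection when the old
# one has been closed by the backend.
master_reconnection=true
```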

Thanks, I'll give it a go tonight and report back here! What about the max_sescmd_history parameter? Should I adjust it from the default of 50, taking into account that we have at least 500 sessions to each MariaDB node of the cluster?

Unless you change more than 50 session variables on each connection, the default value is usually enough. Memory-wise the history doesn't take up much space, so increasing it probably wouldn't be noticeable either.
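Should it ever need raising, it is a single service-level parameter; the value below is purely illustrative:

```ini
# In the [RW] service section; the default is 50
max_sescmd_history=100
```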

Thanks a lot, I will report back as soon as possible on whether these settings helped!

Just a quick update - I set these parameters last night and since that change I haven't seen any errors about MaxScale being unable to choose a new master. We will monitor this for another 24 hours, as our wait_timeout is set to 86400 on the MariaDB nodes. I will post another update tomorrow so the ticket can be closed if everything is good. Thanks for your help!

OK, that's good to hear. We'll wait for your confirmation before closing this.

Unfortunately, after 24 hours (wait_timeout=86400) I started to see disconnect errors again, but only for the slave nodes; the master connection seems fine so far:

2020-11-05 21:30:48 error : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 21:30:51 error : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 21:30:53 error : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 21:30:55 error : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 21:30:57 error : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 21:31:00 error : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 21:31:03 error : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 21:32:17 error : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 21:32:19 error : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 21:32:21 error : Failed to execute query on server 'node2' ([10.22.x.2]:3306): Lost connection to MySQL server during query
2020-11-05 22:07:55 error : Failed to execute query on server 'node1' ([10.22.x.1]:3306): Lost connection to MySQL server during query
2020-11-05 22:13:40 error : Failed to execute query on server 'node1' ([10.22.x.1]:3306): Lost connection to MySQL server during query
2020-11-05 22:13:43 error : Failed to execute query on server 'node1' ([10.22.x.1]:3306): Lost connection to MySQL server during query
2020-11-05 22:13:45 error : Failed to execute query on server 'node1' ([10.22.x.1]:3306): Lost connection to MySQL server during query
2020-11-05 22:13:48 error : Failed to execute query on server 'node1' ([10.22.x.1]:3306): Lost connection to MySQL server during query
2020-11-05 22:13:50 error : Failed to execute query on server 'node1' ([10.22.x.1]:3306): Lost connection to MySQL server during query

The error looks different but still occurs even though keepalives are enabled.
MaxScale> list servers
Servers.
-------------------+-----------------+-------+-------------+-------------------------
Server             | Address         | Port  | Connections | Status
-------------------+-----------------+-------+-------------+-------------------------
node1              | 10.22.x.1       |  3306 |         509 | Slave, Synced, Running
node2              | 10.22.x.2       |  3306 |         509 | Slave, Synced, Running
node3              | 10.22.x.3       |  3306 |         509 | Master, Synced, Running

MaxScale> show service RW

Service: RW
Router:  readwritesplit
State:   Started

use_sql_variables_in:      all
slave_selection_criteria:  LEAST_CURRENT_OPERATIONS
master_failure_mode:       fail_on_write
max_slave_replication_lag: 0
retry_failed_reads:        true
strict_multi_stmt:         false
strict_sp_calls:           false
prune_sescmd_history:      true
disable_sescmd_history:    false
max_sescmd_history:        50
master_accept_reads:       true
connection_keepalive:      30
causal_reads:              false
causal_reads_timeout:      10
master_reconnection:       true
delayed_retry:             false
delayed_retry_timeout:     10

Number of router sessions:             1150478
Current no. of router sessions:        509
Number of queries forwarded:           18935459
Number of queries forwarded to master: 4364322 (23.05%)
Number of queries forwarded to slave:  14571137 (76.95%)
Number of queries forwarded to all:    2310889 (12.20%)
Number of read-write transactions:     0
Number of read-only transactions:      559708
Number of replayed transactions:       0

Server    Total       Read        Write      Sess Avg dur   % active   selects
node2     5864500     5864500     0          1.07444s       0.47%      3
node3     7667038     3302716     4364322    1.07444s       0.14%      4
node1     12336588    12336588    0          1.07446s       1.19%      9

Started:          Thu Nov 5 01:27:30 2020
Root user access: Disabled
Backend databases:
[10.22.x.1]:3306 Protocol: mariadbbackend Name: node1
[10.22.x.2]:3306 Protocol: mariadbbackend Name: node2
[10.22.x.3]:3306 Protocol: mariadbbackend Name: node3
Total connections:   1150553
Currently connected: 509

On the actual MariaDB nodes I can see many warnings like these:

2020-11-05 22:13:56 5932591 [Warning] Aborted connection 5932591 to db: 'unconnected' user: 'xxx' host: '10.22.x.x' (Got an error reading communication packets)
2020-11-05 22:13:56 5932179 [Warning] Aborted connection 5932179 to db: 'unconnected' user: 'xxx' host: '10.22.x.x' (Got an error reading communication packets)

The IP address here is the MaxScale server's IP, and the user is the one configured in the monitor section. We definitely have no issues in the local network, so I'm not sure why this is happening. Any ideas? Thanks!

I can still see these errors in the MaxScale log, but because we have master_reconnection=true set they no longer affect our application. So please feel free to close this ticket; I hope to switch over to 2.5.6 as soon as it's out, since connection_keepalive works differently there. Also, I've tuned my sysctl settings to:

net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 300

This might also have helped a bit, as I now see fewer errors in the MaxScale log, only a few during the day. Thanks for your help!
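For reference, one way to make such kernel settings persistent across reboots is a drop-in sysctl file (the file path below is my assumption, not something from this ticket):

```ini
# /etc/sysctl.d/99-tcp-keepalive.conf (hypothetical file name)
# Send the first keepalive probe after 300 s of idle time, then
# probe every 60 s; drop the connection after 9 failed probes.
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 9
```

Running `sysctl --system` as root reloads these files without a reboot.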