[MXS-3441] Maxscale loses connection to master when mysqldump is run Created: 2021-03-13  Updated: 2021-03-24  Resolved: 2021-03-24

Status: Closed
Project: MariaDB MaxScale
Component/s: Monitor
Affects Version/s: 2.5.5
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Vitaly Assignee: Unassigned
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

Debian buster with 1 Maxscale host, 1 master mariadb and 1 slave mariadb.
Maxscale 2.5.5
DB version: mysqldump Ver 10.17 Distrib 10.4.15-MariaDB, for Linux (x86_64)



 Description   

Hi, we have a fairly large database (500G), and when we run mysqldump against the MaxScale host, it exits with an error after about 30 seconds.
MaxScale prints this:
2021-03-13 14:40:26 error : Monitor timed out when connecting to server dbA[xx.xx.xx.01:3370] : ''
2021-03-13 14:40:26 notice : Server changed state: dbA[xx.xx.xx.01:3370]: master_down. [Master, Running] -> [Down]
2021-03-13 14:40:33 notice : /opt/runtime/maxscaleNotification.sh: OK|nil|
2021-03-13 14:40:33 notice : Executed monitor script on event 'master_down'. Script was: '/opt/runtime/maxscaleNotification.sh'
2021-03-13 14:40:33 warning: [mariadbmon] Master has failed. If master does not return in 4 monitor tick(s), failover begins.
2021-03-13 14:40:33 error : (46) [readwritesplit] (rwSplitter) Lost connection to the master server, closing session. Lost connection to master server while waiting for a result. Connection has been idle for 52 seconds. Error caused by: #HY000: (Generated event) backend server: connection closed by peer. Last close reason: <none>. Last error:
2021-03-13 14:40:33 error : (42) [readwritesplit] (rwSplitter) Server 'dbA' was lost in the middle of a resultset, cannot continue the session: #HY000: (Generated event) backend server: connection closed by peer
2021-03-13 14:40:33 error : (54) [readwritesplit] (rwSplitter) Couldn't find suitable Master from 2 candidates.
2021-03-13 14:40:33 notice : Server changed state: dbA[xx.xx.xx.01:3370]: master_up. [Down] -> [Master, Running]
2021-03-13 14:40:33 error : (54) (rwSplitter) Failed to create new router session for service 'rwSplitter'. See previous errors for more details.
2021-03-13 14:40:33 error : (54) [mariadbclient] Failed to create session for 'someuser'@'::ffff:xx.xx.xx.01'.
2021-03-13 14:40:36 notice : /opt/runtime/maxscaleNotification.sh: OK|nil|INFO SignalAccount - Config file is in use by another instance, waiting…

At the same time, master mariadb prints:
2021-03-13 9:40:20 10 [Warning] Aborted connection 10 to db: 'unconnected' user: 'orchestrator' host: 'maxScaleHostIP' (Got an error reading communication packets)
2021-03-13 9:40:23 40 [Warning] Aborted connection 40 to db: 'unconnected' user: 'unauthenticated' host: 'maxScaleHostIP' (This connection closed normally without authentication)
2021-03-13 9:40:26 41 [Warning] Aborted connection 41 to db: 'unconnected' user: 'unauthenticated' host: 'maxScaleHostIP' (This connection closed normally without authentication)

After further testing, it looks like the behavior is the same whether mysqldump is run against the individual MariaDB instances or the mariabackup tool is used. In either case the MaxScale monitor complains that the master is down.

Here is the MaxScale log while running mariabackup:
2021-03-13 15:34:29 error : Monitor timed out when connecting to server dbA[10.68.200.107:3370] : ''
2021-03-13 15:34:29 notice : Server changed state: dbA[xx.xx.xx.01:3370]: slave_down. [Slave, Running] -> [Down]
2021-03-13 15:34:32 notice : /opt/runtime/maxscaleNotification.sh: OK|nil|1615649671531
2021-03-13 15:34:32 notice : Executed monitor script on event 'slave_down'. Script was: '/opt/runtime/maxscaleNotification.sh'
2021-03-13 15:34:38 notice : Server 'dbA' charset: latin1
2021-03-13 15:34:38 notice : Server changed state: dbA[xx.xx.xx.01:3370]: slave_up. [Down] -> [Slave, Running]
2021-03-13 15:34:41 notice : /opt/runtime/maxscaleNotification.sh: OK|nil|1615649680507
2021-03-13 15:34:41 notice : Executed monitor script on event 'slave_up'. Script was: '/opt/runtime/maxscaleNotification.sh'
2021-03-13 15:35:34 error : Monitor timed out when connecting to server dbA[xx.xx.xx.01:3370] : ''
2021-03-13 15:35:34 notice : Server changed state: dbA[xx.xx.xx.01:3370]: slave_down. [Slave, Running] -> [Down]
2021-03-13 15:35:36 notice : /opt/runtime/maxscaleNotification.sh: OK|nil|1615649736263
2021-03-13 15:35:36 notice : Executed monitor script on event 'slave_down'. Script was: '/opt/runtime/maxscaleNotification.sh'
2021-03-13 15:36:00 notice : Server changed state: dbA[xx.xx.xx.01:3370]: slave_up. [Down] -> [Slave, Running]
2021-03-13 15:36:03 notice : /opt/runtime/maxscaleNotification.sh: OK|nil|1615649762662
2021-03-13 15:36:03 notice : Executed monitor script on event 'slave_up'. Script was: '/opt/runtime/maxscaleNotification.sh'
2021-03-13 15:36:12 error : Monitor timed out when connecting to server dbA[xx.xx.xx.01:3370] : ''
2021-03-13 15:36:12 notice : Server changed state: dbA[xx.xx.xx.01:3370]: slave_down. [Slave, Running] -> [Down]
2021-03-13 15:36:14 notice : /opt/runtime/maxscaleNotification.sh: OK|nil|1615649774428
2021-03-13 15:36:14 notice : Executed monitor script on event 'slave_down'. Script was: '/opt/runtime/maxscaleNotification.sh'
2021-03-13 15:36:38 notice : Server changed state: dbA[xx.xx.xx.01:3370]: slave_up. [Down] -> [Slave, Running]
2021-03-13 15:36:41 notice : /opt/runtime/maxscaleNotification.sh: OK|nil|1615649800784
2021-03-13 15:36:41 notice : Executed monitor script on event 'slave_up'. Script was: '/opt/runtime/maxscaleNotification.sh'

UPDATE: I updated to MaxScale 2.5.9 and I am getting a slightly different error:
2021-03-13 15:49:55 error : Monitor timed out when connecting to server dbA[10.68.200.107:3370] : 'Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 110'
2021-03-13 15:49:55 notice : Server changed state: dbA[xx.xx.xx.01:3370]: slave_down. [Slave, Running] -> [Down]

Any suggestions?



 Comments   
Comment by markus makela [ 2021-03-14 ]

What are the connection timeouts for the monitor? I'd recommend increasing the timeouts from the default values to rule that out as a reason.

Comment by Vitaly [ 2021-03-14 ]

These are the monitor settings that I am using; not sure if I am missing any other timeouts.
After more testing, it looks like it doesn't matter whether the DB is a master or a slave, the timeouts still happen.

[monitor]
type=monitor
module=mariadbmon
servers=dbA,dbB
user=orchestrator
password=xxxxxxxxxxxx
monitor_interval=2s
enforce_read_only_slaves=true
enforce_simple_topology=true
auto_failover=true
auto_rejoin=true
replication_user=replication
replication_password=xxxxxxxxxxxxxxxxx
verify_master_failure=true
master_failure_timeout=180
failcount=5
script=/opt/runtime/maxscaleNotification.sh
events=master_down,slave_down,master_up,slave_up

Comment by markus makela [ 2021-03-15 ]

I'd recommend configuring the following parameters:
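
(Note: the actual parameter list appears to have been lost in the ticket export. Based on the later comment that "those 3 timeouts" solved the issue, the parameters in question are most likely the monitor's backend timeout settings from the MaxScale common monitor parameters; the values below are illustrative assumptions, not the originals from this comment.)

[monitor]
# ... existing monitor settings ...
# Assumed parameters; tune values to your environment
backend_connect_timeout=10s
backend_read_timeout=10s
backend_write_timeout=10s

These control how long the monitor waits when connecting to, reading from, and writing to a backend server before declaring the check failed.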

Comment by Vitaly [ 2021-03-15 ]

Thanks a lot, we will try it and do more testing this week.
Any idea what a typical timeout would be?
I am thinking of using 90s, but that almost seems too much.

Comment by markus makela [ 2021-03-15 ]

It really depends on how slow the server becomes when you're loading the database dumps. I'd start with a value of 20 seconds and increase it if the problem keeps happening. The downside of increasing the timeouts is that network failure detection becomes slower, as MaxScale has to wait for the timeout to expire.

Comment by Vitaly [ 2021-03-23 ]

Thanks a lot, guys!
Those 3 timeouts have "solved" this issue and I am now able to run mysqldump properly.
Is there by any chance a guide with recommended parameters and values for production?

Comment by markus makela [ 2021-03-24 ]

I'm afraid that I don't know of a guide like that.

Generated at Thu Feb 08 04:21:26 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.