[MDEV-28520] Semi-Sync Replication | Lag exist on slave node with second_behind_master=0 Created: 2022-05-10  Updated: 2022-06-16  Resolved: 2022-06-14

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.3
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Pramod Mahto Assignee: Andrei Elkin
Resolution: Incomplete Votes: 0
Labels: None

Attachments: PNG File Replication Lag on Node 3 comparing to Node 2 .png     File Replication_lag_semi_sync.mp4     File maxscale.cnf     File node_01_my.cnf     File node_02_my.cnf     File node_03_my.cnf     Text File semi_sync_Replication_slave_running_behind_master .txt    
Issue Links:
Blocks
is blocked by MXS-4126 With Semi-Sync Replication | Lag does... Closed

 Description   

Maxscale monitoring output

 
[root@max_01 ~]# maxctrl list servers
┌─────────┬───────────────┬──────┬─────────────┬─────────────────┬────────────┐
│ Server  │ Address       │ Port │ Connections │ State           │ GTID       │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_01 │ 192.168.47.23 │ 3306 │ 0           │ Master, Running │ 1-1-109916 │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_02 │ 192.168.47.24 │ 3306 │ 0           │ Slave, Running  │ 1-1-109916 │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_03 │ 192.168.47.25 │ 3306 │ 0           │ Slave, Running  │ 1-1-109916 │
└─────────┴───────────────┴──────┴─────────────┴─────────────────┴────────────┘
[root@max_01 ~]# maxctrl list services
┌─────────────────────────┬────────────────┬─────────────┬───────────────────┬───────────────────────────┐
│ Service                 │ Router         │ Connections │ Total Connections │ Servers                   │
├─────────────────────────┼────────────────┼─────────────┼───────────────────┼───────────────────────────┤
│ ReadOnlySlaveService1   │ readconnroute  │ 0           │ 0                 │ node_01, node_02, node_03 │
├─────────────────────────┼────────────────┼─────────────┼───────────────────┼───────────────────────────┤
│ ReadWriteMasterService1 │ readconnroute  │ 0           │ 0                 │ node_01, node_02, node_03 │
├─────────────────────────┼────────────────┼─────────────┼───────────────────┼───────────────────────────┤
│ ReadWriteSplitService1  │ readwritesplit │ 0           │ 0                 │ node_01, node_02, node_03 │
└─────────────────────────┴────────────────┴─────────────┴───────────────────┴───────────────────────────┘

Bulk transacation with 2.5K QPS using sysbench via maxscale.

 
[root@sysbench ~]# sysbench /usr/share/sysbench/oltp_read_write.lua --mysql-host=192.168.47.22 --mysql-port=4404 --mysql-user=sbtestuser --mysql-password=sbtestuser --mysql-db=sbtest --db-driver=mysql --tables=20 --threads=75 --rate=140 --events=0 --thread-stack-size=128K --report-interval=10 --time=2400 --forced-shutdown='0' run
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)
 
 
 
Running the test with following options:
Number of threads: 75
Target transaction rate: 140/sec
Report intermediate results every 10 second(s)
Initializing random number generator from current time
 
Forcing shutdown in 2400 seconds
 
Initializing worker threads...
 
Threads started!
 
[ 10s ] thds: 75 tps: 110.98 qps: 2294.64 (r/w/o: 1615.18/394.42/285.04) lat (ms,95%): 2120.76 err/s: 0.00 reconn/s: 0.00
[ 10s ] queue length: 210, concurrency: 75
[ 20s ] thds: 75 tps: 102.80 qps: 2057.77 (r/w/o: 1443.65/362.21/251.91) lat (ms,95%): 4517.90 err/s: 0.00 reconn/s: 0.00
[ 20s ] queue length: 543, concurrency: 75
[ 30s ] thds: 75 tps: 122.60 qps: 2446.42 (r/w/o: 1710.24/433.79/302.39) lat (ms,95%): 5507.54 err/s: 0.00 reconn/s: 0.00
[ 30s ] queue length: 678, concurrency: 75
[ 40s ] thds: 75 tps: 113.80 qps: 2274.18 (r/w/o: 1591.55/402.31/280.31) lat (ms,95%): 7346.49 err/s: 0.00 reconn/s: 0.00
[ 40s ] queue length: 944, concurrency: 75
[ 50s ] thds: 75 tps: 111.00 qps: 2225.94 (r/w/o: 1559.05/392.89/273.99) lat (ms,95%): 8955.74 err/s: 0.00 reconn/s: 0.00
[ 50s ] queue length: 1235, concurrency: 75
[ 60s ] thds: 75 tps: 118.00 qps: 2366.57 (r/w/o: 1659.98/416.30/290.30) lat (ms,95%): 10917.50 err/s: 0.00 reconn/s: 0.00
[ 60s ] queue length: 1488, concurrency: 75
[ 70s ] thds: 75 tps: 127.20 qps: 2530.43 (r/w/o: 1769.05/448.49/312.89) lat (ms,95%): 12163.09 err/s: 0.00 reconn/s: 0.00

 
[root@max_01 ~]# maxctrl list servers
┌─────────┬───────────────┬──────┬─────────────┬─────────────────┬────────────┐
│ Server  │ Address       │ Port │ Connections │ State           │ GTID       │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_01 │ 192.168.47.23 │ 3306 │ 75          │ Master, Running │ 1-1-115551 │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_02 │ 192.168.47.24 │ 3306 │ 75          │ Slave, Running  │ 1-1-115551 │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_03 │ 192.168.47.25 │ 3306 │ 75          │ Slave, Running  │ 1-1-115551 │
└─────────┴───────────────┴──────┴─────────────┴─────────────────┴────────────┘
[root@max_01 ~]# maxctrl list services
┌─────────────────────────┬────────────────┬─────────────┬───────────────────┬───────────────────────────┐
│ Service                 │ Router         │ Connections │ Total Connections │ Servers                   │
├─────────────────────────┼────────────────┼─────────────┼───────────────────┼───────────────────────────┤
│ ReadOnlySlaveService1   │ readconnroute  │ 0           │ 0                 │ node_01, node_02, node_03 │
├─────────────────────────┼────────────────┼─────────────┼───────────────────┼───────────────────────────┤
│ ReadWriteMasterService1 │ readconnroute  │ 0           │ 0                 │ node_01, node_02, node_03 │
├─────────────────────────┼────────────────┼─────────────┼───────────────────┼───────────────────────────┤
│ ReadWriteSplitService1  │ readwritesplit │ 75          │ 75                │ node_01, node_02, node_03 │
└─────────────────────────┴────────────────┴─────────────┴───────────────────┴───────────────────────────┘

Lets stop node_03 service

 
[root@node_03 ~]# systemctl stop mariadb
 
Every 2.0s: maxctrl list servers                                                                                                     Mon May  9 16:17:32 2022
 
┌─────────┬───────────────┬──────┬─────────────┬─────────────────┬────────────┐
│ Server  │ Address       │ Port │ Connections │ State           │ GTID       │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_01 │ 192.168.47.23 │ 3306 │ 75          │ Master, Running │ 1-1-123627 │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_02 │ 192.168.47.24 │ 3306 │ 75          │ Slave, Running  │ 1-1-123627 │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_03 │ 192.168.47.25 │ 3306 │ 0           │ Down            │ 1-1-121584 │
└─────────┴───────────────┴──────┴─────────────┴─────────────────┴────────────┘
 
[root@max_01 ~]# maxctrl list services
┌─────────────────────────┬────────────────┬─────────────┬───────────────────┬───────────────────────────┐
│ Service                 │ Router         │ Connections │ Total Connections │ Servers                   │
├─────────────────────────┼────────────────┼─────────────┼───────────────────┼───────────────────────────┤
│ ReadOnlySlaveService1   │ readconnroute  │ 0           │ 0                 │ node_01, node_02, node_03 │
├─────────────────────────┼────────────────┼─────────────┼───────────────────┼───────────────────────────┤
│ ReadWriteMasterService1 │ readconnroute  │ 0           │ 0                 │ node_01, node_02, node_03 │
├─────────────────────────┼────────────────┼─────────────┼───────────────────┼───────────────────────────┤
│ ReadWriteSplitService1  │ readwritesplit │ 75          │ 75                │ node_01, node_02, node_03 │
└─────────────────────────┴────────────────┴─────────────┴───────────────────┴───────────────────────────┘

Lets start node_03 service. From below output one can see GTID value is constant(lagging) and node_03 is lagging with SQL/IO Thread running as YES and second_behind_master=0

 
[root@node_03 ~]# systemctl start mariadb
 
Every 2.0s: maxctrl list servers                                                                                                     Mon May  9 16:18:19 2022
 
┌─────────┬───────────────┬──────┬─────────────┬─────────────────┬────────────┐
│ Server  │ Address       │ Port │ Connections │ State           │ GTID       │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_01 │ 192.168.47.23 │ 3306 │ 75          │ Master, Running │ 1-1-128503 │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_02 │ 192.168.47.24 │ 3306 │ 75          │ Slave, Running  │ 1-1-128503 │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_03 │ 192.168.47.25 │ 3306 │ 0           │ Slave, Running  │ 1-1-122052 │
└─────────┴───────────────┴──────┴─────────────┴─────────────────┴────────────┘
 
 
[root@node_03 ~]# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 14
Server version: 10.3.28-11-MariaDB-enterprise-log MariaDB Enterprise Server
 
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
 
 
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: 192.168.47.23
                   Master_User: repuser
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: node_01.000002
           Read_Master_Log_Pos: 100675097
                Relay_Log_File: node_03-relay-bin.000001
                 Relay_Log_Pos: 4
         Relay_Master_Log_File: node_01.000002
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 100675097
               Relay_Log_Space: 256
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: 0
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 1
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 1-1-122052
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 0
1 row in set (0.000 sec)



 Comments   
Comment by Andrei Elkin [ 2022-05-11 ]

SSS displays actually node3 is in sync:

Read_Master_Log_Pos: 100675097
Exec_Master_Log_Pos: 100675097

Comment by Andrei Elkin [ 2022-05-12 ]

pramod.mahto@mariadb.com, now I see what's the complain is about. So the slave IO thread is actually in a retry-reconnect loop

[Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'node_01.000002' at position 313353347; GTID position '1-1-381669'

and it can't succeed in that so no more events coming.
So the first thing to note is that the Slave IO thread state is not displayed correctly.

The reason or its traces of the reconnecting may be present in the master and slave error logs.
Could you please provide them.
Also it this effect can be reproduced make sure to configure both servers with log_warning=3 and upload more informative logs too.
Let me confirm though whether they will be necessary after I have processed any current logs.

Comment by Pramod Mahto [ 2022-05-13 ]

Elkin I Re-tested with log_warning=3 and nothing much logged into Master and slave node, while another [slave] node_03 still running behind other 2 nodes [master]* node_01* and [slave] node_02

[master] node_01
[slave] node_02, node_03

here* node_02* always remain sync with [master] node_01 while if we stop and start slave on [slave] node_03 , it will start lagging with SBM>0 . Once we stop the load on Master , after few seconds [slave] node_03 will resume replication with SMB>0 and will be full sync with [master] node_01. Attached the re-test case File name : semi_sync_Replication_slave_running_behind_master.txt

Comment by Andrei Elkin [ 2022-05-13 ]

pramod.mahto@mariadb.com, unfortunately there's no valuable data in your last message/attachment. I still need the full log_warning=3 error logs from the master and the lagging slave (better from both to have an option to compare). Sorry to ask that over again, you
may not believe but often developers pay attention to "unrelated" details (not to say how prone they are to ignore real important ones , at bad times of course).

If the error logs indeed won't have anything valuable, prepare yourself to investigate how/whether MXS is routing without any glitch.

Consider to exclude MXS from your test env to see how the system fares then.
And could you please report about when this has been done!

Check with P_S related docs to collect bytes sent/received from/to master/slave. OS level involvement - anything in syslogs?

Comment by Andrei Elkin [ 2022-05-27 ]

pramod.mahto@mariadb.com, thanks for the "confirmation" . Just in case what is this troubling VM?

Generated at Thu Feb 08 10:01:22 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.