[MXS-4126] With Semi-Sync Replication | Lag doesn't prevent traffic routing to Slave node having second_behind_master=0 Created: 2022-05-10  Updated: 2022-11-08  Resolved: 2022-09-08

Status: Closed
Project: MariaDB MaxScale
Component/s: N/A
Affects Version/s: None
Fix Version/s: N/A

Type: New Feature Priority: Major
Reporter: Pramod Mahto Assignee: Todd Stoffel (Inactive)
Resolution: Won't Do Votes: 1
Labels: readwritesplit

Attachments: File maxscale.cnf     File node_01_my.cnf     File node_02_my.cnf     File node_03_my.cnf    
Issue Links:
Blocks
blocks MDEV-28520 Semi-Sync Replication | Lag exist on ... Closed

 Description   

With semi-synchronous replication, the master only waits for one replica to acknowledge that it has received and logged the events. In this setup, using the Seconds_Behind_Master value to decide whether to route traffic to the other slave nodes leads to false reporting: a slave can report Seconds_Behind_Master=0 while it is still well behind the master.

Perhaps we can consider an alternative to Seconds_Behind_Master when routing traffic to the nodes via MaxScale (for example, comparing the gtid_current_pos value across the nodes), or Seconds_Behind_Master should report a value greater than 0 whenever a slave node is running behind the master under semi-sync replication.
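The GTID-comparison idea above can be sketched in a few lines. This is not MaxScale code, just an illustration of the proposed check: it parses MariaDB's `domain-server-sequence` GTID format (as seen in the `maxctrl list servers` output below) and flags a slave as lagging when its sequence number trails the master's by more than an allowed gap. The `max_gap` threshold and a single replication domain are assumptions.

```python
def gtid_seq(gtid: str) -> int:
    """Extract the sequence number from a MariaDB GTID 'domain-server-seq'."""
    domain, server, seq = gtid.split("-")
    return int(seq)

def is_lagging(master_gtid: str, slave_gtid: str, max_gap: int = 0) -> bool:
    """Flag a slave as lagging when its executed sequence trails the master's
    by more than max_gap transactions (single replication domain assumed)."""
    return gtid_seq(master_gtid) - gtid_seq(slave_gtid) > max_gap

# GTIDs taken from the maxctrl output below: node_03 reports
# Seconds_Behind_Master=0 yet is ~6000 transactions behind the master.
print(is_lagging("1-1-187704", "1-1-181709", max_gap=100))  # node_03 -> True
print(is_lagging("1-1-187704", "1-1-187700", max_gap=100))  # node_02 -> False
```

Unlike Seconds_Behind_Master, which the slave computes from the timestamp of the event it is currently executing, this comparison only needs the GTID positions that the monitor already collects from every node.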

Bulk write load via the readwritesplit module:

 
sysbench /usr/share/sysbench/oltp_read_write.lua --mysql-host=192.168.47.22 --mysql-port=4404 --mysql-user=sbtestuser --mysql-password=sbtestuser --mysql-db=sbtest --db-driver=mysql --tables=20 --threads=75 --rate=140 --events=0 --thread-stack-size=128K --report-interval=10 --time=2400 --forced-shutdown='0' run
 
 
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)
 
Running the test with following options:
Number of threads: 75
Target transaction rate: 140/sec
Report intermediate results every 10 second(s)
Initializing random number generator from current time
 
Forcing shutdown in 2400 seconds
 
Initializing worker threads...
 
Threads started!
 
[ 10s ] thds: 75 tps: 96.68 qps: 1987.71 (r/w/o: 1396.62/351.43/239.65) lat (ms,95%): 3095.38 err/s: 0.00 reconn/s: 0.00
[ 10s ] queue length: 462, concurrency: 75
[ 20s ] thds: 75 tps: 98.80 qps: 1968.43 (r/w/o: 1378.62/352.80/237.00) lat (ms,95%): 6713.97 err/s: 0.00 reconn/s: 0.00
 
 
 
[root@max_01 ~]# maxctrl list servers
┌─────────┬───────────────┬──────┬─────────────┬─────────────────┬────────────┐
│ Server  │ Address       │ Port │ Connections │ State           │ GTID       │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_01 │ 192.168.47.23 │ 3306 │ 0           │ Master, Running │ 1-1-146538 │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_02 │ 192.168.47.24 │ 3306 │ 0           │ Slave, Running  │ 1-1-146538 │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_03 │ 192.168.47.25 │ 3306 │ 0           │ Slave, Running  │ 1-1-146538 │
└─────────┴───────────────┴──────┴─────────────┴─────────────────┴────────────┘

Let's stop the mariadb service on slave node_03:

 
[root@node_03 ~]# systemctl stop mariadb
 
 
 
Every 2.0s: maxctrl list servers                                                                                                     Mon May  9 17:33:11 2022
 
┌─────────┬───────────────┬──────┬─────────────┬─────────────────┬────────────┐
│ Server  │ Address       │ Port │ Connections │ State           │ GTID       │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_01 │ 192.168.47.23 │ 3306 │ 75          │ Master, Running │ 1-1-148993 │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_02 │ 192.168.47.24 │ 3306 │ 75          │ Slave, Running  │ 1-1-148993 │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_03 │ 192.168.47.25 │ 3306 │ 0           │ Down            │ 1-1-146936 │
└─────────┴───────────────┴──────┴─────────────┴─────────────────┴────────────┘

Let's start the mariadb service on node_03:

 
[root@node_03 ~]# systemctl start  mariadb
 
 
Every 2.0s: maxctrl list servers                                                                                                     Mon May  9 17:34:05 2022
 
┌─────────┬───────────────┬──────┬─────────────┬─────────────────┬────────────┐
│ Server  │ Address       │ Port │ Connections │ State           │ GTID       │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_01 │ 192.168.47.23 │ 3306 │ 75          │ Master, Running │ 1-1-155397 │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_02 │ 192.168.47.24 │ 3306 │ 75          │ Slave, Running  │ 1-1-155397 │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_03 │ 192.168.47.25 │ 3306 │ 0           │ Slave, Running  │ 1-1-147361 │
└─────────┴───────────────┴──────┴─────────────┴─────────────────┴────────────┘

 
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: 192.168.47.23
                   Master_User: repuser
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: node_01.000002
           Read_Master_Log_Pos: 121408196
                Relay_Log_File: node_03-relay-bin.000001
                 Relay_Log_Pos: 4
         Relay_Master_Log_File: node_01.000002
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 121408196
               Relay_Log_Space: 256
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: 0
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 1
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 1-1-147361
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 0
1 row in set (0.000 sec)

Now run a read-only load via the readwritesplit module through MaxScale:

 
[root@sysbench ~]# sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-host=192.168.47.22 --mysql-port=4404 --mysql-user=sbtestuser --mysql-password=sbtestuser --mysql-db=sbtest --db-driver=mysql --tables=20 --threads=10 --rate=140 --events=0 --thread-stack-size=128K --report-interval=10 --time=2400 --forced-shutdown='0' run
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)
 
Running the test with following options:
Number of threads: 10
Target transaction rate: 140/sec
Report intermediate results every 10 second(s)
Initializing random number generator from current time
 
Forcing shutdown in 2400 seconds
 
Initializing worker threads...
 
Threads started!
 
[ 10s ] thds: 10 tps: 14.00 qps: 226.94 (r/w/o: 197.95/0.00/28.99) lat (ms,95%): 8795.93 err/s: 0.00 reconn/s: 0.00
[ 10s ] queue length: 1263, concurrency: 10
[ 20s ] thds: 10 tps: 14.60 qps: 236.90 (r/w/o: 208.10/0.00/28.80) lat (ms,95%): 17435.99 err/s: 0.00 reconn/s: 0.00
[ 20s ] queue length: 2537, concurrency: 10
[ 30s ] thds: 10 tps: 14.40 qps: 234.90 (r/w/o: 205.70/0.00/29.20) lat (ms,95%): 26861.48 err/s: 0.00 reconn/s: 0.00
[ 30s ] queue length: 3798, concurrency: 10
 
 
Every 2.0s: maxctrl list servers                                                                                                     Mon May  9 17:39:40 2022
 
┌─────────┬───────────────┬──────┬─────────────┬─────────────────┬────────────┐
│ Server  │ Address       │ Port │ Connections │ State           │ GTID       │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_01 │ 192.168.47.23 │ 3306 │ 85          │ Master, Running │ 1-1-187704 │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_02 │ 192.168.47.24 │ 3306 │ 85          │ Slave, Running  │ 1-1-187700 │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node_03 │ 192.168.47.25 │ 3306 │ 10          │ Slave, Running  │ 1-1-181709 │
└─────────┴───────────────┴──────┴─────────────┴─────────────────┴────────────┘

Traffic is routed to the lagging slave node_03 even though it reports Seconds_Behind_Master=0, while its GTID (1-1-181709) is ~6000 transactions behind the master (1-1-187704):

 
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: 192.168.47.23
                   Master_User: repuser
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: node_01.000002
           Read_Master_Log_Pos: 149545891
                Relay_Log_File: node_03-relay-bin.000001
                 Relay_Log_Pos: 4
         Relay_Master_Log_File: node_01.000002
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 149545891
               Relay_Log_Space: 256
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: 0
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 1
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 1-1-181709
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 0
1 row in set (0.000 sec)
 



 Comments   
Comment by markus makela [ 2022-05-16 ]

Since this seems to be expected behavior, I'll change this to a feature request. If the server reports that it is up-to-date, then MaxScale considers it not to be behind. Some other metric could be used to detect whether a server is up-to-date or lagging behind because the master node is not able to send the data fast enough.

Comment by markus makela [ 2022-09-08 ]

This falls into causal_reads territory, as you can't guarantee zero lag without communicating with multiple servers.
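For reference, causal_reads is a readwritesplit service parameter that makes reads wait until the slave has applied the client's own writes before answering. A hedged sketch of how it would look in the attached maxscale.cnf (the service name, server names, and credentials here are assumptions, not taken from the attachment):

```ini
[Read-Write-Service]
type=service
router=readwritesplit
servers=node_01,node_02,node_03
user=maxuser
password=maxpwd
# Make each read wait until the chosen slave has caught up
# with the session's own writes before returning a result
causal_reads=local
# If the slave does not catch up within this time, retry elsewhere
causal_reads_timeout=10s
```

This addresses read-your-writes consistency per session rather than global lag, which is why it is the suggested territory for this problem.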

Generated at Thu Feb 08 04:26:23 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.