[MDEV-31749] New test rpl.rpl_parallel_sbm in bb-10.4-MDEV-30619 sporadically fails in various locations (prepatch: lines 100, 177, 184) (postpatch_1: lines 180, 187) Created: 2023-07-20  Updated: 2023-08-08  Resolved: 2023-07-25

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.4
Fix Version/s: 10.8.8, 10.4.31, 10.5.22, 10.6.15, 10.9.8, 10.10.6, 10.11.5, 11.0.3, 11.1.2, 11.2.1

Type: Bug Priority: Blocker
Reporter: Roel Van de Paar Assignee: Andrei Elkin
Resolution: Fixed Votes: 0
Labels: regression, seconds-behind-master, sporadic

Issue Links:
Blocks
PartOf
is part of MDEV-30619 Parallel Slave SQL Thread Can Update ... Closed
Problem/Incident
is caused by MDEV-30619 Parallel Slave SQL Thread Can Update ... Closed
Relates

 Description   

mysql-test$ ./mtr --repeat 1000 rpl.rpl_parallel_sbm

Leads to:

bb-10.4-MDEV-30619 10.4.31 07745e9e6a00ff6089c68f48d453dfa368f064f4 (Debug)

...
rpl.rpl_parallel_sbm 'mix'               [ 18 pass ]  18674
rpl.rpl_parallel_sbm 'mix'               [ 19 pass ]  18698
rpl.rpl_parallel_sbm 'mix'               [ 20 fail ]
        Test ended at 2023-07-20 10:13:11
 
CURRENT_TEST: rpl.rpl_parallel_sbm
mysqltest: At line 100: Seconds_Behind_Master should not have used second transaction timestamp
 
The result from queries just before the failure was:
< snip >
# Waiting for transaction to arrive on slave and begin SQL Delay..
# Validating SBM is updated on event arrival..
# ..done
connection slave;
UNLOCK TABLES;
include/sync_with_master_gtid.inc
#
# Pt 2) If the SQL thread has not entered an idle state, ensure
# following events do not update SBM
connection master;
# Sleep 2 to allow a buffer between events for SBM check
insert into t1 values (1);
# Sleep 3 to create gap between events
insert into t1 values (2);
include/save_master_pos.inc
connection slave;
LOCK TABLES t1 WRITE;
# Wait for first transaction to complete SQL delay and begin execution..
# Validate SBM calculation doesn't use the second transaction because SQL thread shouldn't have gone idle..
# SBM 3 was more recent than time since last transaction (4 seconds)

Or, in another instance:

bb-10.4-MDEV-30619 10.4.31 07745e9e6a00ff6089c68f48d453dfa368f064f4 (Optimized)

...
rpl.rpl_parallel_sbm 'mix'               [ 19 pass ]  18622
rpl.rpl_parallel_sbm 'mix'               [ 20 pass ]  19592
rpl.rpl_parallel_sbm 'mix'               [ 21 fail ]
        Test ended at 2023-07-20 12:47:36
 
CURRENT_TEST: rpl.rpl_parallel_sbm
mysqltest: At line 177: Seconds_Behind_Master is too recent
 
The result from queries just before the failure was:
< snip >
connection master;
include/sync_slave_sql_with_master.inc
# Lock t1 on slave so the first received transaction does not complete/commit
connection slave;
LOCK TABLES t1 WRITE;
connection master;
INSERT INTO t1 VALUES (1);
include/save_master_gtid.inc
connection slave;
# Waiting for first transaction to begin..
# Wait for SQL thread to catch up and idle
# Sleep 2s to create a gap between events
LOCK TABLES mysql.gtid_slave_pos WRITE;
connection master;
INSERT INTO t2 VALUES (1);
include/save_master_gtid.inc
connection slave;
# Waiting for second transaction to begin..
# Validate that the second transaction doesn't update Seconds_Behind_Master on reception after the SQL thread has gone idle..
# Seconds_Behind_Master was 2 yet should not have been less than 3 (1689821254 - 1689821251)

And:

bb-10.4-MDEV-30619 10.4.31 07745e9e6a00ff6089c68f48d453dfa368f064f4 (Optimized)

...
rpl.rpl_parallel_sbm 'mix'               [ 85 pass ]  18592
rpl.rpl_parallel_sbm 'mix'               [ 86 pass ]  19673
rpl.rpl_parallel_sbm 'mix'               [ 87 fail ]
        Test ended at 2023-07-20 13:33:27
 
CURRENT_TEST: rpl.rpl_parallel_sbm
mysqltest: At line 184: Seconds_Behind_Master is too old
 
The result from queries just before the failure was:
< snip >
include/sync_slave_sql_with_master.inc
# Lock t1 on slave so the first received transaction does not complete/commit
connection slave;
LOCK TABLES t1 WRITE;
connection master;
INSERT INTO t1 VALUES (1);
include/save_master_gtid.inc
connection slave;
# Waiting for first transaction to begin..
# Wait for SQL thread to catch up and idle
# Sleep 2s to create a gap between events
LOCK TABLES mysql.gtid_slave_pos WRITE;
connection master;
INSERT INTO t2 VALUES (1);
include/save_master_gtid.inc
connection slave;
# Waiting for second transaction to begin..
# Validate that the second transaction doesn't update Seconds_Behind_Master on reception after the SQL thread has gone idle..
# ..and that its value correctly corresponds to the first transaction's timestamp on the primary
# Seconds_Behind_Master was 3 yet should not have been more than 2 (1689824005 - 1689824003)

It usually takes between 20 to 50 attempts to reproduce, though it can take 100+. System load may affect reproducibility, more load is presumably better.
Observed on both debug and optimized builds as shown above.

This particular test is not available prior to the MDEV-30619 fixes, and the previous testcase (executed against a pre-MDEV-30619 debug (this test requires debug) build of 10.4.31 at ea4b8d4ce99d0f427dd7ed40b5eeb15e41eec01e) :

./mtr --repeat 100 rpl.rpl_delayed_parallel_slave_sbm

Does not produce any issue (400+ attempts).
May affect later MariaDB versions also.



 Comments   
Comment by Roel Van de Paar [ 2023-07-22 ]

Thank you for the patch bnestere. The bug is not fully resolved (line 180 still fails) as of the latest commit (006aca3111e0cbd5f172060774c74e328bea4b0a to bb-10.4-MDEV-30619). To speed up testing you can use:

./mtr --parallel=100 --repeat 1000 rpl.rpl_parallel_sbm

Leads to:

bb-10.4-MDEV-30619 10.4.31 006aca3111e0cbd5f172060774c74e328bea4b0a (Debug)

...
rpl.rpl_parallel 'innodb,stmt'           w47 [ 32 pass ]   9594
rpl.rpl_parallel 'innodb,row'            w34 [ 32 pass ]   9599
rpl.rpl_parallel_retry 'innodb,stmt'     w46 [ 43 pass ]   8796
rpl.rpl_parallel_multilevel2 'innodb,stmt' w57 [ 50 pass ]   7130
rpl.rpl_parallel_sbm 'row'               w39 [ 20 fail ]
        Test ended at 2023-07-22 12:29:42
 
CURRENT_TEST: rpl.rpl_parallel_sbm
mysqltest: At line 180: Seconds_Behind_Master is too recent
 
The result from queries just before the failure was:
< snip >
# MDEV-30619: Parallel Slave SQL Thread Can Update Seconds_Behind_Master with Active Workers
#
connection slave;
# Ensure the replica is fully idle before starting transactions
# Lock t1 on slave so the first received transaction does not complete/commit
LOCK TABLES t1 WRITE;
connection master;
insert into t1 values (3);
include/save_master_gtid.inc
connection slave;
# Waiting for first transaction to begin..
# Wait for SQL thread to catch up and idle
connection master;
# Sleep 2s to create a gap between events
INSERT INTO t2 VALUES (1);
include/save_master_gtid.inc
connection slave;
# Waiting for second transaction to begin..
# Validate that the second transaction doesn't update Seconds_Behind_Master on reception when worker threads are busy..
# Seconds_Behind_Master was 2 yet should not have been less than 3 (1689992980 - 1689992977)

Issue observed >35 times. Takes about 5-10 minutes to reproduce on a high-end, medium-busy server.

Comment by Roel Van de Paar [ 2023-07-23 ]

After the 1.5 day weekend run (--repeat 1000 --parallel=50 of relevant tests), this issue was also observed 9 times in rpl.rpl_parallel_sbm:

10.4.31 006aca3111e0cbd5f172060774c74e328bea4b0a (Debug)

...
rpl.rpl_parallel_sbm 'row'               w28 [ 12 fail ]
        Test ended at 2023-07-22 17:09:32
 
CURRENT_TEST: rpl.rpl_parallel_sbm
mysqltest: At line 187: Seconds_Behind_Master is too old
 
The result from queries just before the failure was:
< snip >
#
connection slave;
# Ensure the replica is fully idle before starting transactions
# Lock t1 on slave so the first received transaction does not complete/commit
LOCK TABLES t1 WRITE;
connection master;
insert into t1 values (3);
include/save_master_gtid.inc
connection slave;
# Waiting for first transaction to begin..
# Wait for SQL thread to catch up and idle
connection master;
# Sleep 2s to create a gap between events
INSERT INTO t2 VALUES (1);
include/save_master_gtid.inc
connection slave;
# Waiting for second transaction to begin..
# Validate that the second transaction doesn't update Seconds_Behind_Master on reception when worker threads are busy..
# ..and that its value correctly corresponds to the first transaction's timestamp on the primary
# Seconds_Behind_Master was 3 yet should not have been more than 2 (1690009769 - 1690009767)

Comment by Andrei Elkin [ 2023-07-24 ]

Howdy Roel!

I found an easy solution in simplifying the assert part of the test.
Now it contains just one and it states that the 2nd measurement of SBM can't produce less than the $sleep parameter value. In the buggy "OLD" version it could.

My --repeat 100 --parallel 8 is coming to an end w/o any issue.

Please have a look.

Cheers,
Andrei

Comment by Andrei Elkin [ 2023-07-25 ]

BB and manual testing have confirmed the fixes.

Generated at Thu Feb 08 10:26:11 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.