Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-31749

New test rpl.rpl_parallel_sbm in bb-10.4-MDEV-30619 sporadically fails in various locations (prepatch: lines 100, 177, 184) (postpatch_1: lines 180, 187)

Details

    Description

      mysql-test$ ./mtr --repeat 1000 rpl.rpl_parallel_sbm
      

      Leads to:

      bb-10.4-MDEV-30619 10.4.31 07745e9e6a00ff6089c68f48d453dfa368f064f4 (Debug)

      ...
      rpl.rpl_parallel_sbm 'mix'               [ 18 pass ]  18674
      rpl.rpl_parallel_sbm 'mix'               [ 19 pass ]  18698
      rpl.rpl_parallel_sbm 'mix'               [ 20 fail ]
              Test ended at 2023-07-20 10:13:11
       
      CURRENT_TEST: rpl.rpl_parallel_sbm
      mysqltest: At line 100: Seconds_Behind_Master should not have used second transaction timestamp
       
      The result from queries just before the failure was:
      < snip >
      # Waiting for transaction to arrive on slave and begin SQL Delay..
      # Validating SBM is updated on event arrival..
      # ..done
      connection slave;
      UNLOCK TABLES;
      include/sync_with_master_gtid.inc
      #
      # Pt 2) If the SQL thread has not entered an idle state, ensure
      # following events do not update SBM
      connection master;
      # Sleep 2 to allow a buffer between events for SBM check
      insert into t1 values (1);
      # Sleep 3 to create gap between events
      insert into t1 values (2);
      include/save_master_pos.inc
      connection slave;
      LOCK TABLES t1 WRITE;
      # Wait for first transaction to complete SQL delay and begin execution..
      # Validate SBM calculation doesn't use the second transaction because SQL thread shouldn't have gone idle..
      # SBM 3 was more recent than time since last transaction (4 seconds)
      

      Or, in another instance:

      bb-10.4-MDEV-30619 10.4.31 07745e9e6a00ff6089c68f48d453dfa368f064f4 (Optimized)

      ...
      rpl.rpl_parallel_sbm 'mix'               [ 19 pass ]  18622
      rpl.rpl_parallel_sbm 'mix'               [ 20 pass ]  19592
      rpl.rpl_parallel_sbm 'mix'               [ 21 fail ]
              Test ended at 2023-07-20 12:47:36
       
      CURRENT_TEST: rpl.rpl_parallel_sbm
      mysqltest: At line 177: Seconds_Behind_Master is too recent
       
      The result from queries just before the failure was:
      < snip >
      connection master;
      include/sync_slave_sql_with_master.inc
      # Lock t1 on slave so the first received transaction does not complete/commit
      connection slave;
      LOCK TABLES t1 WRITE;
      connection master;
      INSERT INTO t1 VALUES (1);
      include/save_master_gtid.inc
      connection slave;
      # Waiting for first transaction to begin..
      # Wait for SQL thread to catch up and idle
      # Sleep 2s to create a gap between events
      LOCK TABLES mysql.gtid_slave_pos WRITE;
      connection master;
      INSERT INTO t2 VALUES (1);
      include/save_master_gtid.inc
      connection slave;
      # Waiting for second transaction to begin..
      # Validate that the second transaction doesn't update Seconds_Behind_Master on reception after the SQL thread has gone idle..
      # Seconds_Behind_Master was 2 yet should not have been less than 3 (1689821254 - 1689821251)
      

      And:

      bb-10.4-MDEV-30619 10.4.31 07745e9e6a00ff6089c68f48d453dfa368f064f4 (Optimized)

      ...
      rpl.rpl_parallel_sbm 'mix'               [ 85 pass ]  18592
      rpl.rpl_parallel_sbm 'mix'               [ 86 pass ]  19673
      rpl.rpl_parallel_sbm 'mix'               [ 87 fail ]
              Test ended at 2023-07-20 13:33:27
       
      CURRENT_TEST: rpl.rpl_parallel_sbm
      mysqltest: At line 184: Seconds_Behind_Master is too old
       
      The result from queries just before the failure was:
      < snip >
      include/sync_slave_sql_with_master.inc
      # Lock t1 on slave so the first received transaction does not complete/commit
      connection slave;
      LOCK TABLES t1 WRITE;
      connection master;
      INSERT INTO t1 VALUES (1);
      include/save_master_gtid.inc
      connection slave;
      # Waiting for first transaction to begin..
      # Wait for SQL thread to catch up and idle
      # Sleep 2s to create a gap between events
      LOCK TABLES mysql.gtid_slave_pos WRITE;
      connection master;
      INSERT INTO t2 VALUES (1);
      include/save_master_gtid.inc
      connection slave;
      # Waiting for second transaction to begin..
      # Validate that the second transaction doesn't update Seconds_Behind_Master on reception after the SQL thread has gone idle..
      # ..and that its value correctly corresponds to the first transaction's timestamp on the primary
      # Seconds_Behind_Master was 3 yet should not have been more than 2 (1689824005 - 1689824003)
      

      It usually takes between 20 to 50 attempts to reproduce, though it can take 100+. System load may affect reproducibility, more load is presumably better.
      Observed on both debug and optimized builds as shown above.

      This particular test is not available prior to the MDEV-30619 fixes, and the previous testcase (executed against a pre-MDEV-30619 debug (this test requires debug) build of 10.4.31 at ea4b8d4ce99d0f427dd7ed40b5eeb15e41eec01e) :

      ./mtr --repeat 100 rpl.rpl_delayed_parallel_slave_sbm
      

      Does not produce any issue (400+ attempts).
      May affect later MariaDB versions also.

      Attachments

        Issue Links

          Activity

            Roel Roel Van de Paar added a comment - - edited

            Thank you for the patch bnestere. The bug is not fully resolved (line 180 still fails) as of the latest commit (006aca3111e0cbd5f172060774c74e328bea4b0a to bb-10.4-MDEV-30619). To speed up testing you can use:

            ./mtr --parallel=100 --repeat 1000 rpl.rpl_parallel_sbm
            

            Leads to:

            bb-10.4-MDEV-30619 10.4.31 006aca3111e0cbd5f172060774c74e328bea4b0a (Debug)

            ...
            rpl.rpl_parallel 'innodb,stmt'           w47 [ 32 pass ]   9594
            rpl.rpl_parallel 'innodb,row'            w34 [ 32 pass ]   9599
            rpl.rpl_parallel_retry 'innodb,stmt'     w46 [ 43 pass ]   8796
            rpl.rpl_parallel_multilevel2 'innodb,stmt' w57 [ 50 pass ]   7130
            rpl.rpl_parallel_sbm 'row'               w39 [ 20 fail ]
                    Test ended at 2023-07-22 12:29:42
             
            CURRENT_TEST: rpl.rpl_parallel_sbm
            mysqltest: At line 180: Seconds_Behind_Master is too recent
             
            The result from queries just before the failure was:
            < snip >
            # MDEV-30619: Parallel Slave SQL Thread Can Update Seconds_Behind_Master with Active Workers
            #
            connection slave;
            # Ensure the replica is fully idle before starting transactions
            # Lock t1 on slave so the first received transaction does not complete/commit
            LOCK TABLES t1 WRITE;
            connection master;
            insert into t1 values (3);
            include/save_master_gtid.inc
            connection slave;
            # Waiting for first transaction to begin..
            # Wait for SQL thread to catch up and idle
            connection master;
            # Sleep 2s to create a gap between events
            INSERT INTO t2 VALUES (1);
            include/save_master_gtid.inc
            connection slave;
            # Waiting for second transaction to begin..
            # Validate that the second transaction doesn't update Seconds_Behind_Master on reception when worker threads are busy..
            # Seconds_Behind_Master was 2 yet should not have been less than 3 (1689992980 - 1689992977)
            

            Issue observed >35 times. Takes about 5-10 minutes to reproduce on a high-end, medium-busy server.

            Roel Roel Van de Paar added a comment - - edited Thank you for the patch bnestere . The bug is not fully resolved (line 180 still fails) as of the latest commit ( 006aca3111e0cbd5f172060774c74e328bea4b0a to bb-10.4- MDEV-30619 ). To speed up testing you can use: ./mtr --parallel=100 --repeat 1000 rpl.rpl_parallel_sbm Leads to: bb-10.4-MDEV-30619 10.4.31 006aca3111e0cbd5f172060774c74e328bea4b0a (Debug) ... rpl.rpl_parallel 'innodb,stmt' w47 [ 32 pass ] 9594 rpl.rpl_parallel 'innodb,row' w34 [ 32 pass ] 9599 rpl.rpl_parallel_retry 'innodb,stmt' w46 [ 43 pass ] 8796 rpl.rpl_parallel_multilevel2 'innodb,stmt' w57 [ 50 pass ] 7130 rpl.rpl_parallel_sbm 'row' w39 [ 20 fail ] Test ended at 2023-07-22 12:29:42   CURRENT_TEST: rpl.rpl_parallel_sbm mysqltest: At line 180: Seconds_Behind_Master is too recent   The result from queries just before the failure was: < snip > # MDEV-30619: Parallel Slave SQL Thread Can Update Seconds_Behind_Master with Active Workers # connection slave; # Ensure the replica is fully idle before starting transactions # Lock t1 on slave so the first received transaction does not complete/commit LOCK TABLES t1 WRITE; connection master; insert into t1 values (3); include/save_master_gtid.inc connection slave; # Waiting for first transaction to begin.. # Wait for SQL thread to catch up and idle connection master; # Sleep 2s to create a gap between events INSERT INTO t2 VALUES (1); include/save_master_gtid.inc connection slave; # Waiting for second transaction to begin.. # Validate that the second transaction doesn't update Seconds_Behind_Master on reception when worker threads are busy.. # Seconds_Behind_Master was 2 yet should not have been less than 3 (1689992980 - 1689992977) Issue observed >35 times. Takes about 5-10 minutes to reproduce on a high-end, medium-busy server.

            After the 1.5 day weekend run (--repeat 1000 --parallel=50 of relevant tests), this issue was also observed 9 times in rpl.rpl_parallel_sbm:

            10.4.31 006aca3111e0cbd5f172060774c74e328bea4b0a (Debug)

            ...
            rpl.rpl_parallel_sbm 'row'               w28 [ 12 fail ]
                    Test ended at 2023-07-22 17:09:32
             
            CURRENT_TEST: rpl.rpl_parallel_sbm
            mysqltest: At line 187: Seconds_Behind_Master is too old
             
            The result from queries just before the failure was:
            < snip >
            #
            connection slave;
            # Ensure the replica is fully idle before starting transactions
            # Lock t1 on slave so the first received transaction does not complete/commit
            LOCK TABLES t1 WRITE;
            connection master;
            insert into t1 values (3);
            include/save_master_gtid.inc
            connection slave;
            # Waiting for first transaction to begin..
            # Wait for SQL thread to catch up and idle
            connection master;
            # Sleep 2s to create a gap between events
            INSERT INTO t2 VALUES (1);
            include/save_master_gtid.inc
            connection slave;
            # Waiting for second transaction to begin..
            # Validate that the second transaction doesn't update Seconds_Behind_Master on reception when worker threads are busy..
            # ..and that its value correctly corresponds to the first transaction's timestamp on the primary
            # Seconds_Behind_Master was 3 yet should not have been more than 2 (1690009769 - 1690009767)
            

            Roel Roel Van de Paar added a comment - After the 1.5 day weekend run (--repeat 1000 --parallel=50 of relevant tests), this issue was also observed 9 times in rpl.rpl_parallel_sbm : 10.4.31 006aca3111e0cbd5f172060774c74e328bea4b0a (Debug) ... rpl.rpl_parallel_sbm 'row' w28 [ 12 fail ] Test ended at 2023-07-22 17:09:32   CURRENT_TEST: rpl.rpl_parallel_sbm mysqltest: At line 187: Seconds_Behind_Master is too old   The result from queries just before the failure was: < snip > # connection slave; # Ensure the replica is fully idle before starting transactions # Lock t1 on slave so the first received transaction does not complete/commit LOCK TABLES t1 WRITE; connection master; insert into t1 values (3); include/save_master_gtid.inc connection slave; # Waiting for first transaction to begin.. # Wait for SQL thread to catch up and idle connection master; # Sleep 2s to create a gap between events INSERT INTO t2 VALUES (1); include/save_master_gtid.inc connection slave; # Waiting for second transaction to begin.. # Validate that the second transaction doesn't update Seconds_Behind_Master on reception when worker threads are busy.. # ..and that its value correctly corresponds to the first transaction's timestamp on the primary # Seconds_Behind_Master was 3 yet should not have been more than 2 (1690009769 - 1690009767)
            Elkin Andrei Elkin added a comment -

            Howdy Roel!

            I found an easy solution in simplifying the assert part of the test.
            Now it contains just one and it states that the 2nd measurement of SBM can't produce less than the $sleep parameter value. In the buggy "OLD" version it could.

            My --repeat 100 --parallel 8 is coming to an end w/o any issue.

            Please have a look.

            Cheers,
            Andrei

            Elkin Andrei Elkin added a comment - Howdy Roel! I found an easy solution in simplifying the assert part of the test. Now it contains just one and it states that the 2nd measurement of SBM can't produce less than the $sleep parameter value. In the buggy "OLD" version it could. My --repeat 100 --parallel 8 is coming to an end w/o any issue. Please have a look. Cheers, Andrei
            Elkin Andrei Elkin added a comment -

            BB and manual testing have confirmed the fixes.

            Elkin Andrei Elkin added a comment - BB and manual testing have confirmed the fixes.

            People

              Elkin Andrei Elkin
              Roel Roel Van de Paar
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.