Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-33500

rpl.rpl_parallel_sbm can fail on slow machines, e.g. MSAN/Valgrind builders

Details

    Description

      In an addition to test rpl.rpl_parallel_sbm added by MDEV-32265, the test uses sleep statements alone to test Seconds_Behind_Master with delayed replication. On slow running machines, the test can pass the intended MASTER_DELAY duration and Seconds_Behind_Master can become 0, when the test expects the transaction to still be actively in a delaying state.

      The test failure resembles the following:

      rpl.rpl_parallel_sbm 'mix'               w20 [ fail ]
              Test ended at 2024-02-20 10:27:35
      CURRENT_TEST: rpl.rpl_parallel_sbm
      mysqltest: At line 85: Seconds_Behind_Master did not increase after sleeping, but should have
      

      And the following patch reproduces this consistently

      diff --git a/mysql-test/suite/rpl/t/rpl_parallel_sbm.test b/mysql-test/suite/rpl/t/rpl_parallel_sbm.test
      index 58c0db15e47..5baf6a56487 100644
      --- a/mysql-test/suite/rpl/t/rpl_parallel_sbm.test
      +++ b/mysql-test/suite/rpl/t/rpl_parallel_sbm.test
      @@ -77,6 +77,10 @@ if (`SELECT $sbm_trx1_arrive > ($seconds_since_idling + 1)`)
       --echo # Sleeping 1s to increment SBM
       sleep 1;
       
      +--disable_query_log
      +sleep 4;
      +--enable_query_log
      +
       --echo # Ensuring Seconds_Behind_Master increases after sleeping..
       --let $sbm_trx1_after_1s_sleep= query_get_value(SHOW SLAVE STATUS, Seconds_Behind_Master, 1)
       if (`SELECT $sbm_trx1_after_1s_sleep <= $sbm_trx1_arrive`)
      

      Attachments

        Issue Links

          Activity

            Pushed into in 10.5 as b04c8575

            bnestere Brandon Nesterenko added a comment - Pushed into in 10.5 as b04c8575
            bnestere Brandon Nesterenko added a comment - - edited

            Reopening, as marko noted on MDEV-33499 that this test can still fail.

            rpl.rpl_parallel_sbm 'mix'               w24 [ fail ]
                    Test ended at 2024-09-06 11:43:50
             
            CURRENT_TEST: rpl.rpl_parallel_sbm
            mysqltest: At line 92: Seconds_Behind_Master did not increase after sleeping, but should have
             
            The result from queries just before the failure was:
            < snip >
            # Lock t1 on slave to ensure the event can't finish (and thereby update
            # Seconds_Behind_Master) so slow running servers don't accidentally
            # catch up to the master before checking SBM.
            connection server_2;
            LOCK TABLES t1 WRITE;
            include/start_slave.inc
            connection slave;
            # Waiting for replica to get blocked by the table lock
            Timeout in wait_condition.inc for SELECT count(*) FROM information_schema.processlist WHERE state LIKE 'Waiting for table metadata lock';
            Id  User  Host  db  Command Time  State Info  Progress
            5 root  localhost:51970 test  Sleep 31    NULL  0.000
            10  root  localhost:51980 test  Query 0 starting  show full processlist 0.000
            11  root  localhost:51986 test  Sleep 71    NULL  0.000
            16  system user   NULL  Slave_IO  31  Waiting for master to send event  NULL  0.000
            18  system user   NULL  Slave_worker  31  Waiting for work from SQL thread  NULL  0.000
            19  system user   NULL  Slave_worker  31  Waiting for work from SQL thread  NULL  0.000
            17  system user   NULL  Slave_SQL 31  Slave has read all relay log; waiting for more updates  NULL  0.000
            # Sleeping 1s to increment SBM 
            # Ensuring Seconds_Behind_Master increas
            

            bnestere Brandon Nesterenko added a comment - - edited Reopening, as marko noted on MDEV-33499 that this test can still fail. rpl.rpl_parallel_sbm 'mix' w24 [ fail ] Test ended at 2024-09-06 11:43:50   CURRENT_TEST: rpl.rpl_parallel_sbm mysqltest: At line 92: Seconds_Behind_Master did not increase after sleeping, but should have   The result from queries just before the failure was: < snip > # Lock t1 on slave to ensure the event can't finish (and thereby update # Seconds_Behind_Master) so slow running servers don't accidentally # catch up to the master before checking SBM. connection server_2; LOCK TABLES t1 WRITE; include/start_slave.inc connection slave; # Waiting for replica to get blocked by the table lock Timeout in wait_condition.inc for SELECT count(*) FROM information_schema.processlist WHERE state LIKE 'Waiting for table metadata lock'; Id User Host db Command Time State Info Progress 5 root localhost:51970 test Sleep 31 NULL 0.000 10 root localhost:51980 test Query 0 starting show full processlist 0.000 11 root localhost:51986 test Sleep 71 NULL 0.000 16 system user NULL Slave_IO 31 Waiting for master to send event NULL 0.000 18 system user NULL Slave_worker 31 Waiting for work from SQL thread NULL 0.000 19 system user NULL Slave_worker 31 Waiting for work from SQL thread NULL 0.000 17 system user NULL Slave_SQL 31 Slave has read all relay log; waiting for more updates NULL 0.000 # Sleeping 1s to increment SBM # Ensuring Seconds_Behind_Master increas

            Hi Elkin!

            Please see my fix for the test PR-3506.

            bnestere Brandon Nesterenko added a comment - Hi Elkin ! Please see my fix for the test PR-3506 .
            Elkin Andrei Elkin added a comment -

            Approved.

            Elkin Andrei Elkin added a comment - Approved.
            bnestere Brandon Nesterenko added a comment - - edited

            Pushed second fix into 10.5 as 68938d2b424

            There is a merge conflict in 11.4 with a fix in branch bb-11.4-MDEV-33500-v2-mergefix

            bnestere Brandon Nesterenko added a comment - - edited Pushed second fix into 10.5 as 68938d2b424 There is a merge conflict in 11.4 with a fix in branch bb-11.4-MDEV-33500-v2-mergefix

            The second iteration of fixes needed additional fixes in 11.6 for rpl.rpl_parallel_sbm.

            MDEV-33856 added a check for Master_Slave_time_diff in rpl.rpl_parallel_sbm in 11.6, which relied on a 3 second master_delay. The second iteration fix for MDEV-33500 (68938d2b42474a6ff1fced3b1495ec45de2f5c47) replaced the time based master_delay in rpl.rpl_parallel_sbm with debug_sync, thus removing the expected 3 seconds of execution time, triggering an assertion.

            The 11.6 merge fixes the test by making the assertion consistent with the MDEV-33500 debug_sync master_delay changes, i.e. by removing the 3 second delay from the assertion calculation.

            bnestere Brandon Nesterenko added a comment - The second iteration of fixes needed additional fixes in 11.6 for rpl.rpl_parallel_sbm. MDEV-33856 added a check for Master_Slave_time_diff in rpl.rpl_parallel_sbm in 11.6, which relied on a 3 second master_delay. The second iteration fix for MDEV-33500 (68938d2b42474a6ff1fced3b1495ec45de2f5c47) replaced the time based master_delay in rpl.rpl_parallel_sbm with debug_sync, thus removing the expected 3 seconds of execution time, triggering an assertion. The 11.6 merge fixes the test by making the assertion consistent with the MDEV-33500 debug_sync master_delay changes, i.e. by removing the 3 second delay from the assertion calculation.

            People

              bnestere Brandon Nesterenko
              bnestere Brandon Nesterenko
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.