Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-34778

Inconsistent Seconds_Behind_Master after slave stop+start (rpl.rpl_old_master sporadic failure)

    XMLWordPrintable

Details

    • Bug
    • Status: Open (View Workflow)
    • Major
    • Resolution: Unresolved
    • 11.4
    • None
    • Replication
    • None

    Description

      This is seen as sporadic failure of test case rpl.rpl_old_master, for
      example in commit 5dc2fe4815ddfb19c406f225576040882fe62b02:

      rpl.rpl_old_master 'mix'                 [ fail ]
              Test ended at 2024-08-19 19:37:08
       
      CURRENT_TEST: rpl.rpl_old_master
      mysqltest: In included file "./include/wait_for_slave_param.inc": 
      included from /kvm/src/my/sn-mdev30255/mysql-test/suite/rpl/t/rpl_old_master.test at line 38:
      At line 111: Timeout in include/wait_for_slave_param.inc
      

      The root cause is a real bug or at least inconsistency in the handling of
      Seconds_Behind_Master around slave restart. The issue seems to be related to
      changes done in this commit:

      commit d69e835787f9ce9848cb6c2a5343887dd0eec2ce
      Author: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
      Date:   Thu Nov 3 14:56:50 2022 -0600
       
          MDEV-29639: Seconds_Behind_Master is incorrect for Delayed, Parallel Replicas
      

      I am not sure it is worth to try and fix the problem in the code.
      Seconds_Behind_Master has historically had many problems with inconsistent
      behaviour, and MDEV-33856 provides a better solution. But I need to silence
      the failure in the test case, so I wanted to document the underlying root
      cause.

      The failure can be triggered reliably with the following patch, on commit
      5dc2fe4815ddfb19c406f225576040882fe62b02 for example:

      diff --git a/mysql-test/suite/rpl/t/rpl_old_master.test b/mysql-test/suite/rpl/t/rpl_old_master.test
      index 6faa8212d66..37594444472 100644
      --- a/mysql-test/suite/rpl/t/rpl_old_master.test
      +++ b/mysql-test/suite/rpl/t/rpl_old_master.test
      @@ -6,6 +6,7 @@
       --source include/master-slave.inc
       
       --connection slave
      +--sleep 15
       --source include/stop_slave.inc
       
       --connection master
      diff --git a/sql/slave.cc b/sql/slave.cc
      index d768d3e6442..22928196990 100644
      --- a/sql/slave.cc
      +++ b/sql/slave.cc
      @@ -4670,6 +4670,7 @@ static int exec_relay_log_event(THD* thd, Relay_log_info* rli,
               });
       #endif
       
      +if (typ == GTID_LIST_EVENT) my_sleep(10000000);
           DBUG_RETURN(exec_res);
         }
         mysql_mutex_unlock(&rli->data_lock);
      @@ -6117,6 +6118,7 @@ static int queue_event(Master_info* mi, const uchar *buf, ulong event_len)
                     checksum_alg == BINLOG_CHECKSUM_ALG_CRC32); 
       
         DBUG_ENTER("queue_event");
      +  if (buf[EVENT_TYPE_OFFSET] == GTID_LIST_EVENT) my_sleep(10000000);
         /*
           FD_queue checksum alg description does not apply in a case of
           FD itself. The one carries both parts of the checksum data.
      

      What happens is the following:

      1. The slave starts at the start of the testcase (include/master-slave.inc).

      2. The I/O thread writes the event prior to the Gtid_List event (for
      example) to the relay log, then goes to sleep waiting for more events from
      the master.

      3. The SQL thread in next_event() reaches the current end of the relay log,
      it sets sql_thread_caught_up=true and goes to sleep waiting for more events
      to be written:

          if (!cur_log->error) /* EOF */ {
            if (hot_log) {
              rli->sql_thread_caught_up= true;
      

      4. The I/O thread wakes up, writes the next event (Gtid_List in this case)
      to the relay log.

      5. The SQL thread wakes up, reads Gtid_List, returns it from next_event().

      6. The sql thread in exec_relay_log_event() does not set
      sql_thread_caught_up back to false, because the Gtid_List event has
      event_can_update_last_master_timestamp()=false; this code from the
      MDEV-29639 patch:

          if ((!rli->mi->using_parallel()) &&
              event_can_update_last_master_timestamp(ev)) {
            rli->last_master_timestamp= ev->when + (time_t) ev->exec_time;
            rli->sql_thread_caught_up= false;
      

      7. The SQL thread gets killed just after exec_relay_log_event() has
      applied the Gtid_List event. This causes the SQL thread to exit from a
      different code path than this one in next_event from MDEV-29639:

        /*
          Set that we are not caught up so if there is a hang/problem on restart,
          Seconds_Behind_Master will still grow.
        */
        rli->sql_thread_caught_up= false;
        DBUG_RETURN(0);
      

      This is the inconsistency (or bug) seemingly left from the MDEV-29639 patch.
      The sql_thread_caught_up will be set false or not at random at slave stop,
      depending on exactly where the kill signal is detected in the SQL thread.

      8. The slave stops with sql_thread_caught_up left at the value true. When
      the slave starts again, it will initially report Seconds_Behind_Master as 0
      until sql_thread_caught_up gets set back to false, due to this code in
      send_show_master_info_data():

              idle= mi->rli.sql_thread_caught_up;
            ...
            if (idle)
              time_diff= 0;
      

      9. In the test case rpl.rpl_old_master, the slave is started while FLUSH
      TABLES WITH READ LOCK is in effect. After MDEV-12179, this blocks the slave
      thread from doing anything (due to find_gtid_slave_pos_tables()) until
      UNLOCK TABLES. Since the test case waits for Seconds_Behind_Master to become
      non-zero before it does UNLOCK TABLES, this causes the test case to hang and
      eventually time out.

      So long and complex story.
      A fix could be to ensure that sql_thread_caught_up gets set consistently to
      false when the SQL thread exits due to being killed, no matter where in the
      code the kill is detected. But as written above, I do not think it is worth
      it to try to fix this (and risk introducing other inconsistencies or bugs in
      the very complex behaviour of Seconds_Behind_Master).

      For the test failure, the test is anyway no longer testing what it was
      supposed to, after MDEV-12179. So I will try to fix the test or just remove
      that bit from the test. It is testing MDEV-14528, that Seconds_Behind_Master
      will be updated when doing parallel replication from a pre-10.0 master,
      which does not seem a critical thing to test for now that 10.5 is earliest
      supported release.

      Attachments

        Issue Links

          Activity

            People

              bnestere Brandon Nesterenko
              knielsen Kristian Nielsen
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.