MariaDB Server, MDEV-33331

IO Thread Relay Log Inconsistent Statistics After MDEV-32551


Details

    Description

      After MDEV-32551, in a master/slave setup, if the replica's IO thread reconnects quickly (i.e., STOP SLAVE IO_THREAD is immediately followed by START SLAVE IO_THREAD), the relay log rotation behavior changes. Additionally, there is a short window in which the SHOW SLAVE STATUS field Slave_IO_Running can be YES while Master_Log_File is still empty, which was not observed before MDEV-32551.

      These issues make for unstable MTR tests that either 1) rely on consistent relay logging behavior, e.g. rpl_mariadb_slave_capability (in 10.11+), or 2) rely on binlog coordinates after `start_slave.inc` on a replica with an empty state, e.g. after `RESET SLAVE` in rpl_using_gtid_default.

      This is due to the primary-side changes to kill_zombie_dump_threads():

      • kill_zombie_dump_threads() now does killing of dump threads properly.
      • It can now kill several threads (should be impossible but could
        happen if IO slaves reconnect very fast).
      • We now wait until the dump thread is done before starting the
        dump.

      That is, kill_zombie_dump_threads() now kills threads properly, and binlog dump threads now kill themselves if they see another connection with the same server_id. Concretely, we get inconsistent relay logs as follows:

      1. Slave: START SLAVE IO_THREAD; --source include/wait_for_slave_io_to_start.inc : Start the IO thread as normal and wait for Slave_IO_Running==YES. This only waits for the initial handshake to complete. At this point the slave has not yet received anything from the master's binlog dump thread, in particular the fake rotate event, which initializes the name of the binary log to read.
      2. Master: The master's binlog dump thread tries to send the fake rotate event to initialize the binlog name on the slave.
      3. Slave: STOP SLAVE IO_THREAD. The issue is that the slave can be stopped before it receives the initial fake rotate event, because receipt of that event is outside of what start_slave.inc checks.
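
The three steps above can be sketched as a toy model. This is not MariaDB code: the class, the event names, and the binlog name master-bin.000001 are all hypothetical and only illustrate how the order of "stop" versus "fake rotate received" decides whether Master_Log_File is ever set.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    """Toy stand-in for the two SHOW SLAVE STATUS fields discussed above."""
    slave_io_running: str = "No"
    master_log_file: str = ""


def run_sequence(events):
    """Apply IO-thread events in order; events after a stop are never seen."""
    status = ReplicaStatus()
    for event in events:
        if event == "handshake_done":
            # Step 1: this is all that the wait for Slave_IO_Running==YES covers.
            status.slave_io_running = "Yes"
        elif event == "fake_rotate_received":
            # Step 2: the fake rotate event names the binary log to read.
            # The file name is an illustrative placeholder.
            status.master_log_file = "master-bin.000001"
        elif event == "stop_io_thread":
            # Step 3: STOP SLAVE IO_THREAD ends processing.
            status.slave_io_running = "No"
            break
    return status


# Stop arrives after the fake rotate event: the binlog name was initialized.
slow = run_sequence(["handshake_done", "fake_rotate_received", "stop_io_thread"])
# Stop arrives before the fake rotate event: the binlog name stays empty.
fast = run_sequence(["handshake_done", "stop_io_thread", "fake_rotate_received"])

print(repr(slow.master_log_file))
print(repr(fast.master_log_file))
```

In the model, both sequences pass the Slave_IO_Running==YES wait, yet only the first one leaves the replica with a binlog name.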

      I wonder if the initial fake rotate log event should be a part of the "handshake", so that Slave_IO_Running is not changed to "Yes" until the replica receives it from the primary.
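
A minimal sketch of how this idea would change what a test observes, again as a toy model rather than server code. The function name, the rotate_in_handshake flag, the event names, and the binlog name are hypothetical.

```python
def observed_status(events, rotate_in_handshake):
    """Return (reports_yes, master_log_file) after the given events.

    reports_yes models whether Slave_IO_Running would read "Yes".
    """
    connected = False
    master_log_file = ""
    for event in events:
        if event == "handshake_done":
            connected = True
        elif event == "fake_rotate_received":
            # Illustrative placeholder for the primary's binlog name.
            master_log_file = "master-bin.000001"
    if rotate_in_handshake:
        # Suggested behavior: "Yes" only once the fake rotate event arrived.
        reports_yes = connected and master_log_file != ""
    else:
        # Behavior described in this report: "Yes" right after the handshake.
        reports_yes = connected
    return reports_yes, master_log_file


# The window between the handshake and the fake rotate event.
window = ["handshake_done"]
current = observed_status(window, rotate_in_handshake=False)
proposed = observed_status(window, rotate_in_handshake=True)

print(current)
print(proposed)
```

Under this assumption, a test that waits for Slave_IO_Running to become "Yes" could no longer see an empty Master_Log_File in that window.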

            People

              bnestere Brandon Nesterenko