MariaDB Server / MDEV-33602

rpl.rpl_gtid_stop_start At line 187: sync_with_master failed: 'select master_pos_wait('master-bin.000005', 1186, 300, '')' returned NULL indicating slave SQL thread failure

Details

    Description

      This appears to be different from the other open MDEVs on the same test case (I could not find existing instances of this particular failure).

      cr 10.6

      https://buildbot.mariadb.org/#/builders/481/builds/13235 10.6 3907345e2

      rpl.rpl_gtid_stop_start 'innodb,mix'     w11 [ fail ]
              Test ended at 2024-02-12 18:05:48
       
      CURRENT_TEST: rpl.rpl_gtid_stop_start
      analyze: sync_with_master
      mysqltest: At line 187: sync_with_master failed: 'select master_pos_wait('master-bin.000005', 1186, 300, '')' returned NULL indicating slave SQL thread failure
       
      The result from queries just before the failure was:
      < snip >
      include/wait_for_slave_to_start.inc
      connection server_1;
      connection server_2;
      SELECT * FROM t1 ORDER BY a;
      a
      1
      2
      

      11.4 example

      Attachments

        Activity

          I was not able to reproduce this locally, but I was able to analyze what I believe is the root cause from server error logs from buildbot failures.

          2024-04-23 11:19:57 6 [Note] Slave SQL thread initialized, starting replication in log 'master-bin.000005' at position 979, relay log './slave-relay-bin.000001' position: 4; GTID position '0-1-7,1-1-1'
          2024-04-23 11:19:57 0 [Note] /home/buildbot/amd64-ubuntu-2310/build/sql/mariadbd: ready for connections.
          Version: '10.11.8-MariaDB-log'  socket: '/home/buildbot/amd64-ubuntu-2310/build/mysql-test/var/tmp/5/mysqld.2.sock'  port: 16081  Source distribution
          2024-04-23 11:19:57 6 [Note] Error reading relay log event: slave SQL thread was killed
          2024-04-23 11:19:57 6 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000005' at position 256; GTID position '0-1-7,1-1-1', master: 127.0.0.1:16080
          2024-04-23 11:19:57 5 [Note] Slave I/O thread exiting, read up to log 'master-bin.000005', position 315; GTID position 1-1-1,0-1-7, master 127.0.0.1:16080
          2024-04-23 11:19:57 7 [Note] Master connection name: ''  Master_info_file: 'master.info'  Relay_info_file: 'relay-log.info'
          2024-04-23 11:19:57 7 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='127.0.0.1', master_port='16080', master_log_file='master-bin.000005', master_log_pos='315'. New state master_host='127.0.0.1', master_port='16080', master_log_file='master-bin.000005', master_log_pos='256'.
          2024-04-23 11:19:57 7 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=No
          

          Here we see the slave is started in GTID mode; the current GTID position is 0-1-7,1-1-1, which corresponds to old-style position 'master-bin.000005' / 979. But the slave threads are immediately stopped while receiving the initial parts of the binlog (Format_description, Gtid_list, Binlog_checkpoint), before reaching the starting GTID position, and in particular before receiving the old-style position corresponding to that point. Thus the old-style position points too early, at position 256, when the test subsequently restarts the slave in non-GTID mode. This leads to re-applying old events and a duplicate key error.
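          This interaction can be sketched as a small simulation (hypothetical and much simplified, not server code; event offsets are taken from the log excerpt above): the slave's old-style position is overwritten by each event it receives, so killing the SQL thread while it is still processing the binlog header events leaves the position far before the real starting point.

```python
# Simplified model of how the slave's old-style position regresses
# during an interrupted GTID connect. Not server code; offsets are
# illustrative, matching the log excerpt above.

# Header events the master sends before any transactions.
HEADER_EVENTS = [
    ("Format_description", 256),
    ("Gtid_list", 315),
    ("Binlog_checkpoint", 404),
]
TARGET_OLD_POS = 979  # old-style position matching GTID 0-1-7,1-1-1

def replay(events, stop_after=None):
    """Return the old-style position recorded by the slave.

    stop_after: kill the SQL thread after this many events,
    simulating an immediate STOP SLAVE during the GTID connect.
    """
    old_pos = 4  # position just after the binlog magic header
    for i, (_name, end_pos) in enumerate(events):
        old_pos = end_pos  # each event overwrites the old-style position
        if stop_after is not None and i + 1 >= stop_after:
            return old_pos
    # Connect completed: the master's fake Gtid_list carries the
    # correct old-style position for the starting GTID.
    return TARGET_OLD_POS

# Slave killed right after Format_description: position points at 256,
# far before the real starting point 979.
print(replay(HEADER_EVENTS, stop_after=1))  # -> 256
# Connect allowed to complete: correct position.
print(replay(HEADER_EVENTS))                # -> 979
```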

          I'm not sure if this should be considered a bug or not. In the general case, we cannot set a meaningful and correct old-style position in all cases until the GTID connect has reached the starting GTID position. The "current position" could correspond to different points in the master's binlog (in case of multiple domains in the GTID position), which cannot be represented in the old-style position at all. Or the corresponding position may simply be unknown right after SET GLOBAL gtid_slave_pos.
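          The multi-domain case can be illustrated concretely (the GTID-to-binlog mapping below is hypothetical, except for the 0-1-7 entry taken from the log above): each domain's GTID may correspond to a different point in the master's binlog, and a single (file, offset) pair cannot represent more than one of them.

```python
# Illustrative only: with two replication domains, the GTID position
# names two distinct points in the master's binlog. The mapping for
# domain 1 is made up; domain 0 matches the log excerpt above.
binlog_point_of_gtid = {
    "0-1-7": ("master-bin.000005", 979),  # domain 0
    "1-1-1": ("master-bin.000002", 512),  # domain 1 (hypothetical)
}
gtid_slave_pos = ["0-1-7", "1-1-1"]
points = {binlog_point_of_gtid[g] for g in gtid_slave_pos}
# No single old-style (file, offset) position covers both points:
print(len(points))  # -> 2
```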

          Though in this particular case, the position is already known (from an earlier GTID connect), so perhaps the server should try to hold on to this correct position, and not overwrite it with older positions while replicating Binlog_checkpoint and so on? Not sure.

          Note that the state of not knowing, or having an incorrect, old-style position only exists for a brief period: as soon as the GTID connect completes, the master sends a fake GTID list containing the correct position. In fact, this is precisely what is tested in the test case.

          If we decide this is "not a bug", I think the sporadic test failure can be eliminated by injecting an extra event and syncing the slave to it before switching to non-GTID mode. I will post a patch for this to the mailing list for review.
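          The sync step that fails here (and that the work-around adds more of) relies on master_pos_wait(). A minimal Python model of its blocking semantics (a hypothetical helper, not the server's implementation):

```python
import time

def master_pos_wait_sketch(get_slave_pos, target_pos, timeout=1.0, poll=0.01):
    """Model of the semantics the test relies on: block until the
    slave's applied old-style position reaches target_pos, returning
    None on timeout. This mirrors SELECT master_pos_wait(...)
    returning NULL when the slave never catches up, as in the
    failure message at the top of this report."""
    deadline = time.monotonic() + timeout
    while True:
        pos = get_slave_pos()
        if pos >= target_pos:
            return pos
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)

# A slave stuck at the stale position 256 never reaches 1186:
print(master_pos_wait_sketch(lambda: 256, 1186, timeout=0.05))  # -> None
# A slave already past the target returns immediately:
print(master_pos_wait_sketch(lambda: 1200, 1186))  # -> 1200
```

Syncing the slave to an extra injected event before the mode switch guarantees the old-style position has advanced past the stale header-event positions, so the subsequent non-GTID restart does not re-apply old events.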

          knielsen Kristian Nielsen added a comment

          Pushed a work-around for this to 10.5, we will see in buildbot if it solves the problem.
          Filed a separate bug MDEV-33996 for the underlying issue with temporary incorrect old-style position during GTID connect.

          knielsen Kristian Nielsen added a comment

          People

            knielsen Kristian Nielsen
            danblack Daniel Black

