MariaDB Server / MDEV-29934

rpl.rpl_start_alter_chain_basic, rpl.rpl_start_alter_restart_slave sometimes fail in BB with result content mismatch

Details

    Description

      rpl.rpl_start_alter_chain_basic

      https://buildbot.mariadb.org/#/builders/203/builds/11894

      10.11 307d935e2

      rpl.rpl_start_alter_chain_basic 'innodb,stmt' w8 [ fail ]
              Test ended at 2022-10-24 16:10:18
       
      CURRENT_TEST: rpl.rpl_start_alter_chain_basic
      --- /home/buildbot/amd64-ubuntu-1804/build/mysql-test/suite/rpl/r/rpl_start_alter_chain_basic.result	2022-10-24 15:35:45.000000000 +0000
      +++ /home/buildbot/amd64-ubuntu-1804/build/mysql-test/suite/rpl/r/rpl_start_alter_chain_basic.reject	2022-10-24 16:10:17.810733190 +0000
      @@ -67,7 +67,7 @@
       connection server_3;
       select domain_id, seq_no from mysql.gtid_slave_pos order by seq_no desc limit 1;
       domain_id	seq_no
      -0	12
      +0	11
       include/stop_slave.inc
       set global slave_parallel_threads = 0;;
       set global slave_parallel_mode = optimistic;;
       
      mysqltest: Result content mismatch
      

      Seems to happen only on amd64 platforms. Could not reproduce locally with test repeats.

      rpl.rpl_start_alter_restart_slave

      The following output occurs on amd64 platforms.

      0619127290e6d336

      rpl.rpl_start_alter_restart_slave 'innodb,mix' w12 [ fail ]
              Test ended at 2023-01-20 16:04:19
       
      CURRENT_TEST: rpl.rpl_start_alter_restart_slave
      --- /home/buildbot/amd64-ubuntu-2204-debug-ps/build/mysql-test/suite/rpl/r/rpl_start_alter_restart_slave.result	2023-01-20 15:58:58.000000000 +0000
      +++ /home/buildbot/amd64-ubuntu-2204-debug-ps/build/mysql-test/suite/rpl/r/rpl_start_alter_restart_slave.reject	2023-01-20 16:04:19.044075982 +0000
      @@ -83,7 +83,7 @@
       # Everything from the master binlog must have been applied now:
       select domain_id, seq_no from mysql.gtid_slave_pos order by seq_no desc limit 1;
       domain_id	seq_no
      -0	7
      +0	6
       # slave gtid state is 0-1-7
       # The list of events after the slave has synchronized must have both CA:
       show binlog events  from <binlog_start>;
       
      mysqltest: Result content mismatch
      

      Attachments

        1. mysqld.1.err
          1.04 MB
        2. mysqld.2.err
          1.11 MB
        3. mysqld.3.err
          100 kB
        4. mysqld.4.err
          51 kB
        5. var.tar.gz
          3.10 MB

          Activity

            angelique.sklavounos Angelique Sklavounos (Inactive) added a comment - Error logs and var directory for https://buildbot.mariadb.org/#/builders/172/builds/9990 attached.

            For rpl.rpl_start_alter_chain_basic, all servers should be synced by include/rpl_sync.inc, as the code below (10.8 88c35781) shows. The mismatch seems to occur only on server_3, which has parallel threads and gtid_strict_mode=1.

             --source include/rpl_sync.inc


             --connection server_2
             select domain_id, seq_no from mysql.gtid_slave_pos order by seq_no desc limit 1;

             --connection server_3
             select domain_id, seq_no from mysql.gtid_slave_pos order by seq_no desc limit 1;
            

            For rpl.rpl_start_alter_restart_slave, the slave (which, like server_3 in rpl_start_alter_chain_basic, also has parallel threads and gtid_strict_mode=1) should likewise be synced by sync_slave_with_master right before the mismatched select domain_id, seq_no…:

             --source include/start_slave.inc
             --connection master
             --sync_slave_with_master
             --echo # Everything from the master binlog must have been applied now:
             select domain_id, seq_no from mysql.gtid_slave_pos order by seq_no desc limit 1;
             --let $slave_gtid_state = `select @@gtid_binlog_state`
             --echo # slave gtid state is $slave_gtid_state
            

            In this mismatch, gtid_slave_pos is 0-1-6 but gtid_binlog_state is 0-1-7. I wonder whether using sync_with_master_gtid.inc would be more suitable.
            Also, I don't understand why set debug_sync="now wait_for CA_1_processing"; and set debug_sync="now signal proceed_CA_1"; are commented out. Were they only there for debugging during development and are no longer needed? Or are they actually needed, and their inclusion was overlooked?

            I added a check of gtid_binlog_state for server_3 to rpl.rpl_start_alter_chain_basic, and a call to sync_with_master_gtid.inc to rpl.rpl_start_alter_restart_slave. The change is here: https://github.com/MariaDB/server/commit/943989c9ef9b7d01333ad14a09547585666c9eeb (I incorrectly put 29943 instead of 29934).
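
            A minimal sketch of the GTID-based sync suggested above, assuming the default master and slave connections of rpl.rpl_start_alter_restart_slave; it illustrates the idea and is not copied from the linked commit. --sync_slave_with_master waits on the master's binlog file and offset, whereas save_master_gtid.inc followed by sync_with_master_gtid.inc waits until the slave has reached the master's @@gtid_binlog_pos:

             --connection master
             # Remember the master's current GTID position
             --source include/save_master_gtid.inc
             --connection slave
             # Wait until the slave's GTID position has caught up to it
             --source include/sync_with_master_gtid.inc
             --echo # Everything from the master binlog must have been applied now:
             select domain_id, seq_no from mysql.gtid_slave_pos order by seq_no desc limit 1;

            Whether this removes the sporadic mismatch would still need to be confirmed on the affected amd64 builders.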


            People

              Elkin Andrei Elkin
              angelique.sklavounos Angelique Sklavounos (Inactive)