MDEV-35098: rpl.rpl_mysqldump_gtid_slave_pos fails in buildbot


    Description

      The test rpl.rpl_mysqldump_gtid_slave_pos, added in MDEV-15393, fails somewhat frequently on buildbot with various failures. A few examples:

      rpl.rpl_mysqldump_gtid_slave_pos 'mix'   w4 [ fail ]
              Test ended at 2024-07-25 08:21:02
       
      CURRENT_TEST: rpl.rpl_mysqldump_gtid_slave_pos
      --- /usr/share/mysql-test/suite/rpl/r/rpl_mysqldump_gtid_slave_pos.result	2024-07-25 07:06:08.000000000 +0000
      +++ /dev/shm/var/4/log/rpl_mysqldump_gtid_slave_pos.reject	2024-07-25 08:21:01.514074976 +0000
      @@ -36,14 +36,9 @@
       select * from mysql.gtid_slave_pos;
       domain_id	sub_id	server_id	seq_no
       0	2	1	1
      -0	3	1	2
      -0	4	1	3
      -0	5	1	4
      -0	6	1	5
      -0	7	1	6
       select @@global.gtid_slave_pos as "after initial slave got in sync";
       after initial slave got in sync
      -0-1-6
      +0-1-1
       # 2.
       include/stop_slave.inc
       # 3. A
      @@ -63,15 +58,15 @@
       # 6.
       select * from mysql.gtid_slave_pos;
       domain_id	sub_id	server_id	seq_no
      -0	2	1	6
      +0	2	1	1
       select @@global.gtid_slave_pos as "after dump restore";
       after dump restore
      -0-1-6
      +0-1-1
       # 7.
       reset slave;
       select @@global.gtid_slave_pos;
       @@global.gtid_slave_pos
      -0-1-6
      +0-1-1
       change master to master_use_gtid=no;
       connection master;
       insert into t1 select 1+max(a),2 from t1;
      @@ -81,7 +76,7 @@
       include/sync_with_master_gtid.inc
       select * from mysql.gtid_slave_pos;
       domain_id	sub_id	server_id	seq_no
      -0	2	1	6
      +0	2	1	1
       0	3	1	1
       0	4	1	2
       0	5	1	3
      @@ -107,15 +102,15 @@
       # 6.
       select * from mysql.gtid_slave_pos;
       domain_id	sub_id	server_id	seq_no
      -0	2	1	6
      +0	2	2	2
       select @@global.gtid_slave_pos as "after dump restore";
       after dump restore
      -0-1-6
      +0-2-2
       # 7.
       reset slave;
       select @@global.gtid_slave_pos;
       @@global.gtid_slave_pos
      -0-1-6
      +0-2-2
       change master to master_use_gtid=no;
       connection master;
       insert into t1 select 1+max(a),1 from t1;
      @@ -125,7 +120,7 @@
       include/sync_with_master_gtid.inc
       select * from mysql.gtid_slave_pos;
       domain_id	sub_id	server_id	seq_no
      -0	2	1	6
      +0	2	2	2
       0	3	1	1
       0	4	1	2
       0	5	1	3
       
      Result length mismatch
      

      rpl.rpl_mysqldump_gtid_slave_pos 'stmt'  w2 [ fail ]  Found warnings/errors in server log file!
              Test ended at 2024-09-20 12:49:04
      line
      2024-09-20 12:49:00 268 [Warning] Master is configured to log replication events with checksum, but will not send such events to slaves that cannot process them
      ^ Found warnings in /run/shm/var/2/log/mysqld.1.err
      ok
       
       - saving '/run/shm/var/2/log/rpl.rpl_mysqldump_gtid_slave_pos-stmt/' to '/run/shm/var/log/rpl.rpl_mysqldump_gtid_slave_pos-stmt/'
       
      Retrying test rpl.rpl_mysqldump_gtid_slave_pos, attempt(2/3)...
      


          Activity

            Elkin Andrei Elkin added a comment -

            As an intermediate report: I tried to reproduce by running the sequence of tests executed by the failed mtr worker (e.g. w40) in buildbot, but without success.


            knielsen Kristian Nielsen added a comment -

            I took a quick look at what the test case does, and I see what the problem is.

            The test is selecting from mysql.gtid_slave_pos on the slave, but the contents of that table are not deterministic, for two reasons:
            1. Rows in that table are deleted in batches, one batch every --gtid_cleanup_batch_size rows. When the next batch deletion point is hit is effectively random, since it depends on how many GTIDs were replicated by previous test cases.
            2. Even if the batch deletion were controlled (e.g. by restarting mysqld.2 at the start of the test to reset the batch counter and forcing --gtid_cleanup_batch_size to a specific value, as sketched below), the contents would still be non-deterministic after the first delete batch is hit, since the deletion happens in the background in a separate thread.
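
            A minimal sketch of that controlled setup, written as a mysqltest fragment (the restart helper and the chosen batch size are assumptions, not a tested change):

            # Sketch only: restart the slave to reset its in-memory batch counter,
            # then pin the batch size so the point of the first cleanup is known.
            # Deletion past that point still runs in a background thread, so
            # exact-row checks remain fragile.
            --let $rpl_server_number= 2
            --source include/rpl_restart_server.inc
            --connection slave
            SET GLOBAL gtid_cleanup_batch_size= 2000000000;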

            I was able to reproduce on commit 13612464860f6007f2a9ce1a84bc0f444c6b6e7a as described in MDEV-35153:

            mtr --mem --no-reorder rpl.rpl_mysql_upgrade,stmt rpl.rpl_mysqldump_gtid_slave_pos,stmt
            

            Another way to see the problem is to set a different value of --gtid_cleanup_batch_size:

            mtr --mem rpl.rpl_mysqldump_gtid_slave_pos,stmt --mysqld=--gtid_cleanup_batch_size=2
            

            Possibly the test can be made stable simply by setting --gtid_cleanup_batch_size to a high value to disable the batch deletion of rows from mysql.gtid_slave_pos:

            diff --git a/mysql-test/suite/rpl/t/rpl_mysqldump_gtid_slave_pos.test b/mysql-test/suite/rpl/t/rpl_mysqldump_gtid_slave_pos.test
            index e66a92a6153..53a1c074b22 100644
            --- a/mysql-test/suite/rpl/t/rpl_mysqldump_gtid_slave_pos.test
            +++ b/mysql-test/suite/rpl/t/rpl_mysqldump_gtid_slave_pos.test
            @@ -3,6 +3,10 @@
             
             --echo # Initialize
             --connection slave
            +# Set a large deletion batch size to disable any background deletion of rows
             +# from mysql.gtid_slave_pos. Otherwise the contents of the table will be
            +# non-deterministic, causing this test to fail.
            +SET GLOBAL gtid_cleanup_batch_size= 2000000000;
             ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
             
             --echo # Setup data
            

            But perhaps it would be better to change the test to not depend on the exact data in the mysql.gtid_slave_pos table. I did not check in detail what the test is trying to achieve; maybe those SELECT * FROM mysql.gtid_slave_pos statements can simply be removed, if they are not essential for the test's purpose? See the sketch below for one alternative.
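
            For illustration only, a hypothetical replacement for those checks that does not depend on which rows the background cleanup has already removed (an untested sketch, not a proposed patch):

            # Hypothetical replacement for the exact-contents checks: verify only
            # the restored GTID position and that the table still has rows,
            # instead of listing every row.
            select @@global.gtid_slave_pos as "after dump restore";
            select count(*) > 0 as gtid_slave_pos_has_rows from mysql.gtid_slave_pos;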

            ycp Yuchen Pei added a comment -

            Thanks for investigating, knielsen. There is one difference, though: the regression in MDEV-35153 is 10.5-only and does not reproduce in 10.6. However, the invocation in your comment with --gtid_cleanup_batch_size fails on 10.6 at commit 7372ecc396517839a1546a8c1dd9192711922ef1 too:

            mtr --mem rpl.rpl_mysqldump_gtid_slave_pos,stmt --mysqld=--gtid_cleanup_batch_size=2
            

            Do you have an explanation for this discrepancy, or does this mean the present issue is a different problem from that of the regression in MDEV-35153?

            knielsen Kristian Nielsen added a comment - edited

            The explanation is that there is a global counter, pending_gtid_count, that starts at 0 when the server starts and is incremented for every GTID replicated. Every time the counter reaches the value of --gtid_cleanup_batch_size (64 by default), a background job starts and deletes unused rows. The rpl.rpl_mysqldump_gtid_slave_pos test case fails if this deletion happens to remove rows that would otherwise be returned by one of the SELECT * FROM mysql.gtid_slave_pos statements in the test. So whether the test fails depends entirely on the number of GTIDs replicated by prior test cases since the last slave restart, modulo 64; it is essentially random, though somewhat deterministic for a given set of test cases.
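
            As an illustrative debugging aid (not part of the test), one could log the relevant state on the slave just before each exact-contents check, to see whether a delete batch has already run at that point:

            # Illustration only: show the configured batch size and how many rows
            # the background cleanup has left in the table at this point.
            select @@global.gtid_cleanup_batch_size as batch_size;
            select count(*) as rows_left from mysql.gtid_slave_pos;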

            In your MDEV-35153 patch, you add one extra statement to be replicated in the prior testcase rpl.rpl_mysql_upgrade:

              +# MDEV-34716 Fix mysql.servers socket max length too short
              +ALTER TABLE servers
              +  MODIFY Socket char(108) NOT NULL DEFAULT '';
            

            This means the deletion happens one GTID earlier in rpl.rpl_mysqldump_gtid_slave_pos, which in your case happens to trigger the error. You can test this in your branch by running the test with --gtid_cleanup_batch_size set one higher, to 65; the extra replicated GTID is then cancelled out by the increased batch size, and the test failure no longer occurs:

            mtr --mem --no-reorder rpl.rpl_mysql_upgrade,stmt rpl.rpl_mysqldump_gtid_slave_pos,stmt --mysqld=--gtid_cleanup_batch_size=65
            

            As for 10.6, presumably something else changed so that a different number of GTIDs gets replicated in this test combination, and the deletion now happens at a point that does not trigger the failure in this particular situation. But the underlying problem is still there and would resurface with some other sequence of tests, or with new changes that push the deletion point back to a place where it causes the test failure.

            ycp Yuchen Pei added a comment -

            Thanks for explaining, knielsen. It sounds like this issue could block any change to the system table scripts with spurious (fake) test regressions; in that case, shouldn't this issue be given a higher priority?


            People

              Elkin Andrei Elkin
              bnestere Brandon Nesterenko