[MDEV-29934] rpl.rpl_start_alter_chain_basic, rpl.rpl_start_alter_restart_slave sometimes fail in BB with result content mismatch Created: 2022-11-02  Updated: 2024-01-30

Status: Open
Project: MariaDB Server
Component/s: Replication, Tests
Affects Version/s: 10.8, 10.9, 10.10, 10.11, 11.0
Fix Version/s: 10.11, 11.0

Type: Bug Priority: Major
Reporter: Angelique Sklavounos (Inactive) Assignee: Andrei Elkin
Resolution: Unresolved Votes: 0
Labels: None

Attachments: File mysqld.1.err     File mysqld.2.err     File mysqld.3.err     File mysqld.4.err     File var.tar.gz    
Issue Links:
Relates
relates to MDEV-30460 rpl.rpl_start_alter_restart_slave som... Open

 Description   

rpl.rpl_start_alter_chain_basic

https://buildbot.mariadb.org/#/builders/203/builds/11894

10.11 307d935e2

rpl.rpl_start_alter_chain_basic 'innodb,stmt' w8 [ fail ]
        Test ended at 2022-10-24 16:10:18
 
CURRENT_TEST: rpl.rpl_start_alter_chain_basic
--- /home/buildbot/amd64-ubuntu-1804/build/mysql-test/suite/rpl/r/rpl_start_alter_chain_basic.result	2022-10-24 15:35:45.000000000 +0000
+++ /home/buildbot/amd64-ubuntu-1804/build/mysql-test/suite/rpl/r/rpl_start_alter_chain_basic.reject	2022-10-24 16:10:17.810733190 +0000
@@ -67,7 +67,7 @@
 connection server_3;
 select domain_id, seq_no from mysql.gtid_slave_pos order by seq_no desc limit 1;
 domain_id	seq_no
-0	12
+0	11
 include/stop_slave.inc
 set global slave_parallel_threads = 0;;
 set global slave_parallel_mode = optimistic;;
 
mysqltest: Result content mismatch

Seems to happen only on amd64 platforms. Could not reproduce locally with test repeats.

rpl.rpl_start_alter_restart_slave

The following output occurs on amd64 platforms.

0619127290e6d336

rpl.rpl_start_alter_restart_slave 'innodb,mix' w12 [ fail ]
        Test ended at 2023-01-20 16:04:19
 
CURRENT_TEST: rpl.rpl_start_alter_restart_slave
--- /home/buildbot/amd64-ubuntu-2204-debug-ps/build/mysql-test/suite/rpl/r/rpl_start_alter_restart_slave.result	2023-01-20 15:58:58.000000000 +0000
+++ /home/buildbot/amd64-ubuntu-2204-debug-ps/build/mysql-test/suite/rpl/r/rpl_start_alter_restart_slave.reject	2023-01-20 16:04:19.044075982 +0000
@@ -83,7 +83,7 @@
 # Everything from the master binlog must have been applied now:
 select domain_id, seq_no from mysql.gtid_slave_pos order by seq_no desc limit 1;
 domain_id	seq_no
-0	7
+0	6
 # slave gtid state is 0-1-7
 # The list of events after the slave has synchronized must have both CA:
 show binlog events  from <binlog_start>;
 
mysqltest: Result content mismatch



 Comments   
Comment by Angelique Sklavounos (Inactive) [ 2022-12-07 ]

Error logs and var directory for https://buildbot.mariadb.org/#/builders/172/builds/9990 attached.

Comment by Angelique Sklavounos (Inactive) [ 2023-01-25 ]

For rpl.rpl_start_alter_chain_basic, all servers should be synced by include/rpl_sync.inc, as the code below (10.8 88c35781) shows. The mismatch only seems to occur on server_3, which has parallel threads and gtid_strict_mode=1.

 --source include/rpl_sync.inc


 --connection server_2
 select domain_id, seq_no from mysql.gtid_slave_pos order by seq_no desc limit 1;

 --connection server_3
 select domain_id, seq_no from mysql.gtid_slave_pos order by seq_no desc limit 1;

For rpl.rpl_start_alter_restart_slave, the slave (which, like server_3 in rpl_start_alter_chain_basic, also has parallel threads and gtid_strict_mode=1) should likewise be synced by sync_slave_with_master right before the mismatched select domain_id, seq_no…:

 --source include/start_slave.inc
 --connection master
 --sync_slave_with_master
 --echo # Everything from the master binlog must have been applied now:
 select domain_id, seq_no from mysql.gtid_slave_pos order by seq_no desc limit 1;
 --let $slave_gtid_state = `select @@gtid_binlog_state`
 --echo # slave gtid state is $slave_gtid_state

With this mismatch, gtid_slave_pos is 0-1-6 but gtid_binlog_state is 0-1-7. I wonder if using the macro sync_with_master_gtid.inc would be more suitable.
Also, I don’t understand why set debug_sync="now wait_for CA_1_processing"; and set debug_sync="now signal proceed_CA_1"; are commented out. Were these there for debugging during development and not needed anymore? Or are they actually needed but their inclusion was overlooked?

I added a check of gtid_binlog_state for server_3 to rpl.rpl_start_alter_chain_basic, and a call to sync_with_master_gtid.inc in rpl.rpl_start_alter_restart_slave. The change is here: https://github.com/MariaDB/server/commit/943989c9ef9b7d01333ad14a09547585666c9eeb (I incorrectly put 29943 instead of 29934).
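
To make the 0-1-6 versus 0-1-7 discrepancy concrete, here is a minimal Python sketch (not part of the test suite) that parses GTID strings in the domain-server-seq_no form and reports the domains where the applied position is behind a target state. The names Gtid, parse_gtid_list, and lagging_domains are hypothetical, and keeping only the highest seq_no per domain is a simplifying assumption, not how the server itself tracks state.

```python
from typing import Dict, NamedTuple, Tuple


class Gtid(NamedTuple):
    domain_id: int
    server_id: int
    seq_no: int


def parse_gtid_list(text: str) -> Dict[int, Gtid]:
    """Parse 'D-S-N[,D-S-N...]' into {domain_id: Gtid with the highest seq_no}.

    Assumption: one entry per domain is enough for this comparison.
    """
    result: Dict[int, Gtid] = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        domain_id, server_id, seq_no = (int(part) for part in item.split("-"))
        gtid = Gtid(domain_id, server_id, seq_no)
        current = result.get(domain_id)
        if current is None or gtid.seq_no > current.seq_no:
            result[domain_id] = gtid
    return result


def lagging_domains(slave_pos: str, target: str) -> Dict[int, Tuple[int, int]]:
    """Return {domain_id: (applied_seq_no, target_seq_no)} where slave_pos is behind."""
    applied = parse_gtid_list(slave_pos)
    wanted = parse_gtid_list(target)
    behind: Dict[int, Tuple[int, int]] = {}
    for domain_id, gtid in wanted.items():
        have = applied.get(domain_id)
        have_seq = have.seq_no if have is not None else 0
        if have_seq < gtid.seq_no:
            behind[domain_id] = (have_seq, gtid.seq_no)
    return behind
```

With the values from this report, lagging_domains("0-1-6", "0-1-7") flags domain 0 as one event behind, which is the state the failing select observed.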
