I was not able to reproduce this locally, but I was able to analyze what I believe is the root cause from server error logs from buildbot failures.
2024-04-23 11:19:57 6 [Note] Slave SQL thread initialized, starting replication in log 'master-bin.000005' at position 979, relay log './slave-relay-bin.000001' position: 4; GTID position '0-1-7,1-1-1'
2024-04-23 11:19:57 0 [Note] /home/buildbot/amd64-ubuntu-2310/build/sql/mariadbd: ready for connections.
Version: '10.11.8-MariaDB-log' socket: '/home/buildbot/amd64-ubuntu-2310/build/mysql-test/var/tmp/5/mysqld.2.sock' port: 16081 Source distribution
2024-04-23 11:19:57 6 [Note] Error reading relay log event: slave SQL thread was killed
2024-04-23 11:19:57 6 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000005' at position 256; GTID position '0-1-7,1-1-1', master: 127.0.0.1:16080
2024-04-23 11:19:57 5 [Note] Slave I/O thread exiting, read up to log 'master-bin.000005', position 315; GTID position 1-1-1,0-1-7, master 127.0.0.1:16080
2024-04-23 11:19:57 7 [Note] Master connection name: '' Master_info_file: 'master.info' Relay_info_file: 'relay-log.info'
2024-04-23 11:19:57 7 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='127.0.0.1', master_port='16080', master_log_file='master-bin.000005', master_log_pos='315'. New state master_host='127.0.0.1', master_port='16080', master_log_file='master-bin.000005', master_log_pos='256'.
2024-04-23 11:19:57 7 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=No
Here we see that the slave is started in GTID mode; the current GTID position is 0-1-7,1-1-1, which corresponds to old-style position 'master-bin.000005' / 979. But the slave threads are stopped almost immediately, while still receiving the initial events of the binlog (Format_description, Gtid_list, Binlog_checkpoint), before reaching the starting GTID position, and in particular before receiving the old-style position corresponding to that point. Thus the old-style position points too early, at position 256, when the test subsequently restarts the slave in non-GTID mode. This leads to re-applying old events and a duplicate key error.
I'm not sure whether this should be considered a bug or not. In the general case, we cannot set a meaningful and correct old-style position until the GTID connect has reached the starting GTID position. The "current position" could correspond to different points in the master's binlog (when the GTID position contains multiple domains), which cannot be represented in an old-style position at all. Or the corresponding position may simply be unknown, for example right after SET GLOBAL gtid_slave_pos.
Though in this particular case the position is already known (from an earlier GTID connect), so perhaps the server should hold on to this correct position, and not overwrite it with older positions while replicating Binlog_checkpoint and so on? I'm not sure.
Note that the state of not knowing, or having an incorrect, old-style position only lasts for a brief period: as soon as the GTID connect completes, the master sends a fake Gtid_list event containing the correct position. In fact, this is precisely what the test case tests.
If we decide this is "not a bug", I think the sporadic test failure can be eliminated by injecting an extra event and syncing the slave to it before switching to non-GTID mode. I will post a patch for this to the mailing list for review.
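A sketch of what that workaround could look like in mysqltest form (the table name t1 and the exact placement are hypothetical, just to illustrate the idea of syncing past the initial events before the switch):

```sql
--connection master
# Inject one extra transaction so the slave has a real event to sync to,
# past the initial Format_description/Gtid_list/Binlog_checkpoint events.
INSERT INTO t1 VALUES (100);
--save_master_pos

--connection slave
# Wait until the slave has applied the extra event; by this point the
# fake Gtid_list from the master has updated the old-style position.
--sync_with_master

# Only now switch to non-GTID replication.
--source include/stop_slave.inc
CHANGE MASTER TO master_use_gtid=no;
--source include/start_slave.inc
```

This way the old-style position recorded in master.info can no longer point at the too-early position 256 when the slave is restarted in non-GTID mode.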