MariaDB Server / MDEV-33602

rpl.rpl_gtid_stop_start At line 187: sync_with_master failed: 'select master_pos_wait('master-bin.000005', 1186, 300, '')' returned NULL indicating slave SQL thread failure

Details

    Description

      This appears to be different from the other open MDEVs on the same test case (I could not find existing instances of this particular failure).

      cr 10.6

      https://buildbot.mariadb.org/#/builders/481/builds/13235 10.6 3907345e2

      rpl.rpl_gtid_stop_start 'innodb,mix'     w11 [ fail ]
              Test ended at 2024-02-12 18:05:48
       
      CURRENT_TEST: rpl.rpl_gtid_stop_start
      analyze: sync_with_master
      mysqltest: At line 187: sync_with_master failed: 'select master_pos_wait('master-bin.000005', 1186, 300, '')' returned NULL indicating slave SQL thread failure
       
      The result from queries just before the failure was:
      < snip >
      include/wait_for_slave_to_start.inc
      connection server_1;
      connection server_2;
      SELECT * FROM t1 ORDER BY a;
      a
      1
      2
      

      11.4 example

      Attachments

        Activity

          I was not able to reproduce this locally, but I was able to analyze what I believe is the root cause from server error logs from buildbot failures.

          2024-04-23 11:19:57 6 [Note] Slave SQL thread initialized, starting replication in log 'master-bin.000005' at position 979, relay log './slave-relay-bin.000001' position: 4; GTID position '0-1-7,1-1-1'
          2024-04-23 11:19:57 0 [Note] /home/buildbot/amd64-ubuntu-2310/build/sql/mariadbd: ready for connections.
          Version: '10.11.8-MariaDB-log'  socket: '/home/buildbot/amd64-ubuntu-2310/build/mysql-test/var/tmp/5/mysqld.2.sock'  port: 16081  Source distribution
          2024-04-23 11:19:57 6 [Note] Error reading relay log event: slave SQL thread was killed
          2024-04-23 11:19:57 6 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000005' at position 256; GTID position '0-1-7,1-1-1', master: 127.0.0.1:16080
          2024-04-23 11:19:57 5 [Note] Slave I/O thread exiting, read up to log 'master-bin.000005', position 315; GTID position 1-1-1,0-1-7, master 127.0.0.1:16080
          2024-04-23 11:19:57 7 [Note] Master connection name: ''  Master_info_file: 'master.info'  Relay_info_file: 'relay-log.info'
          2024-04-23 11:19:57 7 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='127.0.0.1', master_port='16080', master_log_file='master-bin.000005', master_log_pos='315'. New state master_host='127.0.0.1', master_port='16080', master_log_file='master-bin.000005', master_log_pos='256'.
          2024-04-23 11:19:57 7 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=No
          

          Here we see the slave is started in GTID mode; the current GTID position is 0-1-7,1-1-1, which corresponds to old-style position 'master-bin.000005' / 979. But the slave threads are immediately stopped while receiving the initial parts of the binlog (Format_description, Gtid_list, Binlog_checkpoint), before reaching the starting GTID position, and in particular before receiving the old-style position corresponding to that point. Thus the old-style position points too early, at position 256, when the test subsequently restarts the slave in non-GTID mode. This leads to re-applying old events and a duplicate key error.
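          This interaction can be sketched as a small simulation (hypothetical and much simplified, not server code; event offsets are taken from the log excerpt above): the slave's old-style position is overwritten by each event it receives, so killing the SQL thread while it is still processing the binlog header events leaves the position far before the real starting point.

```python
# Simplified model of how the slave's old-style position regresses
# during an interrupted GTID connect. Not server code; offsets are
# illustrative, matching the log excerpt above.

# Header events the master sends before any transactions.
HEADER_EVENTS = [
    ("Format_description", 256),
    ("Gtid_list", 315),
    ("Binlog_checkpoint", 404),
]
TARGET_OLD_POS = 979  # old-style position matching GTID 0-1-7,1-1-1

def replay(events, stop_after=None):
    """Return the old-style position recorded by the slave.

    stop_after: kill the SQL thread after this many events,
    simulating an immediate STOP SLAVE during the GTID connect.
    """
    old_pos = 4  # position just after the binlog magic header
    for i, (_name, end_pos) in enumerate(events):
        old_pos = end_pos  # each event overwrites the old-style position
        if stop_after is not None and i + 1 >= stop_after:
            return old_pos
    # Connect completed: the master's fake Gtid_list carries the
    # correct old-style position for the starting GTID.
    return TARGET_OLD_POS

# Slave killed right after Format_description: position points at 256,
# far before the real starting point 979.
print(replay(HEADER_EVENTS, stop_after=1))  # -> 256
# Connect allowed to complete: correct position.
print(replay(HEADER_EVENTS))                # -> 979
```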

          I'm not sure if this should be considered a bug or not. In the general case, we cannot set a meaningful and correct old-style position in all cases until the GTID connect has reached the starting GTID position. The "current position" could correspond to different points in the master's binlog (in case of multiple domains in the GTID position), which cannot be represented in the old-style position at all. Or the corresponding position may simply be unknown right after SET GLOBAL gtid_slave_pos.
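          The multi-domain case can be illustrated concretely (the GTID-to-binlog mapping below is hypothetical, except for the 0-1-7 entry taken from the log above): each domain's GTID may correspond to a different point in the master's binlog, and a single (file, offset) pair cannot represent more than one of them.

```python
# Illustrative only: with two replication domains, the GTID position
# names two distinct points in the master's binlog. The mapping for
# domain 1 is made up; domain 0 matches the log excerpt above.
binlog_point_of_gtid = {
    "0-1-7": ("master-bin.000005", 979),  # domain 0
    "1-1-1": ("master-bin.000002", 512),  # domain 1 (hypothetical)
}
gtid_slave_pos = ["0-1-7", "1-1-1"]
points = {binlog_point_of_gtid[g] for g in gtid_slave_pos}
# No single old-style (file, offset) position covers both points:
print(len(points))  # -> 2
```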

          Though in this particular case, the position is already known (from an earlier GTID connect), so perhaps the server should try to hold on to this correct position, and not overwrite it with older positions while replicating Binlog_checkpoint and so on? Not sure.

          Note that the state of not knowing, or having an incorrect, old-style position only exists for a brief period: as soon as the GTID connect completes, the master sends a fake GTID list containing the correct position. In fact, this is precisely what is tested in the test case.

          If we decide this is "not a bug", I think the sporadic test failure can be eliminated by injecting an extra event and syncing the slave to it before switching to non-GTID mode. I will post a patch for this to the mailing list for review.
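          The sync step that fails here (and that the work-around adds more of) relies on master_pos_wait(). A minimal Python model of its blocking semantics (a hypothetical helper, not the server's implementation):

```python
import time

def master_pos_wait_sketch(get_slave_pos, target_pos, timeout=1.0, poll=0.01):
    """Model of the semantics the test relies on: block until the
    slave's applied old-style position reaches target_pos, returning
    None on timeout. This mirrors SELECT master_pos_wait(...)
    returning NULL when the slave never catches up, as in the
    failure message at the top of this report."""
    deadline = time.monotonic() + timeout
    while True:
        pos = get_slave_pos()
        if pos >= target_pos:
            return pos
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)

# A slave stuck at the stale position 256 never reaches 1186:
print(master_pos_wait_sketch(lambda: 256, 1186, timeout=0.05))  # -> None
# A slave already past the target returns immediately:
print(master_pos_wait_sketch(lambda: 1200, 1186))  # -> 1200
```

Syncing the slave to an extra injected event before the mode switch guarantees the old-style position has advanced past the stale header-event positions, so the subsequent non-GTID restart does not re-apply old events.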

          knielsen Kristian Nielsen added a comment

          Pushed a work-around for this to 10.5, we will see in buildbot if it solves the problem.
          Filed a separate bug MDEV-33996 for the underlying issue with temporary incorrect old-style position during GTID connect.

          knielsen Kristian Nielsen added a comment

          People

            knielsen Kristian Nielsen
            danblack Daniel Black

