Details
Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Version: 10.0.3
Description
Consider the following test setup (I am using the code at revision 3773 of the 10.0 branch).
Start 2 MariaDB servers with "--gtid-strict-mode --replicate-wild-ignore-table=mysql.gtid_slave_pos". Make server 2 a slave of server 1. Execute on server 1:
create database test;
use test;
create table t (n int) engine innodb;
insert into t values (1);
After that @@global.gtid_current_pos is '0-1-3' on both servers.
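For readers following the numbers: a MariaDB GTID has the form domain_id-server_id-sequence_number, and a position variable may hold several of them separated by commas (one per domain). A minimal sketch of a parser for such strings, assuming well-formed input; the function name parse_gtid_pos is hypothetical and not part of MariaDB or any client library:

```python
from typing import Dict, Tuple


def parse_gtid_pos(pos: str) -> Dict[int, Tuple[int, int]]:
    """Parse a GTID position such as '0-1-3' or '0-1-3,1-2-7'.

    Returns {domain_id: (server_id, seq_no)}. Hypothetical helper for
    illustration only; it assumes the input is well formed.
    """
    result: Dict[int, Tuple[int, int]] = {}
    for item in pos.split(","):
        item = item.strip()
        if not item:
            # An empty position string means "no GTID seen yet".
            continue
        domain_id, server_id, seq_no = (int(part) for part in item.split("-"))
        result[domain_id] = (server_id, seq_no)
    return result


print(parse_gtid_pos("0-1-3"))  # {0: (1, 3)}
```

So '0-1-3' reads as domain 0, originating server 1, sequence number 3.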
Now imagine a production situation: server 2 goes down while server 1 continues to act as master and execute transactions. At some point server 1 is taken down for a cold backup and restored on a new machine without its binlogs, with @@global.gtid_slave_pos set to the value that @@global.gtid_current_pos had at the moment the server went down. After that the restored server continues to be a master.
Let's emulate this situation: stop slave on server 2, bring down server 1, delete all master-bin.* files, bring up server 1, set @@global.gtid_slave_pos = '0-1-5', start slave on server 2. Server 2 reports no errors, and new transactions executed on server 1 are happily replicated. So server 2 silently skipped the transactions between '0-1-3' and '0-1-5' and no one noticed. That's not how strict mode should work.
Let's continue the experiment. Say we stopped at GTID '0-1-6'. Now:
"stop slave" on server 2
"reset slave all" on server 2
shutdown server 2
delete all master-bin.* files
bring up server 2
"set @@global.gtid_slave_pos = '0-2-10'"
"change master to" on server 1 to make server 2 master
"start slave" on server 1
try to execute transactions on server 2
For some reason, at this point server 1 reports no errors and doesn't replicate anything from server 2. Oops. If, after advancing gtid_current_pos on server 2, we run "stop slave" and then "start slave" on server 1, we start seeing the error: "connecting slave requested to start from GTID 0-1-6, which is not in the master's binlog". This is the expected behavior. Why couldn't it show this error at the very beginning, before server 2 had any events in its binlog?
Moving further. Say we restored replication and stopped at gtid_current_pos = '0-2-11'. Now:
"stop slave" on server 1
execute transaction on server 2
execute transaction on server 1
At this point server 1 has gtid_current_pos = '0-1-12' and server 2 has gtid_current_pos = '0-2-12', i.e. they have alternate futures. Now if we make server 1 the master and connect server 2 to it, server 2 shows the error "connecting slave requested to start from GTID 0-2-12, which is not in the master's binlog". This is not helpful. An alternate future is a very serious problem, and there should be an easy and clear way to detect it. The error message is the obvious choice for detection tools, but MariaDB doesn't distinguish this case from the "slave starts from a GTID that precedes the start of the binlogs" situation. Can this be changed? I already requested this behavior in MDEV-4478, but apparently it was either forgotten or for some reason you decided not to implement it. If the latter, I'd like to hear why.
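To illustrate the distinction a detection tool would want, here is a rough sketch that compares two position strings per domain. It is not how the server decides anything; the names parse_pos and compare_positions and the verdict labels are hypothetical. It relies only on the assumption that within one domain a given sequence number should belong to a single server_id, so equal sequence numbers with different server ids indicate an alternate future. When the slave is simply behind, positions alone cannot tell divergence from missing binlogs, so that case is reported separately.

```python
from typing import Dict, Tuple

GtidPos = Dict[int, Tuple[int, int]]


def parse_pos(pos: str) -> GtidPos:
    """Parse 'domain-server-seq[,domain-server-seq...]' into a dict."""
    parsed: GtidPos = {}
    for item in filter(None, (p.strip() for p in pos.split(","))):
        domain_id, server_id, seq_no = (int(x) for x in item.split("-"))
        parsed[domain_id] = (server_id, seq_no)
    return parsed


def compare_positions(slave_pos: str, master_pos: str) -> Dict[int, str]:
    """Classify each domain of the slave position against the master's.

    Verdict labels are illustrative assumptions, not MariaDB messages.
    """
    slave, master = parse_pos(slave_pos), parse_pos(master_pos)
    verdicts: Dict[int, str] = {}
    for domain_id, (s_server, s_seq) in slave.items():
        if domain_id not in master:
            verdicts[domain_id] = "unknown-domain"
            continue
        m_server, m_seq = master[domain_id]
        if s_seq == m_seq:
            # Same sequence number from two different servers: the
            # histories forked (alternate futures).
            verdicts[domain_id] = "same" if s_server == m_server else "diverged"
        elif s_seq > m_seq:
            verdicts[domain_id] = "slave-ahead"
        else:
            # Needs the master's binlog to decide anything further.
            verdicts[domain_id] = "slave-behind"
    return verdicts


print(compare_positions("0-2-12", "0-1-12"))  # {0: 'diverged'}
```

With the positions from the last experiment the sketch reports a divergence, whereas the '0-1-6' request against a master at '0-2-10' would only come out as "slave-behind".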