[MDEV-4820] GTID strict mode is full of bugs and doesn't serve its purpose Created: 2013-07-27  Updated: 2013-08-16  Resolved: 2013-08-16

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 10.0.3
Fix Version/s: 10.0.5

Type: Bug Priority: Blocker
Reporter: Pavel Ivanov Assignee: Kristian Nielsen
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File patch.txt    

 Description   

Let's consider the following test setup (I am using code at revision 3773 of the 10.0 branch).
Start two MariaDB servers with "--gtid-strict-mode --replicate-wild-ignore-table=mysql.gtid_slave_pos". Make server 2 a slave of server 1. Execute on server 1:
create database test;
use test;
create table t (n int) engine innodb;
insert into t values (1);
After that @@global.gtid_current_pos is '0-1-3' on both servers.
Now imagine a production situation: server 2 goes down, server 1 continues to be a master and execute transactions, then at some point it is taken down for a cold backup and restored on a new machine without binlogs, but @@global.gtid_slave_pos is set to the value that @@global.gtid_current_pos had at the moment the server went down. After that the server continues to be a master.
Let's emulate this situation: stop the slave on server 2, bring down server 1, delete all master-bin.* files, bring up server 1, set @@global.gtid_slave_pos = '0-1-5', and start the slave on server 2. And what do you know, server 2 doesn't report any errors. If I execute new transactions on server 1, they are happily replicated. So server 2 skipped transactions and no one noticed. That's not how strict mode should work.
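The emulation steps above, as a rough SQL sketch (the binlog deletion happens on the filesystem while server 1 is down; server names and positions follow the setup above):

```sql
-- On server 2:
STOP SLAVE;

-- Shut down server 1, delete master-bin.* from its datadir, restart it.

-- On server 1, restore the position the old binlogs would have implied:
SET GLOBAL gtid_slave_pos = '0-1-5';

-- On server 2:
START SLAVE;
-- In strict mode one would expect an error here, since server 2's
-- position 0-1-3 is not in server 1's now-empty binlog; instead
-- replication silently resumes, skipping transactions.
```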

Let's continue the experiment. Let's say we stopped at the GTID '0-1-6'. Now
"stop slave" on server 2
"reset slave all" on server 2
shutdown server 2
delete all master-bin.* files
bring up server 2
"set @@global.gtid_slave_pos = '0-2-10'"
"change master to" on server 1 to make server 2 master
"start slave" on server 1
try to execute transactions on server 2
For some reason at this point server 1 doesn't report any errors and doesn't replicate anything from server 2. Oops. If, after advancing gtid_current_pos on server 2, we "stop slave" and "start slave" on server 1, then we start seeing the error: "connecting slave requested to start from GTID 0-1-6, which is not in the master's binlog". This is the expected behavior. Why couldn't it show this error at the very beginning, before server 2 had any events in its binlog?
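The failover steps above, sketched as SQL (the host and port in CHANGE MASTER TO are placeholders for the actual setup):

```sql
-- On server 2:
STOP SLAVE;
RESET SLAVE ALL;
-- Shut down server 2, delete its master-bin.* files, restart it, then:
SET GLOBAL gtid_slave_pos = '0-2-10';

-- On server 1, make server 2 the new master:
CHANGE MASTER TO master_host = 'server2', master_port = 3306,
                 master_use_gtid = slave_pos;
START SLAVE;
-- Observed: no error and nothing replicated; the expected
-- "not in the master's binlog" error only appears after a later
-- STOP SLAVE / START SLAVE once server 2 has binlogged events.
```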

Now moving further. Let's say we restored replication and stopped at gtid_current_pos = '0-2-11'. Now
"stop slave" on server 1
execute transaction on server 2
execute transaction on server 1
At this point server 1 has gtid_current_pos = '0-1-12' and server 2 has gtid_current_pos = '0-2-12', i.e. they have alternate futures. Now if we make server 1 the master and connect server 2 to it, server 2 will show the error "connecting slave requested to start from GTID 0-2-12, which is not in the master's binlog". This is not helpful. An alternate future is a very serious problem, and there should be an easy and clear way to detect this situation. The error message is the obvious choice for detection tools, but MariaDB doesn't distinguish this case from the "slave starts from a GTID that predates the binlogs" situation. Can this be changed? I already requested this behavior in MDEV-4478, but apparently it was either forgotten, or for some reason you decided not to implement it. If the latter, I'd like to hear why.
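The divergence described above can be reproduced with a sketch like the following (the statements are illustrative; any two independent transactions will do):

```sql
-- Both servers start at gtid_current_pos = '0-2-11'.

-- On server 1:
STOP SLAVE;

-- On server 2:
INSERT INTO t VALUES (2);   -- binlogged as GTID 0-2-12

-- On server 1:
INSERT INTO t VALUES (3);   -- binlogged as GTID 0-1-12: same domain_id
                            -- and seq_no, different server_id
-- Domain 0 now has two "alternate futures" that strict mode
-- should be able to report distinctly.
```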



 Comments   
Comment by Pavel Ivanov [ 2013-08-03 ]

I'm attaching a patch with my approach to resolving this bug. It looks like it covers all possible use cases in GTID strict mode. I couldn't figure out what the intended behavior should be for these use cases in non-strict mode, so I didn't change that. Also, I didn't check whether START SLAVE UNTIL still works properly in all cases with GTID strict mode.

Comment by Kristian Nielsen [ 2013-08-09 ]

I cannot repeat the first part. This is using 10.0-base revision
revid:igor@askmonty.org-20130806203318-esxb7kpq9kab0i97

Here is my test case:

--let $rpl_topology=1->2
--source include/rpl_init.inc
 
--connection server_2
--source include/stop_slave.inc
SET GLOBAL gtid_strict_mode= 1;
CHANGE MASTER TO master_use_gtid=slave_pos;
--source include/start_slave.inc
 
--connection server_1
SET GLOBAL gtid_strict_mode= 1;
CREATE TABLE t1 (a INT PRIMARY KEY);
INSERT INTO t1 VALUES (1);
--save_master_pos
 
--connection server_2
--sync_with_master
SELECT * FROM t1 ORDER BY a;
 
--source include/stop_slave.inc
 
--connection server_1
INSERT INTO t1 VALUES (2);
INSERT INTO t1 VALUES (3);
SET @old_gtid_pos= @@GLOBAL.gtid_current_pos;
RESET MASTER;
SET GLOBAL gtid_slave_pos= @old_gtid_pos;
 
--connection server_2
--source include/start_slave.inc
 
--connection server_1
INSERT INTO t1 VALUES (4);
--save_master_pos
 
--connection server_2
--sync_with_master
SELECT * FROM t1 ORDER BY a;
 
# Clean up.
--connection server_1
DROP TABLE t1;
 
--source include/rpl_end.inc

As expected, the slave fails to connect with the error: "[ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'The binlog on the master is missing the GTID 0-1-2 requested by the slave (even though both a prior and a subsequent sequence number does exist), and GTID strict mode is enabled', Internal MariaDB error code: 1236"

This is as expected. If the requested position is missing in the binlogs on
the master, it must match exactly with @@GLOBAL.gtid_slave_pos.
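A quick way to inspect this invariant on the master, after RESET MASTER and restoring the position (these are the standard MariaDB 10.0 GTID variables):

```sql
SELECT @@GLOBAL.gtid_slave_pos,
       @@GLOBAL.gtid_binlog_pos,
       @@GLOBAL.gtid_current_pos;
-- When the slave's requested GTID is absent from the binlog, strict
-- mode accepts the connection only if that GTID exactly matches
-- gtid_slave_pos here; any other missing position is an error.
```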

My guess is you are looking at old code. The most recent code for GTID is in
10.0-base, merges to 10.0 happen only irregularly.

Comment by Kristian Nielsen [ 2013-08-09 ]

Hm, I cannot reproduce either on rev 3773 of branch 10.0, slave gets an error message on connect.

Can you please elaborate on how to reproduce this / how the situation you describe differs from my test case?

Comment by Kristian Nielsen [ 2013-08-09 ]

For the second part of the problem: we cannot give an error when server 1
connects to server 2. By deleting the binlogs on server 2, it is effectively a
fresh server, and it is perfectly valid to start replicating from it (e.g. using a
different domain_id). But we should give an error when we receive the first
incorrect event in domain 0 (which the code currently does not); I will fix
that.
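For illustration, this is the legitimate multi-master case the connect-time check has to allow (gtid_domain_id is the real server variable; the value is an example):

```sql
-- On the fresh second master, before it starts taking writes:
SET GLOBAL gtid_domain_id = 1;
-- Its events are binlogged as 1-<server_id>-<seq_no>, so a connecting
-- slave that has no matching position in domain 0 is not necessarily
-- misconfigured, and no error can be raised at connect time.
```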

For the third part: If I understand correctly, you want the server to give
different error messages for these two cases:

  • Slave requests to start at some point G from master. Master does not have
    G, but it is itself a slave of an upstream master, and will receive G
    shortly.
  • Slave requested to start at a GTID that does not exist on the master, and
    never will (what you refer to as "alternate future").

In most cases we can determine which one it is simply by looking at the
sequence numbers, that is probably a good idea. I will try to come up with
something (but I don't consider wording of error messages urgent, so not
immediately).

However, note that both of these cases are distinct from "slave requests to
start from a GTID that has been purged". This already has a separate error
message. However, by deleting binlogs, the information needed to distinguish
this case is lost.

Comment by Pavel Ivanov [ 2013-08-10 ]

Note that the patch I've attached has a test case that should reproduce the problems.

Regarding your code: I'm not so sure that RESET MASTER is equivalent to stopping the server, deleting the binlogs, and starting it again. I'd think that some in-memory structures are not cleared.

Regarding the second part: note that the test case doesn't say anything about a different domain_id – it's about a different server_id. Also note that server 1 doesn't replicate at all when first connecting to server 2. And in strict mode server 2 can send an error to server 1 right away, because it doesn't have the GTID that server 1 used to connect.

Regarding the third part: you understood me incorrectly. By "alternate future" I don't mean the situation when the slave requested a GTID that doesn't exist on the master – that's too generic. An "alternate future" is when the master has a GTID with the same domain_id and seq_no, but a different server_id. In strict mode with a correct failover process this situation should never happen, so it must be detected to understand whether failover went wrong somewhere.
When a GTID doesn't exist on the master, it can be, as you say, that the master will receive this GTID shortly (although I don't know how the server could detect that, and this situation should never happen in strict mode with a correct failover process). It can also be that the slave's GTID is too old and the master no longer has the appropriate binlog.

Comment by Kristian Nielsen [ 2013-08-16 ]

Fix pushed to 10.0-base:

Revision: revid:knielsen@knielsen-hq.org-20130816131025-etjrvmfvupsjzq83

MDEV-4820: Empty master does not give error for slave GTID position that does not exist in the binlog

The main bug here was the following situation:

Suppose we set up a completely new master2 as an extra multi-master to an
existing slave that already has a different master1 for domain_id=0. When the
slave tries to connect to master2, master2 will not have anything that slave
requests in domain_id=0, but that is fine as master2 is supposedly meant to
serve e.g. domain_id=1. (This is MDEV-4485).

But suppose that master2 then actually starts sending events from
domain_id=0. In this case, the fix for MDEV-4485 was incomplete, and the code
would fail to give the error that the position requested by the slave in
domain_id=0 was missing from the binlogs of master2. This could lead to lost
events or completely wrong replication.

The patch for this bug fixes this issue.

In addition, it cleans up the code a bit, getting rid of the fake_gtid_hash in
the code. And the error message when slave and master have diverged due to
alternate future is clarified, as requested in the bug description.
