[MDEV-4820] GTID strict mode is full of bugs and doesn't serve its purpose Created: 2013-07-27 Updated: 2013-08-16 Resolved: 2013-08-16 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | None |
| Affects Version/s: | 10.0.3 |
| Fix Version/s: | 10.0.5 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Pavel Ivanov | Assignee: | Kristian Nielsen |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Description |
|
Let's consider the following testing setup (I use code at revision 3773 of 10.0 branch). Let's continue the experiment. Let's say we stopped at the GTID '0-1-6'. Now Now moving further. Let's say we restored replication and stopped at gtid_current_pos = '0-2-11'. Now |
| Comments |
| Comment by Pavel Ivanov [ 2013-08-03 ] |
|
I'm attaching a patch with my approach to resolving this bug. It looks like it covers all possible use cases in GTID strict mode. I couldn't figure out what would be the intended behavior in such use cases for the server in non-strict mode, so I didn't change that. Also I didn't check if START SLAVE UNTIL still works properly in all cases with GTID strict mode. |
| Comment by Kristian Nielsen [ 2013-08-09 ] |
|
I cannot repeat the first part. This is using 10.0-base revision
Here is my test case:
As expected, the slave fails to connect with the error:
"[ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'The binlog on the master is missing the GTID 0-1-2 requested by the slave (even though both a prior and a subsequent sequence number does exist), and GTID strict mode is enabled', Internal MariaDB error code: 1236"
This is as expected. If the requested position is missing in the binlogs on
My guess is you are looking at old code. The most recent code for GTID is in |
| Comment by Kristian Nielsen [ 2013-08-09 ] |
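The condition named in the error message above (the requested GTID is absent while both a prior and a subsequent sequence number exist) can be illustrated with a small sketch. This is not the server's implementation. The names `Gtid`, `parse_gtid` and `is_hole_in_binlog` are hypothetical, and comparing sequence numbers within the same domain_id is an assumption made for the illustration.

```python
from typing import Iterable, NamedTuple


class Gtid(NamedTuple):
    """A GTID in the form domain_id-server_id-seq_no, e.g. '0-1-2'."""
    domain_id: int
    server_id: int
    seq_no: int


def parse_gtid(text: str) -> Gtid:
    """Split 'D-S-N' into its three integer components."""
    domain, server, seq = (int(part) for part in text.strip().split("-"))
    return Gtid(domain, server, seq)


def is_hole_in_binlog(requested: Gtid, binlog: Iterable[Gtid]) -> bool:
    """Return True when `requested` is missing from `binlog` although the
    same domain contains both a lower and a higher seq_no.

    Hypothetical helper: it only mirrors the wording of the error message.
    """
    same_domain = [g for g in binlog if g.domain_id == requested.domain_id]
    if requested in same_domain:
        return False  # the position exists, nothing is missing
    has_prior = any(g.seq_no < requested.seq_no for g in same_domain)
    has_later = any(g.seq_no > requested.seq_no for g in same_domain)
    return has_prior and has_later
```

For example, a slave asking for 0-1-2 from a master whose binlog holds only 0-1-1 and 0-1-3 would be flagged by this check, while a request for a position beyond the end of the binlog would not.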
|
Hm, I cannot reproduce it on rev 3773 of branch 10.0 either; the slave gets an error message on connect. Can you please elaborate on how to reproduce it, or on how the situation you describe differs from my test case? |
| Comment by Kristian Nielsen [ 2013-08-09 ] |
|
For the second part of the problem: We cannot give an error when server 1
For the third part: If I understand correctly, you want the server to give
In most cases we can determine which one it is simply by looking at the
However, note that both of these cases are distinct from "slave requests to |
| Comment by Pavel Ivanov [ 2013-08-10 ] |
|
Note that the patch I've attached has a test case that should reproduce the problems.
Regarding your code: I'm not sure that RESET MASTER is equivalent to stopping the server, deleting the binlogs and starting it again. I suspect that some in-memory structures are not cleaned up.
Regarding the second part: note that the test case doesn't say anything about a different domain_id; it is about a different server_id. Also note that server 1 doesn't replicate at all when it first connects to server 2. And in strict mode server 2 can send an error to server 1 right away, because it doesn't have the GTID that server 1 uses to connect.
Regarding the third part: you understood me incorrectly. By "alternate future" I don't mean the situation where the slave requests a GTID that doesn't exist on the master; that's too generic. An "alternate future" is when the master has a GTID with the same domain_id and seq_no but a different server_id. In strict mode with a correct failover process this situation should never happen, so it must be detected in order to tell whether a failover has gone wrong somewhere. |
| Comment by Kristian Nielsen [ 2013-08-16 ] |
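The "alternate future" condition defined in the comment above (same domain_id and seq_no, different server_id) can be sketched as a simple comparison. This is only an illustration of the definition, not code from the attached patch. The function name `find_alternate_future` and the representation of the master's binlog as a list of GTID strings are assumptions.

```python
from typing import List, Optional, Tuple


def split_gtid(text: str) -> Tuple[int, int, int]:
    """Split 'domain_id-server_id-seq_no' into three integers."""
    domain, server, seq = (int(part) for part in text.strip().split("-"))
    return domain, server, seq


def find_alternate_future(requested: str, master_gtids: List[str]) -> Optional[str]:
    """Return the master's GTID that conflicts with `requested`, if any.

    A conflict here means: same domain_id and seq_no, different server_id.
    Hypothetical helper; the list of strings stands in for the master binlog.
    """
    r_domain, r_server, r_seq = split_gtid(requested)
    for entry in master_gtids:
        domain, server, seq = split_gtid(entry)
        if domain == r_domain and seq == r_seq and server != r_server:
            return entry
    return None
```

With this definition, a slave positioned at 0-1-6 connecting to a master that has 0-2-6 is reported as an alternate future, whereas a master that simply lacks sequence number 6 is not.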
|
Fix pushed to 10.0-base:
Revision: revid:knielsen@knielsen-hq.org-20130816131025-etjrvmfvupsjzq83
The main bug here was the following situation:
Suppose we set up a completely new master2 as an extra multi-master to an
But suppose that master2 then actually starts sending events from
The patch for this bug fixes this issue. In addition, it cleans up the code a bit, getting rid of the fake_gtid_hash in |