[MDEV-26632] multi source replication filters breaking GTID semantic Created: 2021-09-17 Updated: 2023-12-11 Resolved: 2023-12-11 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.4.21, 10.5.12 |
| Fix Version/s: | 10.4.33, 10.5.24, 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3, 11.3.2 |
| Type: | Bug | Priority: | Major |
| Reporter: | VAROQUI Stephane | Assignee: | Kristian Nielsen |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Description |
|
How to reproduce: Issue: Fixing: Another fix would be to write the event to the binlog with an extra "filtered" flag and stream it to the replica as well, so that the position exists, and a parameter could be added at any layer of the replication tree to restore those events |
| Comments |
| Comment by Andrei Elkin [ 2021-09-17 ] |
I agree with this concept. The filtered-out info should not affect the slave state. We need a clearer policy for that. Thank you for pointing out this issue!
| Comment by Richard DEMONGEOT [ 2022-02-14 ] |
Hello, I just tested with GTID domain filtering, and it works fine. Filtered events are not written to the binlogs or to the parent's relay logs. Best regards,
| Comment by Andrei Elkin [ 2022-02-14 ] |
rdem, thanks for trying. However, this case is about no longer accounting for the filtered-out/ignored GTIDs in gtid_slave_pos.
| Comment by Kristian Nielsen [ 2023-08-30 ] |
I wrote a test case to reproduce what I think is the scenario described here. When server 2 has gtid_slave_pos=1-1-7, we stop it and CHANGE MASTER to make it a slave of server 3. But as the test case demonstrates, this scenario can work by configuring --gtid-strict-mode=0 and --gtid-ignore-duplicates=1. Let me explain why.

The server tries hard to protect the user from common mistakes with GTID in simple topologies. If a slave were allowed to connect with a GTID that never existed on the master, the slave could skip events indefinitely while searching for the missing GTID, which would probably be unexpected. That is why the server errors on the GTID 1-1-7 by default.

In this case, though, the user has set up a complex topology where filtering is used, so server data and binlogs are not identical between servers. In particular, the user wants to allow the slave to connect at a "hole" in the master's binlog, e.g. at position 1-1-7, which is the position between adjacent events 1-1-6 and 1-1-8 in the master's binlog.

Since the binlogs are not identical across the topology, this is not a "strict" GTID setup, so we need to configure --gtid-strict-mode=0. This allows the slave to use 1-1-7 to denote the "hole" between 1-1-6 and 1-1-8.

We also need to configure --gtid-ignore-duplicates=1 to allow the slave to connect at a GTID position that is not yet available in the binlog. Even though no duplicate GTIDs are in play here, the point of --gtid-ignore-duplicates is to allow GTIDs to arrive in non-strict ways; in this case, the GTID 1-1-7 can only be recognized as a "hole" once 1-1-8 is received later, and setting --gtid-ignore-duplicates=1 asks the server to allow this situation without raising an error.

So in summary, the server behaviour is correct and supports this application when configured with --gtid-strict-mode=0 and --gtid-ignore-duplicates=1.

I think it's correct to update the GTID for filtered events. If not, the gtid_slave_pos could fall so far behind that a binlog purge on the master makes the slave unable to connect, even though the slave is fully caught up.

I also don't think the ignored events should be binlogged and replicated down the topology. It is an explicit design goal of MariaDB GTID that it can tolerate holes in the GTID sequence and that events can be properly filtered without polluting the entire replication topology.

Does this make sense, and does it solve the case at hand?
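As an illustration of the "hole" position discussed in the comment above, here is a minimal Python sketch of a toy model. It is not the server's actual logic in sql/sql_repl.cc. It assumes a single-domain binlog represented as a list of GTID strings, and the names `parse_gtid`, `first_event_to_send`, `allow_holes`, and `GtidNotInBinlog` are hypothetical and exist only for this example. The `allow_holes` flag loosely stands in for the combined effect of --gtid-strict-mode=0 and --gtid-ignore-duplicates=1 as described above.

```python
from typing import List, NamedTuple


class Gtid(NamedTuple):
    domain_id: int
    server_id: int
    seq_no: int


class GtidNotInBinlog(Exception):
    """Raised when the slave position is absent and holes are not allowed."""


def parse_gtid(text: str) -> Gtid:
    # A GTID is written as domain_id-server_id-seq_no, e.g. "1-1-7".
    domain_id, server_id, seq_no = (int(part) for part in text.split("-"))
    return Gtid(domain_id, server_id, seq_no)


def first_event_to_send(binlog: List[str], slave_pos: str, allow_holes: bool) -> int:
    """Return the index of the first binlog event the slave still needs.

    Toy model only: the binlog is assumed to contain a single domain.
    """
    pos = parse_gtid(slave_pos)
    events = [parse_gtid(g) for g in binlog]

    # Normal case: the slave position exists in the master's binlog.
    for index, event in enumerate(events):
        if event == pos:
            return index + 1

    # The position was filtered out upstream, so the master never logged it.
    if not allow_holes:
        raise GtidNotInBinlog(slave_pos)

    # Treat the position as a hole: resume at the first later event.
    for index, event in enumerate(events):
        if event.domain_id == pos.domain_id and event.seq_no > pos.seq_no:
            return index

    # Nothing later has been logged yet; the slave waits at the end.
    return len(events)
```

In this model, a master binlog holding 1-1-6 and 1-1-8 rejects a slave at 1-1-7 when holes are not allowed, and resumes it at 1-1-8 when they are.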
| Comment by VAROQUI Stephane [ 2023-08-31 ] |
Your test case does not cover the scenario we are after: Source A, server 1 -> source B, server 2 + filter -> source B, server 3 + filter. We stop replication on server 2 at a filtered event and elect server 3 to replace server 2. The question is: can a START SLAVE on a named source connect successfully, simply taking the last event that matches its domain vector in the leader's GTID? ignore-duplicates scares us in such a scenario, as it could lead to some writes not being applied on the old leader. But if gtid-strict-mode=0 enables this named replication to connect, could it work?
| Comment by Richard DEMONGEOT [ 2023-08-31 ] |
Hello Kristian Nielsen, thanks for your time. After verifying, my setup is:
Primary cluster: srv1 and srv2 (GTID domain 11). Replicas connected to this server are named: C1
Second cluster: srv3 and srv4
On cluster 2, I have:
There are no filters on the C2 named flow.
For now, the setup is: srv2 <--
I plan to change gtid_ignore_duplicates to ON and test failovers between srv3 and srv4 again.
| Comment by Kristian Nielsen [ 2023-08-31 ] |
stephane@skysql.com, my test is intended to cover exactly the scenario you describe.

rdem, this is the setup I am trying to cover; I just omitted srv2 as it is not involved in the failover. Using named slave->master connections should not matter, I think. The test uses domain_id=1 for srv1 and domain_id=0 for srv3/srv4.

My understanding is that the issue is with CHANGE MASTER on srv3 to replicate from srv4? In my test, GTID 1-1-7 is filtered. So srv4 has 1-1-6,1-1-8 in its binlog; it is missing GTID 1-1-7. srv3 has gtid_slave_pos="1-1-7", and it gets this error:

If this is not the error you are describing, let me know which error it is. In sql/sql_repl.cc, there is code to disable exactly this error:

That is why setting --gtid-ignore-duplicates=1 is needed. With this setting, your scenario is valid and should work.

The errors are only there to help users with an incorrect domain_id configuration. When domain_id is configured correctly, --gtid-ignore-duplicates=1 should not be scary and should not lead to events being lost. It only ignores events that have the same domain_id but a smaller seq_no than the previous event.

To explain the wrong configuration that the errors are there to prevent, imagine a user with your setup who did not configure different domain IDs (maybe after an upgrade from 5.5 to 10.0). The events from srv1 and srv3 will duplicate each other's seq_no, e.g.: 0-1-10, 0-1-11, 0-3-9, 0-3-12, 0-1-12, 0-3-13, ... Now imagine that srv3 and srv4 filter out event 0-1-12. There is no way for srv3 and srv4 to know whether 0-1-12 should come before or after 0-3-12. Therefore the server code plays it safe and throws an error.

Setting --gtid-ignore-duplicates=1 means the user did configure domains correctly, and sequence numbers will always be strictly increasing within each domain_id. Then this problem cannot occur, and the error can be safely silenced.

I'm not sure this is documented anywhere outside the server source code, so your questions/concerns are very valid. Also, I'm not sure whether --gtid-ignore-duplicates will allow connecting at a missing GTID if there is a switchover from srv1 to srv2 at the same time (e.g. if 1-1-7 is followed by 1-4-8 in my test, not by 1-1-8). If not, that may be a bug that should be fixed.
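The ordering ambiguity described above can be sketched in Python. This is a simplified illustration, not server code: the function name `out_of_order_gtids` is hypothetical, and it only checks the property named in the comment, namely that seq_no is strictly increasing within each domain_id.

```python
from typing import Dict, List


def out_of_order_gtids(stream: List[str]) -> List[str]:
    """Return the GTIDs whose seq_no does not exceed the highest
    seq_no already seen in the same domain_id."""
    last_seq: Dict[int, int] = {}
    violations: List[str] = []
    for text in stream:
        # A GTID is written as domain_id-server_id-seq_no.
        domain_id, _server_id, seq_no = (int(part) for part in text.split("-"))
        if domain_id in last_seq and seq_no <= last_seq[domain_id]:
            # Ordering relative to earlier events in this domain is ambiguous.
            violations.append(text)
        else:
            last_seq[domain_id] = seq_no
    return violations
```

For the shared-domain stream from the comment (0-1-10, 0-1-11, 0-3-9, 0-3-12, 0-1-12, 0-3-13), this flags 0-3-9 and 0-1-12. If srv1 instead writes in its own domain (1-1-10, 1-1-11, 0-3-9, 0-3-12, 1-1-12, 0-3-13), nothing is flagged, which matches the claim that correctly configured domains keep sequence numbers strictly increasing per domain_id.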
| Comment by Richard DEMONGEOT [ 2023-08-31 ] |
Hello, Yes, the scenario is the same. I'll update you soon. Regards,
| Comment by Kristian Nielsen [ 2023-10-25 ] |
I made a patch that allows the described scenario to work with --gtid-strict-mode enabled (as long as --gtid-ignore-duplicates is also enabled): https://github.com/MariaDB/server/commits/knielsen_mdev26632 Note that the scenario already works as described without the patch when --gtid-strict-mode=0. The patch just makes it possible to run with --gtid-strict-mode=1 to help avoid an incorrect GTID sequence. I also wrote some additional documentation of --gtid-ignore-duplicates for the KB, given below for reference. - Kristian.
| Comment by Kristian Nielsen [ 2023-12-11 ] |
I have pushed a test case to 10.4 that demonstrates this kind of setup and verifies that it is working (with --gtid-strict-mode=0 and --gtid-ignore-duplicates=1). There appears to be no consensus to change --gtid-strict-mode=1 to allow holes in the binlog stream (due to filtering), even with --gtid-ignore-duplicates=1, so this change is left out for now. So there are no functional changes: this already works in the server with suitable configuration, and only a test case was pushed to make sure it keeps working. |