[MDEV-9108] "GTID not in master's binlog" error with {ignore|do}_domain_ids Created: 2015-11-09 Updated: 2017-11-05 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.1.8, 10.1.10 |
| Fix Version/s: | 10.1 |
| Type: | Bug | Priority: | Major |
| Reporter: | Rodrigo Severo | Assignee: | Kristian Nielsen |
| Resolution: | Unresolved | Votes: | 2 |
| Labels: | None | ||
| Issue Links: |
|
| Description |
|
Let's consider a 3-master setup where each server has 2 replication channels, one to each of the other 2 servers, and where these replication channels were set up with:
After initially starting all replications: Observe that replication channel S2_R1 presents an error about a domain ID (2) that it has been explicitly told not to track at all! S2_R1 is supposed to track only domain ID 1. The solution for this issue seems to be for MariaDB, on replication channel start, to send only the GTID slave position for the domain IDs the channel should keep track of, as defined by {ignore|do}_domain_ids. |
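For reference, a channel in a setup like this would be configured with the real MariaDB 10.1 DO_DOMAIN_IDS clause of CHANGE MASTER. This is only a hedged sketch of one of the six channels; the host name and replication user are illustrative, not from the report:

```sql
-- On server S2: channel S2_R1 replicates from S1 and is meant to
-- track only domain 1 (host/user are assumptions for illustration).
CHANGE MASTER 'S2_R1' TO
  MASTER_HOST='s1.example.net',
  MASTER_USER='repl',
  MASTER_USE_GTID=slave_pos,
  DO_DOMAIN_IDS=(1);
START SLAVE 'S2_R1';
```

The bug described here is that starting such a channel still announces the slave's GTID position for all domains, not just domain 1.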
| Comments |
| Comment by Elena Stepanova [ 2015-11-10 ] |
|
Assuming for the time being that my theory in the comment to |
| Comment by Rodrigo Severo [ 2015-11-10 ] |
|
S2's binlog certainly has GTID 2-2-10, as it points to an event that was issued directly on S2. S3 also has GTID 2-2-10 in its binlog, as it got it from S2 through replication channel S3_R2. But S1 doesn't have GTID 2-2-10 in its binlog, as it never got it from S2 (observe that GTID 2-2-10 was issued on S2 after S1 was stopped). This fact isn't a problem at all.

The problem is that, because GTIDs for ALL tracked domain IDs on the slave are sent to the master on replication channel (re)start, when S2 tries to re-establish replication channel S2_R1 with S1, it will fail several times until S1 finally gets GTID 2-2-10. The point here is that in this particular situation we are dealing "only" with a transient, annoying and silly failure, as S1 will eventually get GTID 2-2-10 through replication channel S1_R2. I say it's silly because why should the eventual unavailability of a GTID related to a domain ID that should be filtered out of the channel block the channel from coming up? Observe that the problematic GTID is 2-2-10 and channel S2_R1 should deal only with domain ID 1. I would really expect that whatever the GTID position of domain ID 2 is, it won't affect the availability of a replication channel that is set to deal with domain ID 1.

Here I'm describing a transient failure, but now I will mention an even simpler and more common scenario where this replication start failure will be permanent and catastrophic. Let's consider 2 masters:

SETTING: Server_id: 1 | IP: 10.0.3.223
SETTING: Server_id: 2 | IP: 10.0.3.136

This could be used so that, having 2 different databases, each database is changed on one server while the other works as a slave. Because of a network connection problem between the servers, the replication channels on both servers stop. But only the network connection between the servers has gone down; both are still accessible by their main applications. In this case the GTID positions of both servers will be updated while the inter-server connection is down.

When the network connection between the servers is restored, both will try to re-establish their replication channels with the other server, but both will fail, as during channel restart each will ask for GTIDs for its own domain ID that the other server hasn't got and never will. This happens because during replication channel start, GTIDs for all tracked domains on the slave are sent to the master. If the slave only sent GTIDs for the domain IDs the replication channel should actually deal with, this problem wouldn't happen at all, and the replication channels in all these setups would come back up immediately without any hiccups.

And again, please observe that in the first scenario we are talking about an annoying delay in the replication channel restart, but in the second scenario we are talking about a fatal failure that actually blocks the replication channels from being restarted at all. |
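A hedged sketch of the two-master configuration being described (the server IPs are from the report; the domain IDs matching the server IDs and the channel-less single-master syntax are assumptions for illustration):

```sql
-- On server 1 (10.0.3.223), assumed to write in domain 1
-- and replicate only domain 2 from server 2:
SET GLOBAL gtid_domain_id = 1;
CHANGE MASTER TO
  MASTER_HOST='10.0.3.136',
  MASTER_USE_GTID=slave_pos,
  DO_DOMAIN_IDS=(2);

-- On server 2 (10.0.3.136), the mirror image:
SET GLOBAL gtid_domain_id = 2;
CHANGE MASTER TO
  MASTER_HOST='10.0.3.223',
  MASTER_USE_GTID=slave_pos,
  DO_DOMAIN_IDS=(1);
```

The claim in the comment is that after a network split, each server's local domain advances, so each slave's full announced position contains a GTID the other master never received.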
| Comment by Elena Stepanova [ 2015-11-11 ] |
|
Did you actually try the scenario you described in the last comment? It's a good one because, as you said, it is (almost) deterministic, so you'll get a reliable result. If you didn't try it yet, please do. If you tried it and got the failure you are describing, please send cnf files from both servers, binary logs, error logs and the output of show all slaves status \G where you observe the errors. Maybe you have encountered another bug there. Meanwhile, please consider: if it fails as you described with do_domain_ids, it should surely also fail without any do_domain_ids clauses, right? In that case, the servers would have no excuse whatsoever for not tracking both domains, so the starting GTIDs are certainly stored, and, following the same logic, they would definitely fail upon reconnect? I think the reason is here:
What will be updated during network problems is @@gtid_current_pos, @@gtid_binlog_pos, and Gtid_Slave_Pos in the slave status. But these are not the values that the slave uses while reconnecting to the master. Neither @@gtid_slave_pos, nor the mysql.gtid_slave_pos table, nor Gtid_IO_Pos will be updated. So, when the network connection is restored, the slave will connect to the master requesting the exact same position it had when it replicated the last event.

Note: While looking into it, I realized there is a good reason for confusion. We tried so hard to give meaningful names to the GTID fields and variables, but the result turned out horrible: we have the @@gtid_slave_pos variable, and we have the Gtid_Slave_Pos field in the SHOW SLAVE STATUS output, and these values have nearly nothing in common...

Now, back to your original scenario – I cannot reproduce the problem. I have no doubt that you actually observed it; it's possible that there is a race condition that I'm failing to hit, but it's also possible that it is a weird consequence of |
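The distinction drawn above can be inspected directly on a running server. These are real MariaDB GTID state variables and commands; only the idea of comparing them side by side is added here:

```sql
SELECT @@gtid_binlog_pos;   -- what this server has written to its own binlog
SELECT @@gtid_current_pos;  -- combined position: own binlog + replicated events
SELECT @@gtid_slave_pos;    -- last position applied via replication; this is
                            -- what the slave announces when it (re)connects
SHOW ALL SLAVES STATUS\G    -- per-channel Gtid_IO_Pos and Gtid_Slave_Pos fields
```

Comparing @@gtid_slave_pos with the Gtid_Slave_Pos field after a network outage should show the naming confusion described in the comment: they need not agree.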
| Comment by Rodrigo Severo [ 2015-11-11 ] |
|
When trying to implement the 6-server setup across 2 cities I described in
I will try to implement the setup I envision using gtid-ignore-duplicates as you suggested. Thanks for pointing me to this option. It might work, and it will be great if it does. But the problem I mention here is real. If you are not receiving more reports about it, I bet it's just because few people are exercising GTID-based replication that much and, it being a new concept, people facing problems are probably just leaving it alone, thinking "Am I doing something wrong?" Why do I say so? Because if people were really trying to use GTID-based replication to its full potential, I'm sure you would be hearing reports like mine a lot, starting with

And the ones that are actually trying might be giving up prematurely because of issues like the ones we are discussing. Thanks again for your help. I will return with new info after my tests with gtid-ignore-duplicates. |
| Comment by Rodrigo Severo [ 2015-11-11 ] |
|
One detail that is probably important: all the times I faced the issues described here, I had actually stopped the replication for one reason or another. On START SLAVE I got the problems mentioned here. The network connection problem was just me trying to imagine a more common situation where these issues might happen, but the network connection situation and the STOP SLAVE/START SLAVE situation don't share the "slave sending start GTIDs to master" step. Only the STOP SLAVE/START SLAVE one has this step, and that is the step where these issues happen. Sorry for the confusion. |
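The reproduction trigger described here can be written out directly; the channel name reuses S2_R1 from the earlier comments as an illustration:

```sql
-- The GTID request is only sent on an explicit channel restart,
-- not on a transparent reconnect after a network failure.
STOP SLAVE 'S2_R1';
START SLAVE 'S2_R1';          -- the slave announces its GTID position here
SHOW SLAVE 'S2_R1' STATUS\G   -- check Last_IO_Error for fatal error 1236
```

This matches Elena's earlier observation that @@gtid_slave_pos, not the displayed Gtid_Slave_Pos, is what the restart announces.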
| Comment by Elena Stepanova [ 2015-11-11 ] |
Please provide the error logs, binary logs and cnf files from both servers. If it's reproducible for you with STOP SLAVE, even better – please do reproduce with STOP SLAVE and attach the logs/configs and the SHOW ALL SLAVES STATUS output. I'm talking about the two-server setup that you described before.
This is very true of course, but you've missed my point. The problem with the two-server setup as you described it does not require trying GTID to its full potential – it should affect pretty much every user who ever tries a simple 2-master topology and ever stops at least one slave. For one, I asked before but you never answered how you currently avoid |
| Comment by Rodrigo Severo [ 2015-11-11 ] |
|
To see the catastrophic failure, do the following:

SETTING: Server_id: 1 | Domain_id: 1 | IP: 10.0.0.1
SETTING: Server_id: 2 | Domain_id: 2 | IP: 10.0.0.2
SETTING: Server_id: 3 | Domain_id: 3 | IP: 10.0.0.3
SETTING: Server_id: 4 | Domain_id: 4 | IP: 10.0.0.4

All of them should have gtid_ignore_duplicates set to OFF. After starting all replication channels and confirming that they are all up and running, watch how both of them will fail with a message of the form:

"Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 2-2-10, which is not in the master's binlog'"

I have this setup working where servers 1 and 2 are in one city and servers 3 and 4 are in another. |
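The exact channel topology between the four servers is elided in the report, so the following is only a hedged sketch of the settings named above, shown for one server (the channel name is invented for illustration):

```sql
-- On server 1 (domain 1, 10.0.0.1), per the SETTING lines above:
SET GLOBAL gtid_ignore_duplicates = OFF;
SET GLOBAL gtid_domain_id = 1;

-- One of server 1's channels, e.g. pulling only domain 2 from server 2:
CHANGE MASTER 'from_s2' TO
  MASTER_HOST='10.0.0.2',
  MASTER_USE_GTID=slave_pos,
  DO_DOMAIN_IDS=(2);
START SLAVE 'from_s2';
```

The other three servers would mirror this with their own gtid_domain_id and DO_DOMAIN_IDS lists.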
| Comment by Elena Stepanova [ 2015-11-11 ] |
|
Okay, good, so there are 4 servers after all, not just 2. |
| Comment by Rodrigo Severo [ 2015-11-11 ] |
|
I'm quite sure this isn't a side effect of

This issue is about a slave asking for a GTID of a domain ID that shouldn't be handled by a replication channel, and the master refusing to start said replication channel because the requested GTID for this not-to-be-tracked domain ID is too recent. |
| Comment by Elena Stepanova [ 2015-11-11 ] |
|
rsevero, |
| Comment by Rodrigo Severo [ 2015-11-11 ] |
|
With only 2 servers I don't see the problem. I can only reproduce it with 4 servers on the setup I detailed above. |
| Comment by Rodrigo Severo [ 2016-01-04 ] |
|
The problem with the 4 servers setup still exists on MariaDB 10.1.10. |
| Comment by Rodrigo Severo [ 2016-01-06 ] |
|
I believe there was some expectation that fixing |
| Comment by Kristian Nielsen [ 2016-01-15 ] |
|
Generally, a slave is not allowed to connect to a master using a GTID position that is not present in the master's binlog.

There are a couple of exceptions to this rule. One is that if the master has

I think the request here is for another similar exception in the case of

An implementation might be as the reporter suggests. When the slave sends |
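The proposed exception can be illustrated with the running S2_R1 example from this issue; the GTID values are made up for illustration and are not from any log in the report:

```sql
-- Suppose on S2 (values illustrative):
SELECT @@gtid_slave_pos;   -- e.g. '1-1-100,2-2-10'
-- and channel S2_R1 was configured with DO_DOMAIN_IDS=(1).
START SLAVE 'S2_R1';
-- Today: the slave announces the full position '1-1-100,2-2-10';
-- if 2-2-10 is absent from S1's binlog, the IO thread stops with
-- fatal error 1236 ("...not in the master's binlog").
-- Under the proposed exception: the slave would announce only the
-- filtered position '1-1-100', so a stale position in an ignored
-- domain could no longer block the channel from starting.
```

This is a sketch of the behaviour being requested, not of any committed implementation.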
| Comment by Elena Stepanova [ 2016-01-19 ] |
|
knielsen, do you want it to be converted into a feature request? |
| Comment by Andrei Elkin [ 2017-09-13 ] |
|
The ignore_domain_ids options could be helpful to |