[MDEV-9107] GTID Slave Pos of untrack domain ids being updated Created: 2015-11-09 Updated: 2018-07-24 Resolved: 2018-07-21 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.1.8, 10.1.10 |
| Fix Version/s: | 10.1.35 |
| Type: | Bug | Priority: | Major |
| Reporter: | Rodrigo Severo | Assignee: | Sachin Setiya (Inactive) |
| Resolution: | Won't Fix | Votes: | 1 |
| Labels: | galera | ||
| Description |
|
Let's consider a 3-master setup where each server has 2 replication channels, one to each of the other 2 servers, and these replication channels were set up with:
After initially starting all replications: Observe that the GTIDs from steps 2 and 4 are different. Replication channel S1_R3 updated the GTID Slave Pos of domain ID 2 despite having been configured to track only domain ID 3! When replication channel S1_R2 is brought back online, the changes that occurred in step 3 will be lost on server 1. The solution to this issue seems to be to make each replication channel thread update only the GTID Slave Pos for the domain IDs it should track, as defined by {ignore|do}_domain_ids. |
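The setup commands from the original report were not preserved in this export; below is a minimal sketch of what such per-channel configuration looks like with MariaDB multi-source replication. Host names, credentials, and the domain-id assignments are placeholders assumed from the description.

```sql
-- On server 1, hypothetical channel pulling only domain 2 from server 2
CHANGE MASTER 'S1_R2' TO
  MASTER_HOST = 'server2.example.com',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'secret',
  MASTER_USE_GTID = slave_pos,
  DO_DOMAIN_IDS = (2);

-- ...and a second channel pulling only domain 3 from server 3
CHANGE MASTER 'S1_R3' TO
  MASTER_HOST = 'server3.example.com',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'secret',
  MASTER_USE_GTID = slave_pos,
  DO_DOMAIN_IDS = (3);

START SLAVE 'S1_R2';
START SLAVE 'S1_R3';
```

Servers 2 and 3 would each carry the analogous pair of channels for the other two domain IDs.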
| Comments |
| Comment by Elena Stepanova [ 2015-11-10 ] | ||||||||||||||||||||||||||||||
|
I think there might be some confusion here.
And slave has this:
And it has been going on this way for a while. However, this is just my understanding which can be wrong. I will assign it to nirbhay_c to confirm (or object); and in any case, this point, executed vs tracked, should be clearly explained in the documentation. Besides, while looking into this, I encountered the problem described in | ||||||||||||||||||||||||||||||
| Comment by Rodrigo Severo [ 2015-11-10 ] | ||||||||||||||||||||||||||||||
|
Elena, first of all thanks for taking the time to deal with this issue and for sharing your thoughts. I see this can get much more abstract than I was expecting, but you are right: there are more points of view on this issue than I initially thought. About the scenario you proposed I have a few observations, as it's surely simpler because it involves fewer servers and far fewer replication channels, but:
Alternative commands: Now our user starts to also replicate domain id 2 on the slave and then issues the following command: On the master everything will be OK; on the slave we will have a duplicate entry error. But more importantly to me: if {do|ignore}_domain_ids just sets the domain IDs as unexecuted and unreplicated but keeps them tracked, I can't see how to implement something like the scenario I'm trying to deal with, where I have 3 masters, each with 2 replication channels to the other 2 masters. It's not a question of which behaviour is more desirable. It becomes impossible. Or am I missing something? | ||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2015-11-10 ] | ||||||||||||||||||||||||||||||
|
First of all, regarding
The scenario was not literal of course, it was schematic just to make a point. But certainly, all kinds of playing with replication topology and settings, apart from using simple default ones, assume that the operators know what they are doing, it's their responsibility to keep configuration consistent. If we don't assume that, forget local domain IDs, your whole replication setup is not viable. What if we start populating the same table on S1, S2, S3 with conflicting data? Obviously, you expect it will never be the case – and same goes for using several domain IDs on the same server.
Okay, maybe it was too schematic. Let me create a legend for it. So, on S1 you have the production data updates under default domain ID 1, and analytics data is updated under domain ID 100. It has been working that way for quite some time. One shiny weekend you decide to go green and not to waste electricity on a separate server just so the backoffice could pull tiny data. You still don't want them to connect to S1, as it's production, but you decide it should be fine to have their data on S2 – after all, it's a backup server, nobody else goes there. Disclaimer: once again, until Nirbhay confirms it, this is just my understanding of the design of do_domain_ids; it can easily be wrong. But in the unlikely scenario that you do have all binary logs, or that you actually know from which point you want to start, you can always set @@gtid_slave_pos to the desired value, and it will go back to that point to replicate from it.
The problem is, I don't really know what exactly you are trying to achieve. I understand your replication topology, but I don't know why you are setting do_domain_ids. If you want to avoid the same event bouncing back and forth between the servers, setting gtid-ignore-duplicates seems to be a much easier way, but you are not using it, so I assume there are some other considerations. If you explain what they are, maybe we could come up with some ideas... | ||||||||||||||||||||||||||||||
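For reference, the alternative Elena mentions is the server-wide gtid_ignore_duplicates option; a minimal sketch, assuming each master in the ring writes under its own distinct gtid_domain_id:

```sql
-- On every server in the multi-master ring:
SET GLOBAL gtid_ignore_duplicates = ON;

-- or, persistently, in my.cnf:
--   [mysqld]
--   gtid-ignore-duplicates = 1
```

With this enabled, an event that arrives over more than one replication path is applied only once, identified by its GTID.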
| Comment by Rodrigo Severo [ 2015-11-10 ] | ||||||||||||||||||||||||||||||
|
Ok, you got me, as what you just described looks like a real scenario. First let me explain what I'm trying to implement: That would be a 6-server setup: three in one city and three in another. All of them working as masters. The three servers in each city would have 2 replication channels to the other 2, so changes done on any of them would be replicated with minimum delay to the others in the same city. One server in each city would have an extra replication channel to one server in the other city to get all changes done in the other city. This setup would:
About the scenario you presented: If {do|ignore}_domain_ids were to work as I understand they should - making each replication channel actually care only about the domain ids specified, i.e., asking for GTIDs, replicating, executing and tracking only the domain ids specified - then to start the replication of domain id 100 on S2 you would only have to find on S1 the current GTID for domain id 100 and set that value in the GTID Slave Pos variable of server S2. After that you would have exactly the same result and behaviour you mentioned. Obviously, for this particular situation a much more elegant solution would be to implement a master_pos option for CHANGE MASTER's MASTER_USE_GTID definition. | ||||||||||||||||||||||||||||||
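Under the behaviour Rodrigo describes, bootstrapping domain 100 on S2 would amount to splicing that domain's position into gtid_slave_pos. A hedged sketch with illustrative GTID values (gtid_slave_pos can only be changed while all slave threads are stopped):

```sql
-- On S1: read the current binlog position, which includes domain 100
SELECT @@gtid_binlog_pos;

-- On S2: splice the domain-100 GTID (illustrative value) into the
-- slave position, then restart the channels
STOP ALL SLAVES;
SET GLOBAL gtid_slave_pos = '1-2-4711,100-1-1234';
START ALL SLAVES;
```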
| Comment by Elena Stepanova [ 2015-11-10 ] | ||||||||||||||||||||||||||||||
Surely, and the same the other way round – if you want to start replication not from the current position, but from an earlier position, you would only have to set that value in the GTID Slave Pos variable; so, I don't see any problem with the current design (again, if it is a design, we are still not sure). Back to what you are trying to achieve. I still don't understand why you need do_domain_ids for this. Did you try to do the same without them (with gtid-ignore-duplicates of course)? If you did, what was the problem you encountered that made you switch to do_domain_ids? BTW you didn't mention it, but I assume you do actually care about splitting the load so that there are no conflicting concurrent updates on the servers, right? | ||||||||||||||||||||||||||||||
| Comment by Rodrigo Severo [ 2015-11-11 ] | ||||||||||||||||||||||||||||||
|
Using gtid-ignore-duplicates I could prevent being affected by the 3 issues I reported: | ||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2015-11-11 ] | ||||||||||||||||||||||||||||||
|
Okay, good. So, the remaining question in this issue ( | ||||||||||||||||||||||||||||||
| Comment by Rodrigo Severo [ 2016-01-04 ] | ||||||||||||||||||||||||||||||
|
Sorry, I had my tests wrong: I had "log_slave_updates" turned off. I have just retested this issue with "log_slave_updates" on, as it should be, and the issue still exists on MariaDB 10.1.10. This issue should NOT be closed. To clarify why this is a problem that should be solved, consider the situation described in the first report of this bug. If, after performing all the steps mentioned in the first report, I restart replication channel S1_R2, server S1 won't have the changes done in step 3 above. This is not acceptable. | ||||||||||||||||||||||||||||||
| Comment by Rodrigo Severo [ 2016-01-06 ] | ||||||||||||||||||||||||||||||
|
I believe there was some expectation that fixing | ||||||||||||||||||||||||||||||
| Comment by Esa Korhonen [ 2018-04-03 ] | ||||||||||||||||||||||||||||||
|
I have now noticed this bug as well, although in a simpler setting. All it needs is a master server which changes its gtid_domain_id, and a slave which is only replicating the old domain (via the DO_DOMAIN_IDS setting). The slave will update its gtid_slave_pos to include the new domain (5), even though in reality it does not apply the events from the new domain. On the slave: @@gtid_binlog_pos: 0-3001-7139. This means that even gtid_current_pos cannot be trusted to be correct. gtid_binlog_pos and gtid_binlog_state do seem to be correct, but these require log_slave_updates. This has implications for the failover functionality in MaxScale. Server version: 10.2.6 | ||||||||||||||||||||||||||||||
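A rough reproduction sketch of the setting described above. Domain numbers are taken from the comment; everything else is assumed, and CHANGE MASTER requires the slave threads to be stopped.

```sql
-- On the slave: replicate only the original domain 0
STOP SLAVE;
CHANGE MASTER TO MASTER_USE_GTID = slave_pos, DO_DOMAIN_IDS = (0);
START SLAVE;

-- On the master: switch to a new domain and write some events
SET GLOBAL gtid_domain_id = 5;

-- Back on the slave, compare the positions; per this report,
-- gtid_slave_pos gains a domain-5 entry even though no domain-5
-- events were applied
SELECT @@gtid_slave_pos, @@gtid_binlog_pos, @@gtid_current_pos;
```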
| Comment by Sachin Setiya (Inactive) [ 2018-07-20 ] | ||||||||||||||||||||||||||||||
|
Hi esa.korhonen! In your case you can just update gtid_slave_pos so that the slave gets all the events from the master. Let's say the master added 3 events in the new domain id X, and the old domain id was Y. You can simply set gtid_slave_pos="Y-server_id-seq_no", so it will forget all tracking of domain id X, and you will get all the events. | ||||||||||||||||||||||||||||||
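Sachin's workaround, spelled out as statements. The position value is illustrative (reusing the domain-0 binlog position from Esa's output), and gtid_slave_pos can only be set while the slave threads are stopped.

```sql
STOP SLAVE;
-- Overwrite the whole value with only the old domain's position;
-- the spurious entry for the new domain is thereby dropped
SET GLOBAL gtid_slave_pos = '0-3001-7139';
START SLAVE;
```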
| Comment by Sachin Setiya (Inactive) [ 2018-07-21 ] | ||||||||||||||||||||||||||||||
|
Hi rsevero! According to the documentation, do_domain_ids works as expected, so I am closing this issue as Won't Fix. | ||||||||||||||||||||||||||||||
| Comment by Sachin Setiya (Inactive) [ 2018-07-21 ] | ||||||||||||||||||||||||||||||
|
Test case patch mdev-9107.diff |