[MDEV-30386] flush binary logs delete_domain_id not deleting specific non present domain_id Created: 2023-01-11 Updated: 2023-11-28 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.4.26 |
| Fix Version/s: | 10.4, 10.5, 10.6, 10.11 |
| Type: | Bug | Priority: | Major |
| Reporter: | Manuel Arostegui | Assignee: | Brandon Nesterenko |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | binlog, flush, gtid |
| Environment: | debian bullseye |
| Description |
|
We are trying to clean up the output of gtid_slave_pos so we can use GTID to move slaves around with Orchestrator without it complaining about positions not existing in the binary logs (they might be leftovers from previous masters that no longer exist in those replication chains). While cleaning them up via FLUSH BINARY LOGS DELETE_DOMAIN_ID we run into errors claiming that certain GTID domains are currently present in the binary logs, which is not the case. For this example, db1117 is the host we want to clean up, leaving it replicating only with its master's (db1195) domain id.
This host has one master (db1195) but also has log-slave-updates enabled for backup purposes.
This is its replication data:
If we scan the binlogs (and the relay log of this slave, just in case) we can see that the domain_id isn't present in any of them.
The same result on its master
Logs have been flushed/rotated just in case on both hosts. In sql/rpl_gtid.cc we can see this piece of code:
Which per the following comment seems to be getting the list of domain_ids from GTID_LIST_EVENT (https://mariadb.com/kb/en/gtid_list_event/):
Using that documentation page, I would have expected to find that domain_id present in one of the existing binlogs, but as can be seen above, it is not. Some other domain ids do get cleaned up fine on that same host:
The same happens when trying to delete the default gtid_domain_id 0, which has never been present on this system, as we set gtid_domain_id to a different value when we set up the hosts. |
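For reference, the cleanup attempts described above take this general form. This is a sketch, not the exact session from the affected hosts; domain id 171966484 is one of the stale domains discussed later in this report, and the rotation step mirrors the "flushed just in case" note above:

```sql
-- Attempt to remove a stale GTID domain from the binlog state.
-- On the affected hosts this fails with an error claiming the domain
-- is still present in the binary logs, even though scanning them
-- shows it is not.
FLUSH BINARY LOGS DELETE_DOMAIN_ID = (171966484);

-- Rotating the logs first ("just in case", as described above)
-- does not change the outcome:
FLUSH BINARY LOGS;
FLUSH BINARY LOGS DELETE_DOMAIN_ID = (171966484);
```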
| Comments |
| Comment by Kristian Nielsen [ 2023-04-16 ] |
|
Thanks, Manuel Arostegui, for an excellent write-up of the problem. This part of the code you refer to looks obviously wrong:
The test is reversed. If the domain to be deleted has multiple entries in the GTID_LIST event (two or more different server_ids), then obviously all but one will not match. So this would make it impossible to delete any domain which was served by two different masters.

Some extra information would be useful: the value of @@GTID_BINLOG_STATE, the mysqlbinlog dump of the GTID_LIST event at the start of the first binlog file on the slave, and the output of SHOW WARNINGS after the failing command, if any. I know this is an old bug report, so I understand if this is no longer possible to obtain.

Also, I wonder if there is some confusion here. DELETE_DOMAIN_ID is for removing domains from the @@GTID_BINLOG_STATE, but the report mentions deleting from the @@GTID_SLAVE_POS. The latter is done simply by setting @@GTID_SLAVE_POS to the new desired value without the unwanted domains. However, the observed behaviour still looks like a bug to me. |
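The diagnostics requested above can be gathered roughly as follows (a sketch; the binlog file name is a placeholder, and the mysqlbinlog invocation runs from the shell, shown here as a comment):

```sql
-- Current binlog state, which DELETE_DOMAIN_ID operates on:
SELECT @@GLOBAL.gtid_binlog_state;

-- Immediately after the failing FLUSH BINARY LOGS DELETE_DOMAIN_ID=(...),
-- in the same session:
SHOW WARNINGS;

-- The GTID_LIST event sits near the start of each binlog file and can be
-- dumped from the shell, e.g. (file name is a placeholder):
--   mysqlbinlog --start-position=4 db1117-bin.000001 | head -n 50
```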
| Comment by Kristian Nielsen [ 2023-04-17 ] |
|
> The test is reversed. If the domain to be deleted has multiple entries in the GTID_LIST event (two or more different server_ids), then obviously all but one will not match.

Sorry about that, I got confused. The code loops until it finds an entry that does not not-match, i.e. until it finds a matching GTID. So the test is not reversed. Still trying to understand how the original behaviour could occur.
| Comment by Manuel Arostegui [ 2023-04-17 ] |
|
Yeah, sorry about the confusion between GTID_BINLOG_STATE and GTID_SLAVE_POS. |
| Comment by Kristian Nielsen [ 2023-04-22 ] |
|
To conclude: From checking the code, it looks like this behaviour can only occur if the …

This situation represents a corruption of the database state, as the …

The gtid_binlog_state is saved between server restarts in the file …
| Comment by Manuel Arostegui [ 2023-04-25 ] |
|
Thanks for the answer. It is impossible to be 100% sure, but I am 99.9% sure we've not done any of that; it is very, very unlikely we've done so. We always remove binlogs (if we have to) via PURGE BINARY LOGS TO …

This is of course too late now, but leaving it here:
In any case, what is the recommended workaround if that had been the case? Is it just issuing RESET MASTER on the primary master and then rebuilding the whole replication topology? If so, that's pretty much impossible in most production environments. Would a RESET SLAVE ALL clean up all the weird states? That would be a lot easier to issue in production. Thanks Kristian! |
| Comment by Kristian Nielsen [ 2023-04-25 ] |
|
Ok, a number of points here, I'll try to answer all of them:

> Would a reset slave all clean up all the weird states? That would be a lot …

So there's a concrete issue here about a wrong or misleading error message …

From the given slave positions and binlog states, it looks like every …

Having 10 different domains in the replication position is a very …

You mention needing to clean up gtid_slave_pos, otherwise the slave cannot …

But if all the domain_ids were needed at one point and then later no longer …

> Maybe an implementation of a way to force …

This is already implemented. To delete from the slave position, use …

Note there is no requirement that the first binlog file be present. It's the …

> Thanks for the answer. It is impossible to be 100% sure, but I am 99.9% …

> @@gtid_binlog_state

Right. So the domain 171966484 that couldn't be deleted is not in the …

The gtid_binlog_state is basically a summary of the GTID information in the …

> In any case, what is the recommended workaround if that'd have been the …

No, that is not necessary.

If you really have the situation that domain D=171966484 is not in any …

If alternatively the situation is that the domain is already not present in …

> Is there a way to implement a solution from this cause I don't think it is …

It's not a problem to delete old binlog files manually, it will not cause a …

Hope this helps, |
| Comment by Manuel Arostegui [ 2023-05-09 ] |
|
Kristian, thanks a lot for your very detailed answer. And sorry it's taken me so long to reply; I've been very busy with some things.

> Having 10 different domains in the replication position is a very …

We introduced it years ago to try to get GTID+multisource enabled, but we ran into a very nasty bug. We do have A -> B -> C (and even more depth) in our infrastructure, so that's why there are so many IDs there. I am going to discuss this with the team.

> If you really have the situation that domain D=171966484 is not in any …

That's what actually made me file the bug. As you can see in the original post, that gtid_domain_id isn't present in any of the binlogs. In all this time those logs would have been flushed a few times already.

> If alternatively the situation is that the domain is already not present in …

I think I will try to go for this one. But also, maybe it is just better for us to completely unset gtid_domain_id in the first place and then try to clean them up entirely, to avoid running into these issues.

Thanks again for your detailed answer. |
| Comment by Manuel Arostegui [ 2023-05-09 ] |
|
Kristian, I think it would still be nice to understand what makes FLUSH BINARY LOGS DELETE_DOMAIN_ID fail in the first place with that misleading error, because this is still very unclear to me. We'd still need it to be able to set gtid_domain_id=0 and clean up all the unused values, so we can have a fully clean environment without unexpected GTID errors when handling replication via GTID.
Now, let's try to delete 171970595 which is indeed not in Gtid_list:
Per your comment earlier, it is expected that this will fail if you try to remove a domain id that is not in the Gtid_list of the first event of the binlog, right?
I suppose I can also run this on this host, which is now a master but was also a slave at some point (which will be the case for all replication topologies out there, as soon as they have a master switchover), so I'd guess this is not strictly necessary, or shouldn't be. Thoughts?
So my question is: is this still a valid bug with no workaround possible? Thanks for all your detailed explanations! |
| Comment by Kristian Nielsen [ 2023-05-09 ] |
|
Hi Manuel,

Ah, I hadn't understood that you can still reproduce this. Thanks for persevering; I definitely want to get to the bottom of this then.

Can you provide the value of @@global.gtid_binlog_state? Does it contain the domain_id 171970595 that we are trying to delete? I suspect it does not, because the value of gtid_binlog_state is what is written in the GTID_LIST event, and this is shown as [171970745-171970745-2370743389,171970577-171970577-53037843,0-171978825-1006906065] in your output, which does not contain 171970595.

Just in case this isn't clear: FLUSH BINARY LOGS DELETE_DOMAIN_ID=(D) deletes D from the gtid_binlog_state. Not from the gtid_slave_pos or anywhere else. Thus, if 171970595 is not in gtid_binlog_state, then it is correct that the DELETE_DOMAIN_ID command fails. It is obviously wrong (and very confusing) that it fails with the wrong message.

On the other hand, if 171970595 is in the gtid_binlog_state, then there is a bug that the GTID_LIST event written into the new binlog db1125-bin.000047 is missing the domain, and we need to try to understand how that happens.

So let's get the value of gtid_binlog_state corresponding to a known occurrence of this problem, so we know which of these two cases we are facing.

Again, thanks Manuel for persisting in this, and thanks for the detailed and obviously competent bug reports and information/discussions. |
| Comment by Manuel Arostegui [ 2023-05-09 ] |
|
Yeah, I have a test cluster I can run and break as we want. 171970595 isn't present indeed.
However, taking this answer: if it is correct that it fails, then what is the way to actually proceed and delete it from gtid_slave_pos, which is what is messing up replication in the first place, as it breaks with …
For us, if we want to go back to using just domain_id=0 and start playing with GTID, we'd still need to clean them all, as otherwise moving replicas around with GTID enabled ends up breaking with the above error. |
| Comment by Kristian Nielsen [ 2023-05-09 ] |
|
Or to put it more succinctly:

1. If you want to remove a domain from the gtid_slave_pos, then the correct command to use is SET GLOBAL gtid_slave_pos= <new position>. FLUSH BINARY LOGS DELETE_DOMAIN_ID cannot be used for this.

2. The real bug here might be that FLUSH BINARY LOGS DELETE_DOMAIN_ID gives the wrong error message, not that it does something wrong (to be confirmed).

- Kristian. |
| Comment by Manuel Arostegui [ 2023-05-09 ] |
|
> 1. If you want to remove a domain from the gtid_slave_pos, then the correct command to use is SET GLOBAL gtid_slave_pos= <new position>. FLUSH BINARY LOGS DELETE_DOMAIN_ID cannot be used for this.

I am going to double check the possible scenarios of this plus a non-existent domain_id and report back. Thanks!

> 2. The real bug here might be that FLUSH BINARY LOGS DELETE_DOMAIN_ID gives the wrong error message, not that it does something wrong (to be confirmed).

Yes; at the very least, if we can make the message a lot more meaningful, that'd already be a big win. Thank you again for your time. |
| Comment by Kristian Nielsen [ 2023-05-09 ] |
|
Thanks Manuel for the quick answer.

> 171970595 isn't present indeed.

Aha. So the error message is wrong. It should say that the domain 171970595 cannot be deleted from gtid_binlog_state because it is not present. I'll dig a bit more and see if I can reproduce this error locally.

> what is the way to actually be able to proceed and delete it from gtid_slave_pos

The way to delete a domain from the gtid_slave_pos is: SET GLOBAL gtid_slave_pos = "..." (e.g. in this case by specifying the old position with the to-be-deleted domains removed).

For example, say that the gtid_slave_pos is: 10-10-1000, 20-20-2000. But the current master is using domain_id 20 as the only domain, and you are getting an error that position 10-10-1000 is not present in the master binlog and want to remove it. Simply do: SET GLOBAL gtid_slave_pos = "20-20-2000";

Maybe I should write up some article that clearly explains how to use the domain_id of MariaDB global transaction id, and see if I can get it visible somehow. There is a way to think of domains that should make it relatively easy to understand what is going on. But I know the whole GTID implementation is somewhat complex, and I remember seeing multiple descriptions around on the 'net that are completely wrong wrt. domain_id.

- Kristian. |
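The worked example above can be sketched as a short SQL session (the positions 10-10-1000 and 20-20-2000 are the hypothetical values from the text; changing gtid_slave_pos requires the slave threads to be stopped):

```sql
-- Current slave position spans two domains, e.g. "10-10-1000,20-20-2000":
SELECT @@GLOBAL.gtid_slave_pos;

-- The current master only uses domain 20, and 10-10-1000 no longer
-- exists in its binlogs, so drop domain 10 from the slave position:
STOP SLAVE;
SET GLOBAL gtid_slave_pos = "20-20-2000";
START SLAVE;
```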
| Comment by Kristian Nielsen [ 2023-05-09 ] |
|
A generic SQL to delete a domain @domain from the gtid_slave_pos:
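The snippet itself did not survive this export of the issue. A sketch of one possible equivalent, which rebuilds the position from the mysql.gtid_slave_pos system table while excluding @domain (untested; assumes the standard domain_id/sub_id/server_id/seq_no columns of that table, and must be run with the slave stopped):

```sql
SET @domain = 2;  -- domain to remove (example value)

-- Latest recorded GTID per remaining domain, concatenated into a
-- new position string:
SELECT GROUP_CONCAT(CONCAT(p.domain_id, '-', p.server_id, '-', p.seq_no))
  INTO @new_pos
  FROM (SELECT domain_id, MAX(sub_id) AS max_sub
          FROM mysql.gtid_slave_pos
         WHERE domain_id <> @domain
         GROUP BY domain_id) m
  JOIN mysql.gtid_slave_pos p
    ON p.domain_id = m.domain_id AND p.sub_id = m.max_sub;

STOP SLAVE;
SET GLOBAL gtid_slave_pos = IFNULL(@new_pos, '');
START SLAVE;
```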
| Comment by Manuel Arostegui [ 2023-05-10 ] |
|
Thank you Kristian. At this point, I am very suspicious about how corrupted all this could be. I have tried the following approach with a slave:

A: Primary master with gtid_domain_id=171974662 (this runs pt-heartbeat) …

So the replication is: …

This is its replication status:
Let's try to clean up the domain_ids. Stopping the slave and capturing the GTID position from both hosts:
So they are: …

Let's clean the rest:
There is absolutely no way that the old binlogs could have been purged. At this point I am not sure anymore whether this is trustworthy or not.
| Comment by Kristian Nielsen [ 2023-05-10 ] |
|
> At this point, I am very suspicious on how corrupted all this could be

I understand this sentiment. But I think it will be useful for you to get out of this sentiment and get an understanding of what is really going on. From what I see here, there is so far just one bug: the incorrectly worded error message (which is sufficiently wrong that it confused both of us). I believe the real problem is the configuration of domain_ids, which is doing something very different from what you expect it to do.

The TL;DR of it is that by giving every server a distinct domain_id, you have configured GTID for a topology where every server is an active master that is potentially replicating to every other server. This is obviously not what is intended, and most of these potential replication streams are not set up. Therefore, the GTID behaviour of the server is different from what you expect. Let me see if I can illustrate it:

> Interestingly all come back to normal after doing [disabling GTID]

Let's start from this, as this is familiar to you. Here, there is a single replication stream. The slave points to a specific location in the master binlog files, and from there it will apply events one after the other in sequence. This matches the non-multisource topology A -> B -> Slave.

Now imagine that we delete all data on the slave and do a RESET SLAVE. We can then reconnect the slave to the master, and it will fetch all events from the start of the binlog (RESET SLAVE makes the position empty). Eventually the slave will be restored to the original state when all events are applied. Note that if binlog files were purged on the master, this will silently lose data; this is something that GTID prevents by instead giving an error when the slave requests data that is missing on the master. If you were to RESET SLAVE without deleting any data, then you would instead need to RESET MASTER at a point where slave and master are in sync.
Otherwise the slave will duplicate old events when it replicates from the start.

Next, consider GTID replication with a simple gtid_slave_pos=0-1-100. This is the same as non-GTID replication: the slave points to a specific location in the master binlog files (just after transaction 0-1-100) and applies events from there in sequence. Here also we can delete all data on the slave, set gtid_slave_pos to the empty position, and reconnect to the master. The empty position makes the slave start from the beginning of the binlog on the master and eventually restore the state by applying all events. All of this is with a single domain id.

But imagine we have multi-source replication 1 -> 3, 2 -> 3, 1 -> 4, 2 -> 4. Here server3 has two different positions, one in each of server1's and server2's binlogs. In non-GTID replication there is no way for server3 to replicate from server4 as a master, because in general there is no single position in server4's binlog that corresponds to both of server3's current positions. But GTID can do this using the domain_id. Each domain_id in the gtid_slave_pos is essentially an independent replication stream. In this case, two domain_ids will be configured, say 1 and 2. Now server3 can start replicating from two different positions in server4's binlog. And it will continue keeping track of both independent positions so it can later move back to replicating from server1 and server2 individually again.

Finally, suppose we remove server2 (and all its data) from the system (maybe a customer quit or something), leaving us with the topology 1 -> 4 -> 3. Now if we try to make server1 a slave of server3, 4 -> 3 -> 1, we get a problem. server1 is missing all of the data from the domain 2. It will connect to server3 with an empty position in domain 2, and will need to replicate all events ever generated on server2. This will not be possible if any binlog files were ever purged on server3.
To do this, we delete the domain_id=2 from the binlog state of server3: FLUSH BINARY LOGS DELETE_DOMAIN_ID=(2). We do this on the master, because we want to change the history and pretend that domain 2 never existed in the system. Then later we might want to move server4 to replicate from server3: 3 -> 4, 3 -> 1. In this case, we want to remove the domain 2 from the slave position of server 4:
You see, here DELETE_DOMAIN_ID corresponds to RESET MASTER, and SET gtid_slave_pos corresponds to RESET SLAVE in the non-GTID case. But we are doing it only in the domain 2 which we want to remove, while leaving the other domain intact. So it should be familiar to you once you start thinking of each domain as an independent replication stream.

In your case, because you configured so many domain_ids, MariaDB GTID is obliged to act as if you have a complex multi-source topology. To remove a domain, you therefore have to do it in the same way as in the example: remove it from the master binlog state at level N with DELETE_DOMAIN_ID, and at the same time remove it from the slave position at level (N+1). Note that you have to compare the gtid_binlog_state on the master with the gtid_slave_pos on the slave to understand what needs to be in sync. These changes have to be synced so the slave can connect to the master, but you can do it one level at a time, from the bottom up.

Overall, anything you can do with non-GTID replication, you can do with GTID replication using a single domain only. You can do even more advanced things with GTID and multiple domain_ids, but it is complex and you need to understand the concept of multiple independent replication streams to work effectively with this.

I wrote this up as a trial for writing some general article that explains this succinctly. It's still longer than I would like :-/. Maybe it can still be helpful, and any feedback is welcome. Also feel free to reach me on IRC #maria or Zulip for more interactive chat. |
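The two-sided cleanup described in the walkthrough can be sketched as follows (server names and domain ids follow the example in the text; the gtid_slave_pos value is illustrative, standing for the old position with the "2-..." component removed):

```sql
-- On server3 (the master at its level): pretend domain 2 never
-- existed in the binlog history.
FLUSH BINARY LOGS DELETE_DOMAIN_ID = (2);

-- On server4 (its slave, one level down): drop domain 2 from the
-- slave position as well, keeping the domain-1 position intact.
STOP SLAVE;
SET GLOBAL gtid_slave_pos = "1-1-500";  -- illustrative: old position minus domain 2
START SLAVE;
```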
| Comment by Kristian Nielsen [ 2023-05-10 ] |
|
Back to the bug that DELETE_DOMAIN_ID gives the wrong error message when the domain is not present in the gtid_binlog_state. This really is very confusing, and I'd like to get it fixed. I tried to reproduce on 10.4.26, but in my test it worked as expected:
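The test session itself did not survive this export. A sketch of the kind of repro that would be attempted, assuming a fresh server where domain 99 has never been used and so is absent from the binlog state:

```sql
-- Domain 99 has never appeared in this server's binlog:
SELECT @@GLOBAL.gtid_binlog_state;

-- Attempting to delete the absent domain: in this test it did not
-- produce the wrong error from the report, only a sensible warning.
FLUSH BINARY LOGS DELETE_DOMAIN_ID = (99);
SHOW WARNINGS;
```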
No wrong error message, and a sensible warning. And looking at the code, it seems impossible to get this error for a non-existing domain (though as we know, "impossible" is only until the "aha"-moment where the real problem is discovered):
If the domain is not present in the gtid_binlog_state, `elem` will be false and the `continue` is hit, skipping the error message that you saw.

So I'm not sure... Can you still reproduce it easily in the test environment? Could we try with a custom server build with some extra logging, or something?

- Kristian. |