Global transaction ID
(MDEV-26)
|
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Technical task | Priority: | Minor |
| Reporter: | Elena Stepanova | Assignee: | Kristian Nielsen |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Description |
|
The provided test case
The slave attempts to start from the 4th event. Depending on the nature of the events and the exact number of the "few" events in the second round, the result is either a replication failure, a "fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-1-3, which is not in the master's binlog'", or the first events in the new master binlog being silently ignored.
|
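As an illustration of the failure modes in the description, here is a minimal Python sketch. It assumes the usual MariaDB GTID layout of domain ID, server ID, and sequence number (so 0-1-3 is sequence number 3 from server 1 in domain 0) and models the master's binlog as a plain list. `Gtid`, `parse_gtid`, and `events_after` are hypothetical names for this sketch, not server code.

```python
from typing import List, NamedTuple


class Gtid(NamedTuple):
    domain_id: int
    server_id: int
    seq_no: int


def parse_gtid(text: str) -> Gtid:
    """Parse a single GTID such as '0-1-3' into its three numeric parts."""
    domain, server, seq = (int(part) for part in text.strip().split("-"))
    return Gtid(domain, server, seq)


def events_after(binlog: List[Gtid], start: Gtid) -> List[Gtid]:
    """Toy model of the master's lookup: return the events that follow `start`.

    Raises LookupError when `start` is not in the binlog; this stands in for
    error 1236 from the description and is a simplification of the server.
    """
    if start not in binlog:
        raise LookupError(
            "connecting slave requested to start from GTID "
            f"{start.domain_id}-{start.server_id}-{start.seq_no}, "
            "which is not in the master's binlog"
        )
    return binlog[binlog.index(start) + 1:]
```

In this model, if the new master binlog holds five events, a slave still positioned at 0-1-3 receives only 0-1-4 and 0-1-5, so the first three new events are skipped without any error. If the new binlog holds only two events, the lookup fails, which corresponds to error 1236.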
| Comments |
| Comment by Kristian Nielsen [ 2013-03-24 ] |
|
To reset the GTID state, one currently must do CHANGE MASTER TO MASTER_GTID_POS=''
I agree that it is natural to think that RESET SLAVE [ALL] would reset the
Maybe we could implement RESET ALL SLAVES that would also reset the GTID
In any case, I very much welcome suggestions for improving this. What do you |
| Comment by Elena Stepanova [ 2013-03-24 ] |
|
I'm already thinking about it, as I found out that in the case of multi-source replication it didn't work as I expected, either (or maybe the multi-source variant just hasn't been implemented yet?) |
| Comment by Elena Stepanova [ 2013-03-25 ] |
|
I'm changing it to a task for now since it's clearly not a straightforward bug but a separate topic that requires discussion and consideration. I think it should be convenient enough to have it inside a JIRA issue since it's public and is easier to watch than e-mails. |
| Comment by Elena Stepanova [ 2013-03-26 ] |
|
So, I experimented a bit, trying to abstract myself from implementation details and imagine possible user expectations. And, while I understand the technical challenge and reasoning, something feels inconsistent about CHANGE MASTER / RESET SLAVE behavior in regard to GTID. Here is some initial contemplation on the subject, no good suggestions yet.
Part 1
I am User1, whose setup never involves multi-source, it's just plain master=>slave. Of course, you never know, maybe I'll need to switch them one day and want to be prepared for it, or I just like new cool things – anyway, I decide to use GTID.
I start a fresh new pair and configure the slave
For me master_gtid_pos is a parameter which defines the replication position – same way master_log_pos and master_log_file did before, so it's quite natural to have it in CHANGE MASTER (actually I don't know why I should provide it – I don't have to set default values of master_log_file/pos, but maybe it's because I need to indicate I want to use GTID now).
I start replication, it goes on for a while, then something bad happens. I still have all master binlogs, so it's not a big deal, I can start over; I've been there. RESET SLAVE is supposed to do just that, it is defined as
The problem is, I might not even notice that RESET SLAVE didn't work, so I won't start looking for alternative solutions.
create table t1 (i int);
Slave synchronized with master, so it also has t1.
So, sooner or later when I detect the problem, I start googling and find out that instead of RESET SLAVE I apparently need to do
That's weird, RESET SLAVE [ALL] is perceived as a reverse command for CHANGE MASTER; so if it's okay to set master_gtid_pos via CHANGE MASTER (and we already decided that it is), then it should be reverted by RESET SLAVE...
Now, back to our side for a minute: we have clearly changed the semantics of RESET SLAVE: earlier it would make the slave forget the position, now (with GTID) it doesn't. But what does it do, then?
I mean, yes, it still resets master_log_pos and master_log_file, but what is the meaning of it, apart from desynchronizing the remaining GTID position and the master log position?
Part 2
It's basically the same as the story for User1, only here I didn't do anything bad, I just at some point decided to move my master server to another host. Slave is fully synchronized, backups are in place, so I just stop replication, shut down master, move the data files (but not binlogs, I don't need them) to the new host, start master – effectively, it's the same as RESET MASTER. I do RESET SLAVE ALL since I need to forget both the position and connection parameters, set up replication again, start slave...
Part 3
As a User3, I want to create a multi-source setup.
and a disaster happens, my data gets all messed up, because m1 has started from the beginning. One has to learn the hard way, so I fix the data, restart m1, configure m2 with master_gtid_pos=auto (it should work, right?).
Then I become User1 or User2 in regard to one of my slaves. Let's say I want to make m2 start from the beginning. How do I do that?
But even after I figured out RESET SLAVE is not my command anymore, how do I actually do it?
So, unlike for User1, for User3 it doesn't look natural at all to set master_gtid_pos via the CHANGE MASTER command, since it's not a local slave parameter. Instead, User3 needs a way to tell one slave to start from a particular position without affecting other slaves... But it doesn't seem possible with GTID, does it? |
| Comment by Kristian Nielsen [ 2013-03-27 ] |
|
> So, I experimented a bit, trying to abstract myself from implementation
Excellent analysis! It helped me a lot to get a better overview of where we
> I start a fresh new pair and configure the slave
Right.
> For me master_gtid_pos is a parameter which defines the replication position
Yes, it is to indicate using GTID. Actually, you do have to set default values of master_log_file/pos in normal
It is quite deep in the design that GTID state is a global property of the
Now with your analysis, I am thinking that I did this incorrectly with CHANGE
This makes it clear that GTID state is global on the server, separate from any
What do you think? I now understand that this is how I meant things to work,
> RESET SLAVE is supposed to do just that, it is defined as
I just read the documentation, indeed that is what it says. But it's rubbish,
But there is clearly a bug here! RESET SLAVE should remove Using_Gtid, it does
Now, if user does RESET SLAVE and then START SLAVE, things will
> I stopped slave, dropped t1, ran RESET SLAVE, and started it again, as I'd always done before.
Right, this was a bug, fixed now. Now, replication will start without using GTID, from the first binlog file on
I won't say this is good behaviour, but it at least seems consistent with how
> But if master continues with a different table
Yeah. I would prefer giving an error in case no position specified, but that
At least, if we can educate the user that GTID state is set separately with
> CHANGE MASTER TO master_gtid_pos = '';
Yes, it is weird. Just as Gtid_Pos in SHOW ALL SLAVES STATUS is weird, because
Let me hear your opinion on CHANGE GTID / SHOW GTID STATUS / MASTER_USE_GTID,
> Now, back to our side for a minute: we have clearly changed semantics of
With the above bug fixed, now it sets also Using_Gtid=0.
> It's basically the same as the story for User1, only here I didn't do
With the above bug fixed, things should work, but you will no longer be using
If you add MASTER_GTID_POS=AUTO to the CHANGE MASTER command, you should get
Once you see the error and issue CHANGE MASTER TO MASTER_GTID_POS='' (or
The "recommended" way to do the above would be to copy the binlog files along
Does that sound ok? Any suggestions for improvement?
> As a User3, I want to create a multi-source setup.
You do not need to specify master_gtid_pos='' in the second CHANGE
CHANGE GTID TO '';
> One has to learn the hard way, so I fix the data, restart m1, configure m2 with master_gtid_pos=auto (it should work, right?).
Yes.
> Then I become User1 or User2 in regard to one of my slaves. Let's say I want to make m2 start from the beginning. How do I do that?
First, to use multi-source with GTID, you have to set up the two different
Then you need to get the current GTID state, using SHOW ALL SLAVES STATUS
Now you want to start from the beginning of domain 2 (the domain of m2). So CHANGE MASTER TO MASTER_GTID_POS="1-10-100" (or CHANGE GTID TO "1-10-100").
Alternatively, you can start m2 slave from the start of the m2 binlogs,
CHANGE MASTER 'm2' TO master_log_file='', master_log_pos=0;
Then it will download the correct gtid position and update it
> It's already sad, but will be even sadder if I have 10 sources, or 20...
Yes, perhaps a bit sad. I did at one point consider that MASTER_GTID_POS would
Hm, a lot longer reply than I intended. But hopefully we are getting closer to |
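The position edit described in the comment above can be sketched in Python, assuming a GTID position is a comma-separated list with one domain-server-sequence entry per replication domain. The helper names and the `2-20-500` entry for m2 are made up for illustration; only `1-10-100` comes from the comment.

```python
from typing import Dict, Tuple


def parse_gtid_pos(pos: str) -> Dict[int, Tuple[int, int]]:
    """Map domain_id -> (server_id, seq_no) for a position like '1-10-100,2-20-500'."""
    state: Dict[int, Tuple[int, int]] = {}
    for item in (p.strip() for p in pos.split(",")):
        if not item:
            continue  # an empty position string means "no domains seen yet"
        domain, server, seq = (int(x) for x in item.split("-"))
        state[domain] = (server, seq)
    return state


def format_gtid_pos(state: Dict[int, Tuple[int, int]]) -> str:
    """Inverse of parse_gtid_pos, with domains in ascending order."""
    return ",".join(f"{d}-{s}-{n}" for d, (s, n) in sorted(state.items()))


def drop_domain(pos: str, domain_id: int) -> str:
    """Return `pos` without the entry for `domain_id`.

    Removing a domain from the position is what "start that domain from the
    beginning" amounts to in this model; the other domains keep their place.
    """
    state = parse_gtid_pos(pos)
    state.pop(domain_id, None)
    return format_gtid_pos(state)
```

For example, `drop_domain("1-10-100,2-20-500", 2)` yields `"1-10-100"`, the value used in the statement above, while the position of domain 1 is left untouched.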
| Comment by Kristian Nielsen [ 2013-03-27 ] |
|
> Of course this is untested, but it should work, I will add a test case for
And of course this did not work. I'm fixing right now.
|
| Comment by Kristian Nielsen [ 2013-03-27 ] |
|
> > Of course this is untested, but it should work, I will add a test case for
> And of course this did not work. I'm fixing right now.
I pushed a fix for this. Test case at the end of rpl_gtid_startpos.test. |
| Comment by Elena Stepanova [ 2013-03-31 ] |
|
>> Do you think it will be possible to explain this to
I have no doubt that it will be possible to explain everything to users who are planning to run complicated configurations or workflows (switching servers on a regular basis, etc.). I'm more concerned about the part of the user base who run simple straightforward replication, and the most they might do is to promote the slave as a new master in case of a crash. I expect it to be the majority, and want to be sure that we don't make their life harder, and even more so that we don't put them in a situation where they are likely to make a critical mistake just because they do stuff as they used to, while we changed the way things work. I expect this category of users won't read deep into the GTID documentation, exactly because they don't need the complicated setup; they are likely to follow instructions similar to 'First steps' or 'Quick setup'. If we manage to eventually explain the important things in a few words, then we should be fine. I know that so far we are not quite there yet, because even though I'm trying to understand things, I keep making mistakes which could have been fatal for a production environment.
>> Now with your analysis, I am thinking that I did this incorrectly with CHANGE
Do we really need the new syntax? I'd think, if GTID position is a global value, we could just make it a global dynamic variable. Then, SHOW GTID STATUS would also not be needed, since it would only return a single value – we can just as well do SHOW VARIABLES or SELECT @@gtid_position (or whatever it's called). Is there any reason why it wouldn't work?
>> replaces the Gtid_Pos field in SHOW ALL
I think that showing the value in SHOW ALL SLAVES STATUS doesn't hurt, and may even be beneficial from the usability perspective, so, if it comes at a low price, it could stay there as well.
>> This makes it clear that GTID state is global on the server, separate from any
Yes, if my current understanding of how things are meant to work is anywhere close to the truth, the proposed changes sound quite logical.
>> > RESET SLAVE is supposed to do just that, it is defined as
>> I just read the documentation, indeed that is what it says. But it's rubbish,
Possibly, but that's how things used to work, and I'm pretty sure a number of people used it in their own, however tricky, ways – either in "toy" (in fact, just low-traffic) setups, or in conjunction with RESET MASTER (on master), etc. It wouldn't be very kind to make radical changes in the way things are supposed to work, especially because it's not easy to explain at a high level why the algorithm has to be different with and without GTID.
>> I won't say this is good behaviour, but it at least seems consistent with how
Yes, I think it's better to keep it consistent with the old behavior for the time being.
>> I would prefer giving an error in case no position specified, but that
Personally, I don't see a big tragedy in doing RESET SLAVE without providing a master position afterwards. I mean, it seems natural to consider it as a shortcut for master_pos/master_log_file=<start from the beginning of whatever we have>. The absence of massive complaints about the slave starting after RESET from a non-zero position due to previously purged master binlogs indirectly confirms that people don't have a problem with this. So, I'd rather keep it as is.
>> At least, if we can educate the user that GTID state is set separately with
Right. Although, same way as in the previous note about the old-style master position, I wouldn't find it wrong if we considered the "empty" GTID value the default and used it if nothing else was previously set; but if you prefer insisting on always setting it, either manually or through automatic discovery, I don't have strong objections against it, either.
>> Just as Gtid_Pos in SHOW ALL SLAVES STATUS is weird, because
Hm.. Actually, I don't see anything weird in showing GTID position in SHOW ALL SLAVES STATUS (as opposed to SHOW SLAVE STATUS), exactly because it's global for all slaves.
>> > decided to move my master server to
>> user
>> Once you see the error and issue CHANGE MASTER TO MASTER_GTID_POS='' (or
>> The "recommended" way to do the above would be to copy the binlog files along
That's exactly the case when I'm concerned about owners of simple setups, and how things become somewhat more complicated for them, or at least different.
>> > As a User3, I want to create a multi-source setup.
>> You do not need to specify master_gtid_pos='' in the second CHANGE
>> CHANGE GTID TO '';
Yes, it's much clearer this way. My point was, I'd expect slaves to be symmetrical, while it was very much not so before.
>> It would be nice if I could implement that one
Is it difficult to implement? Frankly, I thought that auto means pretty much that... Even more so if we have CHANGE MASTER .. master_using_gtid=1|0, where 1 throws an error when the GTID position is not set; then it would be logical to also have master_using_gtid=auto (or SET GLOBAL gtid = 'auto', whichever is more reasonable from the implementation perspective), which would mean that the slave connects with an old-style position, acquires the GTID position, sets it, and further connects using it.
>> hopefully we are getting closer to
You never know, maybe it turns out "perfect enough" at the end.. Although, of course, nothing is ever as perfect as we initially hope |
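The connection behaviour proposed in the comment above could look roughly like this in Python. The `master_using_gtid` values 0, 1, and auto are taken from the comment; the function name, the return labels, and the use of `None` for a position that was never set are assumptions of this sketch, not the server's implementation.

```python
from typing import Optional


def choose_connect_mode(master_using_gtid: str, gtid_pos: Optional[str]) -> str:
    """Decide how the slave connects, following the proposal in the comment.

    master_using_gtid: '0', '1', or 'auto'.
    gtid_pos: the global GTID position, or None if it was never set
              (hypothetical representation; '' would be an explicit empty position).
    Returns 'file_pos' for an old-style connection or 'gtid' for a GTID one.
    """
    if master_using_gtid == "0":
        return "file_pos"
    if master_using_gtid == "1":
        if gtid_pos is None:
            # "1 throws an error when the GTID position is not set"
            raise ValueError("GTID position is not set")
        return "gtid"
    if master_using_gtid == "auto":
        # First connection is old-style; once a GTID position has been
        # acquired and stored, later connections use it.
        return "file_pos" if gtid_pos is None else "gtid"
    raise ValueError(f"unknown master_using_gtid value: {master_using_gtid!r}")
```

Under this reading, auto differs from 1 only on the very first connection: it falls back to the old-style position instead of raising an error.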
| Comment by Kristian Nielsen [ 2013-06-07 ] |
|
I believe all of these issues should be resolved, as well as possible |