> So, I experimented a bit, trying to abstract myself from implementation
> details and imagine possible user expectations.
Excellent analysis! It helped me a lot to get a better overview of where we
are.
> I start a fresh new pair and configure the slave
> CHANGE MASTER TO master_host= ..., ..., master_gtid_pos='';
> (or master_gtid_pos=auto, it shouldn't matter at this point, right?)
Right.
> For me master_gtid_pos is a parameter which defines the replication position
> – same way master_log_pos and master_log_file did before, so it's quite
> natural to have it in CHANGE MASTER (actually I don't know why I should
> provide it – I don't have to set default values of master_log_file/pos, but
> maybe it's because I need to indicate I want to use GTID now).
Yes, it is to indicate using GTID.
Actually, you do have to set default values of master_log_file/pos in normal
replication; it is a mis-feature that one can omit them. If the master has
purged any binlogs, you get to start from whatever random position is the
first non-purged file, which will certainly and silently corrupt your
replication.
It is quite deep in the design that GTID state is a global property of the
server, not a per-slave-connection position. This is needed for example for
multi-source. It is possible with MASTER_GTID_POS=AUTO to switch e.g. from
having two masters to having a single master that itself replicates from the
original two masters. Do you think it will be possible to explain this to
users, or is it hopelessly complicated and will need to be re-designed
completely?
Now with your analysis, I am thinking that I did this incorrectly with CHANGE
MASTER and GTID. Maybe it should instead be like this:
- A new command CHANGE GTID TO "0-1-2". This requires all slaves to be
stopped. It replaces CHANGE MASTER TO MASTER_GTID_POS="0-1-2".
- A new command SHOW GTID STATUS, replaces the Gtid_Pos field in SHOW ALL
SLAVES STATUS.
- In CHANGE MASTER, one must now do MASTER_USE_GTID=1. This gives an error if
no GTID position is set (either manually with CHANGE GTID, or downloaded
automatically by connecting slave to master with old non-GTID position).
This makes it clear that GTID state is global on the server, separate from any
slave connection configuration. And clear that the individual slave connection
can be using GTID to connect (MASTER_USE_GTID=1) or old style position
(MASTER_USE_GTID=0).
What do you think? I now understand that this is how I meant things to work,
though I never formulated it explicitly like this before.
> RESET SLAVE is supposed to do just that, it is defined as
> "makes the slave forget its replication position in the master's binary log. This statement is meant to be used for a clean start",
> and it used to do just that; but it doesn't anymore.
I just read the documentation, indeed that is what it says. But it's rubbish,
isn't it? Except for toy setups where one keeps all binlogs on the master
forever, it doesn't work. Or am I missing something?
But there is clearly a bug here! RESET SLAVE should remove Using_Gtid, it does
not, shame on me. I've fixed and pushed.
Now, if user does RESET SLAVE and then START SLAVE, things will
"work". Replication will start from the first binlog file on the master,
without using Gtid.
> I stopped slave, dropped t1, ran RESET SLAVE, and started it again, as I'd always done before.
> If the next statement on master does something with t1, my replication will abort (the table doesn't exist), so I will at least know about the problem.
Right, this was a bug, fixed now.
Now, replication will start without using GTID, from the first binlog file on
the master. If some binlogs were purged, the same silent corruption may occur.
If all binlogs were kept on the master, things will be ok, but it will no
longer be using GTID.
I won't say this is good behaviour, but it at least seems consistent with how
it worked before. Or what do you think?
> But if master continues with a different table
> create table t2 (i int)
> and keeps working with it, I might never know that I don't have t1 on slave anymore – until it's too late (master died, binlogs are gone, etc.)
Yeah. I would prefer giving an error in case no position specified, but that
is probably out due to backwards compatibility?
At least, if we can educate the user that GTID state is set separately with
CHANGE GTID, it should be clearer that CHANGE MASTER MASTER_USE_GTID=1 starts
from whatever SHOW GTID STATUS displays, and that RESET SLAVE reverts to
MASTER_USE_GTID=0.
> CHANGE MASTER TO master_gtid_pos = '';
>
> That's weird, RESET SLAVE [ALL] is perceived as a reverse command for CHANGE MASTER.
Yes, it is weird. Just as Gtid_Pos in SHOW ALL SLAVES STATUS is weird, because
it is not per-slave, it is global.
Let me hear your opinion on CHANGE GTID / SHOW GTID STATUS / MASTER_USE_GTID,
and if we agree then I will change implementation to that.
> Now, back to our side for a minute: we have clearly changed semantics of
> RESET SLAVE: earlier it would make slave forget the position, now (with
> GTID) it doesn't. But what does it do, then?
With the above bug fixed, now it sets also Using_Gtid=0.
> It's basically the same as the story for User1, only here I didn't do
> anything bad, I just at some point decided to move my master server to
> another host. Slave is fully synchronized, backups are in place, so I just
> stop replication, shut down master, move the data files (but not binlogs, I
> don't need them) to the new host, start master – effectively, it's the same
> as RESET MASTER. I do RESET SLAVE ALL since I need to forget both the
> position and connection parameters, set up replication again, start slave...
With the above bug fixed, things should work, but you will no longer be using
GTID.
If you add MASTER_GTID_POS=AUTO to the CHANGE MASTER command, you should get
an error that master is missing the GTID requested by the slave. But user
needs to be aware that RESET MASTER (or your above equivalent) is dangerous
with GTID. Because it starts GTID generation from scratch, so now you have
duplicate GTIDs in your system, unless you carefully remove the old ones
everywhere. At least you get an error message in most cases rather than silent
corruption.
Once you see the error and issue CHANGE MASTER TO MASTER_GTID_POS='' (or
CHANGE GTID TO ''), things should work again.
The "recommended" way to do the above would be to copy the binlog files along
also (maybe purge all logs but the latest first). Then there would be no need
for RESET SLAVE, just CHANGE MASTER TO the new host and port, and GTID would
connect automatically at the correct position (that's the whole point of GTID,
to find position automatically on new master, right?). Of course this is
untested, but it should work, I will add a test case for this.
Does that sound ok? Any suggestions for improvement?
> As a User3, I want to create a multi-source setup.
> I configured m1 as
> CHANGE MASTER 'm1' master_host=.., ..., master_gtid_pos=''
> started the slave 'm1', it has been working for a while for now.
> Now I want to add another master. I do exactly the same: I run
> CHANGE MASTER 'm2' ... master_gtid_pos='';
You do not need to specify master_gtid_pos='' in the second CHANGE
MASTER. This will be clearer with the change to CHANGE GTID:
CHANGE GTID TO '';
CHANGE MASTER 'm1' ... master_use_gtid=1;
CHANGE MASTER 'm2' ... master_use_gtid=1;
> One has to learn the hard way, so I fix the data, restart m1, configure m2 with master_gtid_pos=auto (it should work, right?).
Yes.
> Then I become User1 or User2 in regard to one of my slaves. Lets say I want to make m2 start from the beginning. How do I do that?
First, to use multi-source with GTID, you have to set up the two different
masters with different domain ids. Let's say gtid_domain_id=1 for m1 and
gtid_domain_id=2 for m2.
Then you need to get the current GTID state, using SHOW ALL SLAVES STATUS
(or SHOW GTID STATUS, with the change proposed above). Let's say it is
"1-10-100,2-11-200".
Now you want to start from the beginning of domain 2 (the domain of m2). So
you need to remove that domain from the state:
CHANGE MASTER TO MASTER_GTID_POS="1-10-100"
(or CHANGE GTID TO "1-10-100").
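To make the state manipulation concrete, here is a rough Python sketch of "remove one domain from the state". It is only an illustration, not server code: the helper names are made up for this mail, and it assumes the state is a comma-separated list of domain-server-seq_no triples as in the example above.

```python
def parse_gtid_state(state):
    """Parse e.g. "1-10-100,2-11-200" into {domain_id: (server_id, seq_no)}."""
    parsed = {}
    for item in state.split(","):
        item = item.strip()
        if not item:
            continue  # an empty state string means "no domains"
        domain_id, server_id, seq_no = (int(x) for x in item.split("-"))
        parsed[domain_id] = (server_id, seq_no)
    return parsed


def format_gtid_state(parsed):
    """Turn the dict back into the comma-separated string form."""
    return ",".join(
        f"{domain_id}-{server_id}-{seq_no}"
        for domain_id, (server_id, seq_no) in sorted(parsed.items())
    )


def drop_domain(state, domain_id):
    """Return the state with one domain removed (hypothetical helper)."""
    parsed = parse_gtid_state(state)
    parsed.pop(domain_id, None)
    return format_gtid_state(parsed)


# The string to pass as the new GTID position in the example above.
print(drop_domain("1-10-100,2-11-200", 2))
```

The resulting string is what one would then give as the new position, so that domain 2 starts from the beginning while domain 1 keeps its place.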
Alternatively, you can start m2 slave from the start of the m2 binlogs,
without using GTID:
CHANGE MASTER 'm2' TO master_log_file='', master_log_pos=0;
Then it will download the correct gtid position and update it
automatically. Then the next time you change master for m2 you can use
MASTER_GTID_POS=AUTO again. It would be nice if I could implement a way to
ask for connecting the first time with an old-style position, but with GTID
the next time.
> It's already sad, but will be even sadder if I have 10 sources, or 20...
Yes, perhaps a bit sad. I did at one point consider that MASTER_GTID_POS would
only change the domains mentioned, and leave all other domains intact. And one
would need to set seq_no to zero to remove a domain
(MASTER_GTID_POS="1-10-100,2-11-0"). But I thought that was too magic, and
users could always specify the full GTID state if they wanted to keep some domains.
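For reference, the alternative I rejected would have behaved roughly like the following Python sketch. This is purely illustrative of the semantics described above (touch only the mentioned domains, seq_no zero removes a domain); the function names are invented for this mail and nothing like this is implemented.

```python
def parse_gtid_state(state):
    """Parse e.g. "1-10-100,2-11-200" into {domain_id: (server_id, seq_no)}."""
    parsed = {}
    for item in state.split(","):
        item = item.strip()
        if not item:
            continue
        domain_id, server_id, seq_no = (int(x) for x in item.split("-"))
        parsed[domain_id] = (server_id, seq_no)
    return parsed


def format_gtid_state(parsed):
    """Turn the dict back into the comma-separated string form."""
    return ",".join(
        f"{domain_id}-{server_id}-{seq_no}"
        for domain_id, (server_id, seq_no) in sorted(parsed.items())
    )


def merge_gtid_state(current, update):
    """Apply 'update' on top of 'current' (hypothetical, rejected semantics).

    Domains not mentioned in 'update' are left intact; a mentioned domain
    with seq_no 0 is removed; any other mentioned domain is overwritten.
    """
    merged = parse_gtid_state(current)
    for domain_id, (server_id, seq_no) in parse_gtid_state(update).items():
        if seq_no == 0:
            merged.pop(domain_id, None)
        else:
            merged[domain_id] = (server_id, seq_no)
    return format_gtid_state(merged)


# Removing domain 2 while leaving domain 1 untouched.
print(merge_gtid_state("1-10-100,2-11-200", "2-11-0"))
```

With 10 or 20 sources this would let one mention only the domain being reset, at the cost of the seq_no=0 special case.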
Hm, a lot longer reply than I intended. But hopefully we are getting closer to
something that is at least workable, if not as perfect as I had hoped
initially ...