[MDEV-9107] GTID Slave Pos of untrack domain ids being updated Created: 2015-11-09  Updated: 2018-07-24  Resolved: 2018-07-21

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.1.8, 10.1.10
Fix Version/s: 10.1.35

Type: Bug Priority: Major
Reporter: Rodrigo Severo Assignee: Sachin Setiya (Inactive)
Resolution: Won't Fix Votes: 1
Labels: galera

Attachments: File mdev-9107.diff    
Issue Links:
Relates
relates to MDEV-9033 Incorrect statements binlogged on sla... Closed
relates to MDEV-9108 "GTID not in master's binlog" error w... Open
relates to MDEV-20720 Galera: Replicate MariaDB GTID to oth... Closed

 Description   

Let's consider a 3-master setup where each server has 2 replication channels, one to each of the other 2 servers. These replication channels were set up with:

SETTING: Server_id: 1 IP: 10.0.3.223

STOP ALL SLAVES;
CHANGE MASTER "S1_R2" TO
master_host = "10.0.3.136",
master_user = "replicator",
master_use_gtid = slave_pos,
master_password = "password",
do_domain_ids = (2);
CHANGE MASTER "S1_R3" TO
master_host = "10.0.3.171",
master_user = "replicator",
master_use_gtid = slave_pos,
master_password = "password",
do_domain_ids = (3);
START ALL SLAVES;

SETTING: Server_id: 2 IP: 10.0.3.136

STOP ALL SLAVES;
CHANGE MASTER "S2_R1" TO
master_host = "10.0.3.223",
master_user = "replicator",
master_use_gtid = slave_pos,
master_password = "password",
do_domain_ids = (1);
CHANGE MASTER "S2_R3" TO
master_host = "10.0.3.171",
master_user = "replicator",
master_use_gtid = slave_pos,
master_password = "password",
do_domain_ids = (3);
START ALL SLAVES;

SETTING: Server_id: 3 IP: 10.0.3.171

STOP ALL SLAVES;
CHANGE MASTER "S3_R1" TO
master_host = "10.0.3.223",
master_user = "replicator",
master_use_gtid = slave_pos,
master_password = "password",
do_domain_ids = (1);
CHANGE MASTER "S3_R2" TO
master_host = "10.0.3.136",
master_user = "replicator",
master_use_gtid = slave_pos,
master_password = "password",
do_domain_ids = (2);
START ALL SLAVES;
 

After initially starting all replications:
1. Stop replication channel S1_R2;
2. Take note of GTID Slave Pos for domain ID 2 on server 1;
3. Issue some INSERT/UPDATE/DELETE on server 2;
4. Take note of GTID Slave Pos for domain ID 2 on server 1;

Observe that the GTIDs from steps 2 and 4 are different. Replication channel S1_R3 updated the GTID Slave Pos of domain ID 2 despite having been configured to track only domain ID 3!

When replication channel S1_R2 is brought back online, the changes that occurred in step 3 will be lost on server 1.
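The reproduction above can be sketched with the connection names from the setup (a sketch only: the table test.t1 is a hypothetical example and the GTID values are illustrative):

```sql
-- Step 1: on server 1, stop the channel that replicates domain 2
STOP SLAVE 'S1_R2';

-- Step 2: on server 1, note the slave position for domain 2
SELECT @@gtid_slave_pos;   -- e.g. '2-2-100,3-3-50'

-- Step 3: on server 2 (gtid_domain_id = 2), make a change
INSERT INTO test.t1 (i) VALUES (1);

-- Step 4: on server 1, check again
SELECT @@gtid_slave_pos;   -- the domain-2 coordinate has advanced,
                           -- although only S1_R3 (domain 3) is running
```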

The solution for this issue seems to be to make each replication channel thread update only the GTID Slave Pos for the domain IDs it should track, as defined by {ignore|do}_domain_ids.



 Comments   
Comment by Elena Stepanova [ 2015-11-10 ]

I think there might be some confusion here.
If I understand the meaning of do_domain_ids correctly, it does not mean that other domain IDs should not be tracked. It means that events from other domain IDs should not be executed and replicated.
Consider a much simpler scenario.
You have an ordinary master -> slave replication setup.
Master uses two different domain IDs to run its flow.
The slave is configured to do_domain_ids=(1).
That is, master does something like this:

set gtid_domain_id=1;
create table t1 (i int);
insert into t1 values (1);
set gtid_domain_id=2;
insert into t1 values (2);
insert into t1 values (3);
set gtid_domain_id=1;
insert into t1 values (4);

And slave has this:

select * from t1;
i
1
4
show binlog events;
Log_name	Pos	Event_type	Server_id	End_log_pos	Info
slave-bin.000001	4	Format_desc	2	249	Server ver: 10.1.8-MariaDB-log, Binlog ver: 4
slave-bin.000001	249	Gtid_list	2	274	[]
slave-bin.000001	274	Binlog_checkpoint	2	313	slave-bin.000001
slave-bin.000001	313	Gtid	1	351	GTID 1-1-1
slave-bin.000001	351	Query	1	437	use `test`; create table t1 (i int)
slave-bin.000001	437	Gtid	1	475	BEGIN GTID 1-1-2
slave-bin.000001	475	Query	1	563	use `test`; insert into t1 values (1)
slave-bin.000001	563	Query	1	632	COMMIT
slave-bin.000001	632	Gtid	1	670	BEGIN GTID 1-1-3
slave-bin.000001	670	Query	1	758	use `test`; insert into t1 values (4)
slave-bin.000001	758	Query	1	827	COMMIT
select * from mysql.gtid_slave_pos;
domain_id	sub_id	server_id	seq_no
2	4	1	2
1	2	1	2
1	3	1	3

And it has been going on this way for a while.
Then, at some point you decide you want to start replicating the other domain ID as well. You change master to do_domain_ids=(), and it starts replicating from the current position.
But if it had not been tracked, it would have to start from the first event in this GTID domain, which is hardly ever desirable.
So, it seems natural that the position is tracked.

However, this is just my understanding which can be wrong. I will assign it to nirbhay_c to confirm (or object); and in any case, this point, executed vs tracked, should be clearly explained in the documentation.

Besides, while looking into this, I encountered the problem described in MDEV-9113. I would like to know if you experience it, and if not, what your configuration is that allows you to avoid it.

Comment by Rodrigo Severo [ 2015-11-10 ]

Elena, first of all thanks for taking your time to deal with this issue and for sharing your thoughts.

I see this can get much more abstract than I was expecting, but you are right, there are more points of view on this issue than I initially thought.

About the scenario you proposed I have a few observations. It's certainly simpler because it involves fewer servers and far fewer replication channels, but:

  1. I personally have difficulty imagining it happening in real life (I'm not saying it doesn't, just that I really can't picture a real-life situation similar to it);
  2. you say that "if it had not been tracked, it would have to start from the first event in this GTID domain, which is hardly ever desirable. So, it seems natural that the position is tracked." I'm not sure about that: if it had not been tracked and our user later decides to start tracking domain ID 2, he would have the slave truly equal to the master. That final result seems desirable to me. You say that having the slave differ from the master is more natural. Which of the two final situations is more desirable and/or natural I can't really say (especially as I can't picture this scenario in real life), but it seems strange to me to elect either one of them as much more desirable/natural than the other. Both seem possible and, in some cases, desirable.
  3. last but not least, I think this scenario is really fragile. It seems to make sense with the simple sequence of commands you proposed, but if we change the commands a little, to other rather simple and common ones like those below, the option of later deciding to start replicating domain ID 2 becomes completely unfeasible. By this I just want to point out that even if there is a real-life scenario where something like this would be desired, it has many more difficulties to deal with than just the untracked nature of GTIDs of unreplicated domain IDs.

Alternative commands:

set gtid_domain_id=1;
create table t1 (id int AUTO_INCREMENT, i int, PRIMARY KEY (`id`));
insert into t1 (i) values (1);
set gtid_domain_id=2;
delete from t1 where id=1;

Now our user starts to also replicate domain id 2 in the slave and then he issues the following command:
insert into t1 (id, i) values (1, 4);

On the master everything will be ok; on the slave we will get a duplicate entry error.

But more importantly to me: if {do|ignore}_domain_ids just sets the domain IDs as unexecuted and unreplicated but keeps them tracked, I can't see how to implement something like the scenario I'm trying to deal with, where I have 3 masters, each with 2 replication channels to the other 2 masters. It's not a question of which behaviour is more desirable. It becomes impossible. Or am I missing something?

Comment by Elena Stepanova [ 2015-11-10 ]

rsevero,

First of all, regarding

I think this scenario is really fragile as it seems to make sense with the simple sequence of commands you proposed but if we change the commands a little bit to some other rather simple and common ones as the ones below the option of latter deciding to start replicating domain id 2 gets completely unfeasible

The scenario was not literal of course; it was schematic, just to make a point. But certainly, all kinds of playing with replication topology and settings, apart from simple default ones, assume that the operators know what they are doing; it's their responsibility to keep the configuration consistent. If we don't assume that, then forget local domain IDs: your whole replication setup is not viable. What if we start populating the same table on S1, S2, S3 with conflicting data? Obviously, you expect that will never be the case – and the same goes for using several domain IDs on the same server.

I personally have difficulty in imagining it happening in real life situations (I'm not saying it doesn't just that I really can't picture a real life situation similar to it)

Okay, maybe it was too schematic. Let me build a story around it.
Let's say you have servers S1, S2, S3.
S1 is a powerful enough server where your clients connect, so it has enough production data inflow.
It also runs some daily analytics using this data, and cleans up and re-populates an analytics table.
S2 replicates the main data inflow from S1. It is an average server which you use as a backup (and maybe tepid standby) of the production data.
S3 replicates the analytics results from S1. It is a weak server which cannot run the analytics job itself, which is why it is done on the main server, but it's enough for your backoffice to connect and pull the data, and you don't want them to connect to the main client-facing server.

So, on S1 you have the production data updates under default domain ID 1, and analytics data is updated under domain ID 100.
S2 is configured to replicate domain ID 1 from S1, and S3 replicates domain ID 100.

It has been working so for quite some time.

One shiny weekend you decide to go green and not to waste electricity on a separate server just so the backoffice could pull tiny data. You still don't want them to connect to S1, as it's production, but you decide it should be fine to have their data on S2 – after all, it's a backup server, nobody else goes there.
Remember, since it's just daily (weekly, monthly) statistics, you have no interest in historical data (and if you did, and if it was stored, you could have easily backed up the table contents from S3). So, you reconfigure S2 to do_domain_ids=(), and it does just what you expect – it starts replicating from the current moment.
Now, what would have happened if it started replicating from the beginning, let's say 5 years ago? Best case, you would have gotten a lot of unnecessary garbage replicated (and immediately discarded); but in fact, it's unrealistic – your binary logs have been purged hundreds of times, so you don't really have "the beginning" in the binlogs, and you are not choosing between "from the very beginning" and "from the current moment". Much more likely you have some intermediate state, with some binlogs purged and some not, so you would have gotten some random start point. I don't think anyone can seriously plan for that.

Disclaimer: once again, until Nirbhay confirms it, this is just my understanding of the design of do_domain_ids; it can easily be wrong.

But in the unlikely scenario that you do have all binary logs, or that you actually know from which point you want to start, you can always set @@gtid_slave_pos to the desired value, and it will go back to that point to replicate from it.

I can't see how to implement something like the scenario I'm trying to deal with where I have 3 masters each with 2 replication channels for the other 2 masters. It's not a question of which behaviour is more desirable. It becomes impossible.

The problem is, I don't really know what exactly you are trying to achieve. I understand your replication topology, but I don't know why you are setting do_domain_ids. If you want to avoid the same event bouncing back and forth between the servers, gtid-ignore-duplicates seems to be a much easier way, but you are not using it, so I assume there are other considerations. If you explain what they are, maybe we could come up with some ideas...
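For reference, the gtid-ignore-duplicates approach mentioned above might be switched on roughly like this (a minimal sketch; gtid_ignore_duplicates is a real MariaDB global variable, but the exact procedure depends on the topology, and it assumes each original master already uses its own unique gtid_domain_id):

```sql
-- A sketch of enabling gtid_ignore_duplicates on a multi-source slave
-- (assumes all connections use master_use_gtid = slave_pos, as in the
-- setup above). Safest with all slave connections stopped.
STOP ALL SLAVES;
SET GLOBAL gtid_ignore_duplicates = ON;
START ALL SLAVES;
```

With gtid_ignore_duplicates=ON, an event in a given GTID domain is applied only once no matter over how many channels it arrives, which removes the need for do_domain_ids filtering in a ring or mesh topology like this one.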

Comment by Rodrigo Severo [ 2015-11-10 ]

Ok, you got me; what you just described looks like a real scenario.

First let me explain what I'm trying to implement:

That would be a 6-server setup: three in one city and three in another. All of them working as masters.

The three servers in each city would have 2 replication channels to the other 2, so changes done on any of them would be replicated with minimum delay to the others in the same city.

One server on each city would have an extra replication channel to one server on the other city to get all changes done on the other city.

This setup would:

  • minimize replication delays;
  • reduce overstress on any specific server;
  • allow me to increase the number of servers if necessary;
  • not require super hardware.

About the scenario you presented:

If {do|ignore}_domain_ids were to work as I understand they should, making each replication channel actually care about only the domain IDs specified (i.e., asking for, replicating, executing and tracking only those domain IDs), then to start the replication of domain ID 100 on S2 you would only have to find on S1 the current GTID for domain ID 100 and set that value in the GTID Slave Pos variable of server S2. After that you would have the exact same result and behaviour you mentioned. Obviously, for this particular situation a much more elegant solution would be to implement a master_pos option for CHANGE MASTER's MASTER_USE_GTID definition.
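Under that reading, adding domain ID 100 on S2 would look roughly like this (a sketch; the GTID values are illustrative, and it assumes S2's existing domain-1 coordinate is carried over into the new value, since SET gtid_slave_pos replaces the whole position):

```sql
-- On S1: read the current binlog position, which carries the latest
-- GTID for every domain, including domain 100
SELECT @@gtid_binlog_pos;   -- e.g. '1-1-5000,100-1-730'

-- On S2: splice the domain-100 coordinate into the slave position
-- before restarting the channel that should replicate it
STOP SLAVE;
SET GLOBAL gtid_slave_pos = '1-1-5000,100-1-730';
START SLAVE;
```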

Comment by Elena Stepanova [ 2015-11-10 ]

Rodrigo Severo,

About the scenario you presented...
If
do|ignore_domain_ids were to work as I understand they should - making each replication channel actually care about only the domain ids specified, i.e., asking for GTIDs, replicating, executing and tracking only the domain ids specified - to start the replication of domain id 100 on S2 you would only have to find in S1 which is the current GTID for domain id 100 and set that value in the GTID Slave Pos variable of server S2

Surely, and the same the other way round – if you want to start replication not from the current position, but from an earlier position, you would only have to set that value in the GTID Slave Pos variable; so, I don't see any problem with the current design (again, if it is a design, we are still not sure). MDEV-9033 is an obvious bug, I don't think it's caused by tracking/not tracking domains which are outside do_domain_ids. For MDEV-9108, if it confirms, it might indeed be a design problem, but I'll comment there separately, I don't want to mix up these two issues just yet.

Back to what you are trying to achieve. I still don't understand why you need do_domain_ids for this. Did you try to do the same without them (with gtid-ignore-duplicates of course)? If you did, what was the problem you encountered that made you switch to do_domain_ids?
Taking into account your ultimate goal ("minimize replication delays"), filtering by domain IDs is not even the most efficient setup. Consider this:
Inside the same city, you have S1, S2, S3 replicating from each other.
An event was executed on S1.
For some technical reason, the connection from S1 to S2 is currently slow, while S1 => S3 and S3 => S2 work fine. So, the event could have reached S2 faster if it went via S3 (S1 => S3 => S2). But your setup does not allow it: the event coming from S3 will be ignored, and S2 will only get it when it makes it through the slow S1 => S2 connection, thus creating additional replication delay.

BTW you didn't mention it, but I assume you do actually care about splitting the load so that there are no conflicting concurrent updates on the servers, right?

Comment by Rodrigo Severo [ 2015-11-11 ]

Using gtid-ignore-duplicates I could prevent being affected by the 3 issues I reported: MDEV-9033, MDEV-9107 and MDEV-9108. Great! Thanks!!

Comment by Elena Stepanova [ 2015-11-11 ]

Okay, good. So, the remaining question in this issue (MDEV-9107) is whether it is by design or not that domains outside do_domain_ids list are being tracked in gtid_slave_pos. The issue has already been assigned to nirbhay_c who can answer that.

Comment by Rodrigo Severo [ 2016-01-04 ]

Sorry, I had my tests wrong. I had "log_slave_updates" turned off.

I have just retested this issue with "log_slave_updates" on as it should and this issue still exists on MariaDB 10.1.10.

This issue should NOT be closed.

And to clarify why this issue is a problem that should be solved, consider the situation described in the first report of this bug.

If, after performing all the steps mentioned in the first report, I restart replication channel S1_R2, server S1 won't have the changes done in step 3 above. This is not acceptable.

Comment by Rodrigo Severo [ 2016-01-06 ]

I believe there was some expectation that fixing MDEV-9033 would fix this issue, but unfortunately it didn't happen.

Comment by Esa Korhonen [ 2018-04-03 ]

I have now noticed this bug as well, although in a simpler setting. All it needs is a master server which changes its gtid_domain_id, and a slave which is only replicating the old domain (with the DO_DOMAIN_IDS setting). The slave will update its gtid_slave_pos to include the new domain (5), even though in reality it does not apply the events from the new domain:

@@gtid_binlog_pos  | 0-3001-7139
@@gtid_slave_pos   | 0-3001-7139,5-3001-10
@@gtid_current_pos | 0-3001-7139,5-3001-10

This means that even gtid_current_pos cannot be trusted to be correct. gtid_binlog_pos and gtid_binlog_state do seem to be correct, but these require log_slave_updates. This has implications for the failover functionality in MaxScale. Server version: 10.2.6

Comment by Sachin Setiya (Inactive) [ 2018-07-20 ]

Hi esa.korhonen!

In your case you can just update gtid_slave_pos so that the slave gets all the events from the master. Let's say the master added 3 events in a new domain id X, and the old domain id was Y. You can simply set gtid_slave_pos = "Y-server_id-seq_no"; the slave will then forget all its tracking of domain id X, and you will get all the events.
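Concretely, with the values from Esa's example above (domain 5 tracked but never applied), the workaround would be (a sketch):

```sql
-- The slave shows a coordinate for a domain it never applied:
--   @@gtid_slave_pos = '0-3001-7139,5-3001-10'
STOP ALL SLAVES;
-- Keep only the domain(s) the slave actually replicates; dropping
-- the 5-3001-10 coordinate makes the slave fetch domain 5 from scratch
SET GLOBAL gtid_slave_pos = '0-3001-7139';
START ALL SLAVES;
```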

Comment by Sachin Setiya (Inactive) [ 2018-07-21 ]

Hi rsevero

According to the documentation, do_domain_ids works as expected, so I am closing this issue as Won't Fix.

Comment by Sachin Setiya (Inactive) [ 2018-07-21 ]

Test case patch mdev-9107.diff

Generated at Thu Feb 08 07:32:11 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.