[MDEV-13431] wsrep_gtid_mode uses wrong GTID for transaction committed by slave thread Created: 2017-08-02 Updated: 2020-08-25 Resolved: 2017-12-25 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera, Replication, Storage Engine - InnoDB |
| Affects Version/s: | 10.1.25 |
| Fix Version/s: | 10.1.31 |
| Type: | Bug | Priority: | Major |
| Reporter: | Geoff Montee (Inactive) | Assignee: | Sachin Setiya (Inactive) |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | galera, gtid, replication, wsrep |
| Issue Links: |
|
| Description |
|
This is similar to an earlier report. When wsrep_gtid_mode is enabled, transactions that are replicated within a cluster by Galera receive a GTID in which the domain_id is taken from wsrep_gtid_domain_id, the server_id is taken from server_id, and the seq_no is incremented for each transaction committed in the domain. It does not seem to work this way for transactions that are applied by an asynchronous slave thread within a Galera cluster.

For example, let's say that we have two clusters, and one cluster replicates from the other using GTID replication. On cluster1, we see the following:
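The settings listings from the original issue are not preserved in this export. A minimal my.cnf sketch consistent with the GTIDs quoted below (cluster1 in domain 1, cluster2 in domain 2) might look as follows; all concrete values here are illustrative assumptions, not taken from the original report:

```ini
# Illustrative my.cnf fragment for a cluster1 node (values are assumptions
# inferred from the GTIDs quoted below, not from the original report).
[mysqld]
wsrep_gtid_mode      = ON
wsrep_gtid_domain_id = 1      # cluster2 nodes would use 2
server_id            = 1      # unique per node
log_bin              = mysql-bin
log_slave_updates    = ON
```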
On cluster2, we see the following:
One node in cluster2 is a slave of one node in cluster1:
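The exact commands are not preserved in this export; a typical GTID-based slave setup on that cluster2 node would be something like the following, where the host, port, user, and password are placeholders rather than values from the original report:

```sql
-- Run on the cluster2 node that acts as an async slave of cluster1.
-- Host, port, and credentials are illustrative placeholders.
CHANGE MASTER TO
  MASTER_HOST = 'cluster1-node1',
  MASTER_PORT = 3306,
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'repl_password',
  MASTER_USE_GTID = slave_pos;
START SLAVE;
```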
If we commit a transaction on cluster1, we would expect it to have the GTID 1-1-2 on cluster1, and either 2-1-3 or 2-2-3 on cluster2, depending on whether cluster2 keeps the server_id of the originating cluster or replaces it with its own. Does that actually happen? Let's say that we execute the following on cluster1:
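The statement itself is not preserved here; any single write transaction would exercise the path described, for example (database and table names are illustrative, and the table is assumed to already exist):

```sql
-- Hypothetical single-statement transaction on cluster1 that should be
-- assigned GTID 1-1-2 there (assumes test.t1 exists; names illustrative).
INSERT INTO test.t1 VALUES (1);
```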
What GTID does this transaction have on each cluster? Here's the binlog event on the node in cluster1 where the transaction originated:
And here's the binlog event on the node in cluster2 that is acting as a slave to cluster1:
And here's the binlog event on another node in cluster2:
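The event listings themselves are not preserved in this export. On each node they can be obtained with, for example (binlog file name is illustrative):

```sql
-- Inspect the GTID events in the binary log; the Gtid event line shows
-- the domain-server-seqno triple discussed here.
SHOW BINLOG EVENTS IN 'mysql-bin.000001';
-- Or just compare the current GTID positions across nodes:
SELECT @@gtid_binlog_pos, @@gtid_current_pos;
```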
So the transaction has the expected GTID in cluster1 and on the non-slave nodes in cluster2, but an unexpected GTID on the slave node in cluster2. |
| Comments |
| Comment by Andrii Nikitin (Inactive) [ 2017-08-08 ] |
|
I confirm the problem with the script below, which sets up two local clusters on ports (3307,3308) and (3310,3311). The topology is as follows:
The script will also download and unpack the 10.1.25 tarball into _depot/m-tar/10.1.25
Please note that initially the GTID is correct in cluster2 for the first two inserts into both nodes of cluster1:
Direct Slave:
And the other node in cluster2:
So far so good, and GTID works as expected (5-1-4 and 5-2-5 on both nodes in cluster2). But once we execute a transaction directly in cluster2, the other node in cluster2 shows a different GTID (below, 5-1-6 vs 5-1-7). Master from cluster 1 (m1):
Direct Slave from cluster2 (m4):
The other node from cluster2 (m5):
And the content of the GTID position variables at the end:
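The listing itself is not preserved in this export; the variables in question can be read on each node with:

```sql
-- GTID state variables relevant to this report:
SELECT @@gtid_binlog_pos, @@gtid_slave_pos, @@gtid_current_pos;
```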
| Comment by Przemek [ 2017-08-10 ] |
|
I can confirm this problem with MariaDB Server 10.1.25, also in a simpler case: one async master and a 2-node Galera cluster, where one of the nodes is the async slave. Initial state: async replication is up and running, based on GTID, from the MariaDB node to node0. node0 and node2 are part of the same Galera cluster:
First insert on async master:
So far, so good - positions in sync.
Again, so far this "looks" good; the GTID is consistent with the first node.
And now, for some reason, the position on node2, in the async master's notation, has increased by two, while the data is the same on both nodes:
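As a hedged illustration, such a data comparison between the two nodes can be made with, for example (table name is illustrative):

```sql
-- Verify that the rows themselves match on node0 and node2 even though
-- the GTID positions differ (run on each node; table name is illustrative):
CHECKSUM TABLE test.t1;
```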
And the more updates we do inside the cluster, the more inconsistencies are introduced into the GTID sequences.
So it seems that when an async channel is involved in replicating inside a Galera cluster, even for a brief moment, the GTID sequences inside the cluster get quite messy. |
| Comment by Sachin Setiya (Inactive) [ 2017-12-11 ] |
|
Actually, this is also a wrong GTID (0-12-2, 0-100-2), but the patch for MDEV-10715 will fix this issue. |