[MDEV-31437] (Galera + SQL_LOG_BIN=0) = GTID and state corruption Created: 2023-06-08 Updated: 2023-08-17 Resolved: 2023-07-27 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera, Replication |
| Affects Version/s: | 10.5.21 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Critical |
| Reporter: | Gordan Bobic | Assignee: | Jan Lindström |
| Resolution: | Not a Bug | Votes: | 3 |
| Labels: | None | ||
| Issue Links: |
|
| Description |
|
The behaviour of SQL_LOG_BIN is completely broken on Galera. Minimal reproducer: On the writer (1): Result: So sql_log_bin=0 causes the master writer node to corrupt its own GTID position, by making all other nodes in the topology consider themselves ahead of it, even though it is the master writer node that actually generated the transaction. The transaction ends up written to the binlogs on the secondary Galera nodes, but is not written locally on the master node where sql_log_bin=0 was set. Worse, if you then start a new session on the writer without setting sql_log_bin=0 and run another transaction, that transaction will get GTID n+1 with the master's server-id locally, but will appear as GTID n+2 on all other nodes. So the global transaction ID is no longer a globally unique identifier of a transaction: the same GTID on the master node and on the secondary nodes now refers to completely different transactions. This is critically dangerous, broken behaviour that renders async replication from Galera unsafe. Expected behaviour:
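The reproducer steps were not preserved in this export; a minimal sketch of the scenario described above might look like the following (database and table names are hypothetical):

```sql
-- On the Galera writer node (node_1), in a fresh session:
SET SESSION sql_log_bin = 0;       -- disable binary logging for this session only
INSERT INTO test.t1 VALUES (1);    -- hypothetical transaction

-- node_1: the write is NOT binlogged locally, so gtid_binlog_pos does not advance.
SELECT @@gtid_binlog_pos;

-- node_2 (and every other node): the replicated write IS binlogged there,
-- so gtid_binlog_pos advances and diverges from the writer's position.
SELECT @@gtid_binlog_pos;
```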
|
| Comments |
| Comment by Gordan Bobic [ 2023-06-09 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I think what is needed here is GTID tagging of transactions in Galera replication channels. A fix could be something like:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Claudio Nanni [ 2023-06-16 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
> Result: Since you disabled binary logging, it is expected that no GTID is generated locally. Running a DDL with the binary log disabled usually means taking a voluntary risk; you do that in a replicated environment for local node operations. From what I understand, you would like the Galera cluster to behave as a single unit with respect to local operations, as seen by any other async replica attached to the cluster. If I understand this: set sql_log_bin=0;
I think that cases B and C put the cluster at risk, and what is the semantic of those? I think A and D are the most robust. Please note that in case C the GTID generated locally will have the local node's domain, not Galera's; when wsrep_on=ON, Galera will override the domain-id. Your case is B: you want the transaction to replicate, but you don't want to log it in the local binary log; moreover, you assume that remote Galera nodes will honour the session setting of your local Galera node and so not binary-log it either. > So sql_log_bin=0 renders the master writer node to corrupt its own gtid position, by making all other nodes in the topology think themselves to be ahead of it even though it is the master writer node that actually generated the transaction. sql_log_bin=0 doesn't corrupt any GTID; with it you explicitly asked the node not to generate a local binary log entry with the corresponding GTID. > Transaction ends up written into binlogs on the secondary galera node, but is not written locally on the master node that had sql_log_bin=0 set. As expected. > Worse, if you then start a new session on the writer without disabling sql_log_bin=0, and run another transaction, that transaction will then get the n+1 gtid with master's server-id, but it will have gtid on all other nodes of n+2 gtid. SET sql_log_bin=0 is the implicit form of SET SESSION sql_log_bin=0, so it is expected to be valid only for the session where it was set; any other session must ignore another session's state. > So in this case global-transaction-id is no longer a globally unique identifier of a transaction id. Because the same gtid on the master node and the secondary nodes now refer to completely different transactions. There are many ways you can de-align GTIDs on purpose, and this is one of those cases.
> Expected behaviour: As said above, in my opinion this is more a feature request: you ask that every time you disable binary logging in the local session on a node, it be disabled on all nodes for the replicated transactions. |
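The session-local scope described above can be illustrated with a short sketch (the SET SESSION semantics are documented behaviour; the surrounding statements are hypothetical):

```sql
-- Session 1 on the writer:
SET SESSION sql_log_bin = 0;     -- affects only this session
INSERT INTO test.t1 VALUES (1);  -- not written to the local binlog

-- Session 2 on the SAME writer (sql_log_bin is still 1 here):
INSERT INTO test.t1 VALUES (2);  -- written to the local binlog as usual
```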
| Comment by Gordan Bobic [ 2023-06-16 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Whether we call this a bug or a feature, it is in all cases undesirable to have nodes generate different GTIDs for the same transaction, and have different transactions under the same GTID. I struggle to construct an argument from the user perspective where this could be called a "feature" rather than a "bug". As a "feature", it would have to provide some functionality that would under conceivable circumstances be useful, and this doesn't. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Justin Bennett [ 2023-06-27 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Agree with Gordan's most recent comment: under the current architecture there is no "Global" in GTID. Because of this, an asynchronous replica is tied to a cluster node rather than being able to move between cluster nodes. |
| Comment by Rick Pizzi [ 2023-06-27 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
It's a known fact that SQL_LOG_BIN=0 is ignored by wsrep replication. It has always been that way. Justin Bennett, it is possible to repoint an async replica of a Galera node by taking the transaction Xid from the binlog of node A and searching for it in the binlog of node B. |
| Comment by Gordan Bobic [ 2023-06-27 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
@Rick, just because it is broken by design since forever doesn't make it not broken, it just makes it an old bug that has been breaking systems for a long time. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström [ 2023-07-26 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
gbobic Did you use the wsrep_gtid_mode and wsrep_gtid_domain_id configuration variables? Please refer to the documentation at https://mariadb.com/kb/en/using-mariadb-gtids-with-mariadb-galera-cluster/. It will not work if the nodes are not configured correctly. Because you have set sql_log_bin=0 (see the documentation at https://mariadb.com/kb/en/replication-and-binary-log-system-variables/#sql_log_bin), the operation is not written to the binlog. For this reason gtid_binlog_pos also shows the old position (it is based on the latest GTID position in the binlog), but all events from this node have increasing GTIDs as seen from node_2.
Note that the sql_log_bin setting is session-local only, so it will not have any effect on any other node in the cluster or on an async slave. |
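The configuration Jan refers to is described in the linked KB page; a typical my.cnf fragment might look roughly like this (the domain-id values are examples, not prescriptions):

```ini
# On every node of the cluster; wsrep_gtid_domain_id must be the same cluster-wide:
[mysqld]
log_bin              = ON
log_slave_updates    = ON
wsrep_gtid_mode      = ON
wsrep_gtid_domain_id = 9999   # example: shared domain for Galera-replicated writes
gtid_domain_id       = 1      # example: unique per node, for local non-Galera events
```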
| Comment by Gordan Bobic [ 2023-07-26 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
> Note that the sql_log_bin setting is session-local only, so it will not have any effect on any other node in the cluster or on an async slave. And this is the core of the matter. This means that GTID-based async replication out of a Galera cluster is dangerous, because migrating a slave to a different Galera node will result in statements replicating that should not replicate. |
| Comment by Jan Lindström [ 2023-07-27 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
gbobic I'm reluctant to change the current behaviour because I do not know of real use cases for sql_log_bin=0 on a cluster. If you really need to run DDL on one node only in a cluster, I suggest using wsrep_osu_method="RSU" instead. |
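For context, the RSU (Rolling Schema Upgrade) approach Jan suggests applies a DDL locally while the node is desynced from the cluster; a hedged sketch, to be repeated on each node in turn (the table name is hypothetical):

```sql
-- Run on EACH node in turn; with RSU the DDL is applied locally only
-- and is not replicated through Galera.
SET SESSION wsrep_OSU_method = 'RSU';
ALTER TABLE test.t1 ADD COLUMN c2 INT;
SET SESSION wsrep_OSU_method = 'TOI';  -- restore the default (Total Order Isolation)
```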
| Comment by Jan Lindström [ 2023-07-27 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The feature currently works as designed. We could restrict setting sql_log_bin=0 on a cluster, but it should be done in a user-controlled way to avoid breaking user software and tools. One way to do so in 10.6+ would be to introduce a new wsrep_mode = WSREP_ENFORCE_SQL_BIN_LOG, where we would issue an error if the user tries to set sql_log_bin=0. However, I currently see this as a special case, and the current documentation is quite clear. Therefore, I will not open a new TODO to introduce the new option, and will close this MDEV as not a bug. |
| Comment by Gordan Bobic [ 2023-07-27 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
So sql_log_bin=0 resulting in binlogs being different across cluster nodes is "by design" rather than an anomalous side effect? One obvious use case is managing data retention and keeping more data downstream on an archiving replica. For example, you might want to keep 3 monthly partitions in the cluster but 12 monthly partitions on the replica. But you can't do that, because even if you omit the partition management statements from the binlog, they will still be in the binlog on the other nodes. If your primary node then fails and you move the replica to another Galera node, it will re-read the binlogs and execute the partition dropping that it shouldn't. I can think of many use cases where the current, inconsistent behaviour will break things. I can think of 0 cases where the current inconsistent behaviour is a useful side effect. Can you think of a genuine use case where the current behaviour is actually desirable? |
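The retention scenario above could be sketched as follows (table, partition names, and schedule are hypothetical); the trap is that step 2 is only excluded from the binlog of the node it runs on:

```sql
-- 1. Monthly, add a new partition (replicated and binlogged everywhere):
ALTER TABLE metrics ADD PARTITION
  (PARTITION p202309 VALUES LESS THAN (202310));

-- 2. Drop the expired partition, intending to keep it on the async replica:
SET SESSION sql_log_bin = 0;
ALTER TABLE metrics DROP PARTITION p202306;
-- The DROP still replicates through Galera and lands in the OTHER nodes'
-- binlogs, so a replica repointed to another node may execute it anyway.
```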
| Comment by Rick Pizzi [ 2023-07-27 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
gbobic, if the node the replica is replicating from fails and you move the replica to another node, it will NOT re-read the binlogs, as it will start from where it was at the time of the crash; so your use case is covered. |
| Comment by Gordan Bobic [ 2023-07-27 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Unless that replica was offline for some reason. Or configured with lagged replication with limited max_relay_log_size on a busy cluster. The reason I even discovered this anomalous behaviour is because I observed anomalies with GTID sequence desync when moving async replicas between Galera nodes. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Rick Pizzi [ 2023-07-27 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
SET commands are not replicated across cluster nodes, this is by design. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Gordan Bobic [ 2023-07-27 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
That won't work with multiple replicas. NDB Cluster, for example, does this correctly; last time I checked, it ensures binlogs are identical on all SQL nodes. |
| Comment by Jan Lindström [ 2023-08-01 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
valerii Can you first explain to me why a user would set sql_log_bin=0 in a cluster, i.e. what is the use case (remembering that most DDL can be executed using RSU)? I could disable it on a cluster (easy), but as I do not know how widely this is used, I'm being more careful this time. I do not yet know a way to replicate that local setting. How is this feature used in master-slave replication? I know that in master-slave replication the DDL is not replicated, but if I'm correct the actual GTID event is.
As we can see, while sql_log_bin=0 is set we do not see the GTID increment, but as soon as the setting is restored the GTID is incremented and there are no holes, i.e. both GTID events 5-1-1 (create database) and 5-1-2 (create table) are used. Furthermore, as we can see from the second Galera node, the actual GTIDs and events are replicated and the GTIDs are exactly as they should be, i.e. from the master node and increasing (no holes etc.) and exactly the same as on the master.
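Jan's test output is not preserved in this export; a sketch of the sequence he describes (the GTIDs 5-1-1 and 5-1-2 are as quoted above, the object names are hypothetical) might be:

```sql
-- On node_1, with wsrep_gtid_mode=ON and wsrep_gtid_domain_id=5 (example):
SET SESSION sql_log_bin = 0;
-- statements here do not advance gtid_binlog_pos on this node
SET SESSION sql_log_bin = 1;
CREATE DATABASE d1;           -- e.g. GTID 5-1-1, as in the quoted output
CREATE TABLE d1.t1 (a INT);   -- e.g. GTID 5-1-2
SELECT @@gtid_binlog_pos;     -- no holes in the sequence
```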
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Valerii Kravchuk [ 2023-08-01 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
As far as I understand, the idea is to run some statement on a node that is NOT replicated to the async slave; this is what sql_log_bin=0 is typically used for. It can be running mysql_upgrade, fixing some data inconsistency, some FLUSH command, and yes, maybe some DDL that is not filtered out by the current replication filters. The bottom line is that a "Galera cluster" as a whole should act as a master for async replication the same way, as much as possible, as a simple non-Galera instance. Can you reproduce the problem with CREATE DATABASE as suggested here? I can imagine creating some local, "temporary" database for tools like Percona Toolkit or some scripts that Support engineers prefer to use. If it breaks GTID consistency among Galera nodes when somebody took care not to write anything into the binary log, then it may be a surprise to a poor customer later... That is what I have off the top of my head; I have not checked the other related bugs or all the customer reasoning in the recent issue. |
| Comment by Gordan Bobic [ 2023-08-01 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Here is an example. Say I have a large table with limited data retention of, say, 3 months. I want to handle data retention using monthly partitions, so that I can just create a new partition each month and drop the expired partition, instead of running expensive deletes that bog down the whole cluster. That means I have to run the analytics replica off the Galera writer node to keep the upstream pruning from propagating. That in itself is undesirable enough, but if the analytics server is down for maintenance and the Galera writer fails, I cannot just hang the analytics server off another node, because there is a risk that the full prune to 3 months' retention will propagate to the analytics replica and thus discard the 9 additional months of data that I need to keep. The example works exactly the same if I were to use DELETEs for pruning that I don't want to propagate downstream: I still don't want the other 9 months of data to be deleted downstream when I specifically set sql_log_bin=0 for the session that does the deleting. Inconsistent behaviour across Galera nodes results in a dangerous edge case. |
| Comment by Jan Lindström [ 2023-08-01 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Here is how I would drop and add a partition on a Galera cluster. Please remember that in a Galera cluster the database state needs to be exactly the same in normal operation. You can temporarily break this consistency if you do it carefully and know what you are doing.
In my example, neither the drop partition nor the reorganize replicates to other nodes. But to keep the Galera cluster databases consistent, you need to do these on all nodes. |
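The commands from Jan's example are not preserved here; the per-node approach he describes would look roughly like this, run on every node (table and partition names are hypothetical):

```sql
-- Repeat on EACH node to keep the cluster state consistent:
SET SESSION wsrep_on = OFF;   -- do not replicate the following DDL through Galera
SET SESSION sql_log_bin = 0;  -- and do not write it to the local binlog
ALTER TABLE metrics DROP PARTITION p202306;
ALTER TABLE metrics REORGANIZE PARTITION pmax INTO
  (PARTITION p202309 VALUES LESS THAN (202310),
   PARTITION pmax VALUES LESS THAN MAXVALUE);
SET SESSION sql_log_bin = 1;
SET SESSION wsrep_on = ON;
```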
| Comment by Gordan Bobic [ 2023-08-01 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Except that isn't what is desired. We want all Galera nodes to remain identical to each other and consistent, and we want all async-replicated nodes to remain the same and honour sql_log_bin=0 regardless of which Galera node they are replicating from. It is very difficult to see the downstream async node's state depending purely on which Galera cluster node it replicates from as anything other than a bug. |
| Comment by Claudio Nanni [ 2023-08-01 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
"Except that isn't what is desired. We want all Galera nodes to remain the same as each other and consistent" |