[MDEV-31437] (Galera + SQL_LOG_BIN=0) = GTID and state corruption Created: 2023-06-08  Updated: 2023-08-17  Resolved: 2023-07-27

Status: Closed
Project: MariaDB Server
Component/s: Galera, Replication
Affects Version/s: 10.5.21
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Gordan Bobic Assignee: Jan Lindström
Resolution: Not a Bug Votes: 3
Labels: None

Issue Links:
Relates
relates to MDEV-9037 DML statements on a Galera Cluster no... Closed
relates to MDEV-20087 Galera + SET SQL_LOG_BIN=0 on binlog ... Closed

 Description   

The behaviour of SQL_LOG_BIN is completely broken on Galera.

Minimal reproducer:
2 Galera nodes, minimal config, one writer (1), one reader (2)
1 async replica (3) replicating from (2).
gtid_strict_mode=ON on all 3

On the writer (1):
set sql_log_bin=0;
create database test2;

Result:
test2 database will be created on all 3 nodes
GTID will be incremented to 0-1-n+1 on nodes 2 (galera) and 3 (async), but NOT on node 1 that wrote the transaction!

So sql_log_bin=0 causes the master writer node to corrupt its own GTID position, by making all other nodes in the topology think they are ahead of it, even though the master writer node is the one that actually generated the transaction.

The transaction ends up written to the binlogs on the secondary Galera node, but is not written locally on the master node that had sql_log_bin=0 set.

Worse, if you then start a new session on the writer without setting sql_log_bin=0, and run another transaction, that transaction will get GTID n+1 with the master's server-id, but it will have GTID n+2 on all other nodes.

So in this case the global transaction ID is no longer a globally unique identifier of a transaction, because the same GTID on the master node and on the secondary nodes now refers to completely different transactions.

This is critically broken behaviour that renders async replication from Galera unsafe.

Expected behaviour:

  • Transaction should replicate to other galera nodes and have the same gtid on all of them, but it should be OMITTED from the binlogs on ALL galera nodes.


 Comments   
Comment by Gordan Bobic [ 2023-06-09 ]

I think what is needed here is GTID tagging of transactions in galera replication channels. A fix could be something like:

  • Pass null GTID for sql_log_bin=0 transactions down Galera
  • If GTID=NULL via Galera, then don't bump GTID on target and don't write transaction to binlog
Comment by Claudio Nanni [ 2023-06-16 ]

> Result:
> test2 database will be created on all 3 nodes
> GTID will be incremented to 0-1-n+1 on nodes 2 (galera) and 3 (async), but NOT on node 1 that wrote the transaction!

Since you disabled binary logging, it is expected that no GTID is generated locally.
Meanwhile, since you did not disable Galera replication, the transaction was sent to the other nodes, and since they have completely independent binary log settings, they generated their own GTIDs.

Actually, running DDL with binary logging disabled usually means taking a voluntary risk; you do that in a replicated environment for local node operations. From what I understand, you would like the Galera cluster to behave as a single unit for local operations with respect to any async replica attached to the cluster. If I understand this:

set sql_log_bin=0;
create database test2;

     sql_log_bin  wsrep_on  local [gtid, seqno]     remote [gtid, seqno]
  A  0            0         -, -                    -, -
  B  0            1         -, +1                   +1, +1
  C  1            0         +1 [local gtid], -      -, -
  D  1            1         +1 [galera gtid], +1    +1 [galera gtid], +1

I think that cases B and C put the cluster at risk, and what are the semantics of those?
B: I want to replicate this transaction but not write it in any binary log [I can understand this use case]
C: I want the local transaction to be binary logged but not sent to other Galera nodes [I can't find a useful case yet]

I think A&D are the most robust.

Please note that in case C the GTID generated locally will have the local node's domain, not Galera's; when wsrep_on=ON, Galera will override the domain-id.
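For reference, cases B and C above correspond to session settings along these lines (the statements and database names are illustrative, not from the original report):

```sql
-- Case B: replicate via Galera, but skip the local binary log
SET SESSION sql_log_bin = 0;   -- wsrep_on remains ON (the default)
CREATE DATABASE case_b_demo;
SET SESSION sql_log_bin = 1;

-- Case C: write to the local binary log, but do not replicate to other nodes
SET SESSION wsrep_on = OFF;
CREATE DATABASE case_c_demo;
SET SESSION wsrep_on = ON;
```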

Your case is B: you want the transaction to replicate, but you don't want to log it in the local binary log; moreover, you assume that the remote Galera nodes will apply the session setting of your local Galera node and so not binary log it either.
Given that Galera and async replication are two different replication subsystems, that Galera also works without binary logging enabled, and that this assumes a local session variable is honoured remotely, IMHO the current behaviour is the one I'd expect, pending a feature request for a change (which this seems to be).

> So sql_log_bin=0 causes the master writer node to corrupt its own GTID position, by making all other nodes in the topology think they are ahead of it, even though the master writer node is the one that actually generated the transaction.

sql_log_bin=0 doesn't corrupt any GTID; with it, you explicitly asked the server not to generate a local binary log transaction with the corresponding GTID.

> Transaction ends up written into binlogs on the secondary galera node, but is not written locally on the master node that had sql_log_bin=0 set.

As expected.

> Worse, if you then start a new session on the writer without setting sql_log_bin=0, and run another transaction, that transaction will get GTID n+1 with the master's server-id, but it will have GTID n+2 on all other nodes.

SET sql_log_bin=0 is the implicit form of SET SESSION sql_log_bin=0, and so it is expected to be valid only for the session where it was set; any other session must ignore another session's status.

> So in this case the global transaction ID is no longer a globally unique identifier of a transaction, because the same GTID on the master node and on the secondary nodes now refers to completely different transactions.

There are many ways you can de-align GTIDs on purpose, and this is one of those cases.
Note that Galera replication will keep working as expected.

> Expected behaviour:
> Transaction should replicate to other galera nodes and have the same gtid on all of them, but it should be OMITTED from the binlogs on ALL galera nodes.

As said above, in my opinion this is more of a feature request: you ask that every time you disable binary logging in the local session on a node, it is also disabled on all nodes for the replicated transactions.
It makes sense, but maybe there are other reasons not to do so, so better think it through thoroughly first!
I'd love to hear others' opinions.

Comment by Gordan Bobic [ 2023-06-16 ]

Whether we call this a bug or a feature, it is in all cases undesirable to have nodes generate different GTIDs for the same transaction, and different transactions under the same GTID.
Otherwise they are TIDs rather than GTIDs: GTIDs are supposed to be unique in any replication tree that Galera is part of.

I struggle to construct an argument from the user perspective where this could be called a "feature" rather than a "bug". As a "feature", it would have to provide some functionality that would be useful under conceivable circumstances, and this doesn't.

Comment by Justin Bennett [ 2023-06-27 ]

Agree with Gordan's most recent comment: under the current architecture there is no "Global" in GTID. Because of this, an asynchronous replica is tied to a single cluster node rather than being able to move between cluster nodes.

Comment by Rick Pizzi [ 2023-06-27 ]

It's a known fact that SQL_LOG_BIN=0 is ignored by WSREP replication. It has always been that way.

Justin Bennett: it is possible to repoint an async replica of a Galera node by taking the transaction Xid from the binlog of node A and searching for it in the binlog of node B.
Here's a small script I wrote a few years ago that does just that: https://github.com/RickPizzi/pztools/blob/master/magic_change_master.sh

Comment by Gordan Bobic [ 2023-06-27 ]

@Rick, just because it has been broken by design since forever doesn't make it not broken; it just makes it an old bug that has been breaking systems for a long time.

Comment by Jan Lindström [ 2023-07-26 ]

gbobic Did you use the wsrep_gtid_mode and wsrep_gtid_domain_id configuration variables? Please refer to the documentation at https://mariadb.com/kb/en/using-mariadb-gtids-with-mariadb-galera-cluster/. It will not work if the nodes are not configured correctly.

Because you have set sql_log_bin=0 (see the documentation at https://mariadb.com/kb/en/replication-and-binary-log-system-variables/#sql_log_bin), the operation is not written to the binlog. For this reason gtid_binlog_pos also shows the old position (it is based on the latest GTID position in the binlog), but all events from this node have increasing GTIDs as seen from node_2.

connection node_2;
show variables like 'gtid_binlog_pos';
Variable_name	Value
gtid_binlog_pos	5-1-5
SHOW BINLOG EVENTS IN 'mysqld-bin.000003';
Log_name	Pos	Event_type	Server_id	End_log_pos	Info
mysqld-bin.000003	4	Format_desc	2	256	Server ver: 10.5.22-MariaDB-debug-log, Binlog ver: 4
mysqld-bin.000003	256	Gtid_list	2	285	[]
mysqld-bin.000003	285	Binlog_checkpoint	2	329	mysqld-bin.000003
mysqld-bin.000003	329	Gtid	1	371	GTID 5-1-1
mysqld-bin.000003	371	Query	1	495	use `test`; CREATE TABLE t2(a int not null primary key) engine=innodb
mysqld-bin.000003	495	Gtid	1	537	BEGIN GTID 5-1-2
mysqld-bin.000003	537	Annotate_rows	1	585	INSERT INTO t2 values (1)
mysqld-bin.000003	585	Table_map	1	630	table_id: 38 (test.t2)
mysqld-bin.000003	630	Write_rows_v1	1	668	table_id: 38 flags: STMT_END_F
mysqld-bin.000003	668	Xid	1	699	COMMIT /* xid=4 */
mysqld-bin.000003	699	Gtid	1	741	GTID 5-1-3
mysqld-bin.000003	741	Query	1	830	create database test2
mysqld-bin.000003	830	Gtid	1	872	GTID 5-1-4
mysqld-bin.000003	872	Query	1	997	use `test2`; create table t1(a int not null primary key) engine=innodb
mysqld-bin.000003	997	Gtid	1	1039	BEGIN GTID 5-1-5
mysqld-bin.000003	1039	Annotate_rows	1	1119	insert into t1 values (1),(2),(3),(4),(5),(6),(7),(8),(9)
mysqld-bin.000003	1119	Table_map	1	1165	table_id: 42 (test2.t1)
mysqld-bin.000003	1165	Write_rows_v1	1	1243	table_id: 42 flags: STMT_END_F
mysqld-bin.000003	1243	Xid	1	1274	COMMIT /* xid=7 */

Note that the sql_log_bin setting is session-local only, so it will not have any effect on any other node in the cluster or on an async slave.

Comment by Gordan Bobic [ 2023-07-26 ]

> Note that the sql_log_bin setting is session-local only, so it will not have any effect on any other node in the cluster or on an async slave.

And this is the core of the matter. It means that GTID async replication out of a Galera cluster is dangerous, because migrating a slave to a different Galera node will result in things replicating that shouldn't replicate.
At the very least, sql_log_bin should be hard-enabled and not user-adjustable while Galera is active, to prevent errant GTIDs.
Ideally, Galera should be made aware that transactions executed while sql_log_bin=0 on the master should NOT be binlogged on any other node in the cluster either.

Comment by Jan Lindström [ 2023-07-27 ]

gbobic I'm reluctant to change the current behavior because I do not know of real use cases for sql_log_bin=0 on a cluster. If you really need to run DDL on one node only in a cluster, I suggest using wsrep_osu_method="RSU" instead.

Comment by Jan Lindström [ 2023-07-27 ]

The feature currently works as designed. We could restrict setting sql_log_bin=0 on a cluster, but it should be done in a user-controlled way to avoid breaking user software and tools. One way to do so in 10.6+ is to introduce a new wsrep_mode = WSREP_ENFORCE_SQL_BIN_LOG where we would issue an error if the user tries to set sql_log_bin=0. However, I currently see this as a special case, and the current documentation is quite clear. Therefore, I will not open a new TODO to introduce the new option and will close this MDEV as not a bug.

Comment by Gordan Bobic [ 2023-07-27 ]

So sql_log_bin=0 resulting in binlogs being different across cluster nodes is "by design" rather than an anomalous side effect?
Just because a bug is documented doesn't mean it is not a bug.

One obvious use case is managing data retention and keeping more data downstream on an archiving replica. For example, you might want to keep 3 monthly partitions in the cluster but keep 12 monthly partitions on the replica. But you can't do that, because if you omit the partition management statements from the binlog, they will still be in the binlogs of the other nodes. If your primary node then fails and you move the replica to another Galera node, it will re-read the binlogs and execute the partition dropping that it shouldn't.

I can think of many use cases where the current, inconsistent behaviour will break things, and of zero cases where it is a useful side effect. Can you think of a genuine use case where the current behaviour is actually desirable?

Comment by Rick Pizzi [ 2023-07-27 ]

gbobic if the node the replica is replicating from fails and you move the replica to another node, it will NOT re-read binlogs, as it will start from where it was at the time of the crash; so your use case is covered.

Comment by Gordan Bobic [ 2023-07-27 ]

Unless that replica was offline for some reason, or configured with lagged replication and a limited max_relay_log_size on a busy cluster.
There are obviously use cases where the binlog anomaly we are discussing here will cause problems.

The reason I even discovered this anomalous behaviour is that I observed GTID sequence desync when moving async replicas between Galera nodes.

Comment by Rick Pizzi [ 2023-07-27 ]

SET commands are not replicated across cluster nodes; this is by design.
A possible trick to avoid dropping the partitions on the downstream replica would be to use SET SESSION server_id = <id of replica> before issuing the DDL.
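A sketch of that trick (server_id values and the partition name are illustrative; it relies on the replica discarding events that carry its own server_id, the default behaviour when replicate-same-server-id=0):

```sql
-- On the Galera node the replica reads from; setting server_id needs SUPER.
SET SESSION server_id = 3;   -- 3 = server_id of the downstream replica
ALTER TABLE data_table DROP PARTITION p20220412;  -- replica skips this event
SET SESSION server_id = 2;   -- restore this node's own server_id
```

The event is still written to the binlog; it is only skipped by a replica whose server_id matches.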

Comment by Gordan Bobic [ 2023-07-27 ]

That won't work with multiple replicas.

NDB Cluster, for example, does this correctly; last time I checked, it ensures binlogs are identical on all SQL nodes.

Comment by Jan Lindström [ 2023-08-01 ]

valerii Can you first explain to me why a user would set sql_log_bin=0 in a cluster, i.e. what is the use case (remembering that most DDL can be executed using RSU)? I could disable it on a cluster (easy), but as I do not know how widely it is used, I'm being more careful this time. I do not yet know of a way to replicate that local setting. How is this feature used in master-slave replication? I know that in master-slave replication the DDL is not replicated, but if I'm correct the actual GTID event is.

mysql> set sql_log_bin=0;
--------------
set sql_log_bin=0
--------------
 
Query OK, 0 rows affected (0,00 sec)
 
mysql> show variables like 'gtid_binlog_pos';
--------------
show variables like 'gtid_binlog_pos'
--------------
 
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| gtid_binlog_pos |       |
+-----------------+-------+
1 row in set (0,01 sec)
 
mysql> create database test2;
--------------
create database test2
--------------
 
Query OK, 1 row affected (0,02 sec)
 
mysql> show variables like 'gtid_binlog_pos';
--------------
show variables like 'gtid_binlog_pos'
--------------
 
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| gtid_binlog_pos |       |
+-----------------+-------+
1 row in set (0,00 sec)
 
mysql> set sql_log_bin=1;
--------------
set sql_log_bin=1
--------------
 
Query OK, 0 rows affected (0,00 sec)
 
mysql> create table test2.t2(a int not null primary key) engine=innodb;
--------------
create table test2.t2(a int not null primary key) engine=innodb
--------------
 
Query OK, 0 rows affected (0,03 sec)
 
mysql> show variables like 'gtid_binlog_pos';
--------------
show variables like 'gtid_binlog_pos'
--------------
 
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| gtid_binlog_pos | 5-1-2 |
+-----------------+-------+
1 row in set (0,00 sec)

As we can see, while sql_log_bin=0 we do not see the GTID increment, but as soon as the setting is restored the GTID is incremented and there are no holes, i.e. both GTID events 5-1-1 (create database) and 5-1-2 (create table) are used. Furthermore, as we can see on the second Galera node, the actual GTIDs and events are replicated, and the GTIDs are exactly as they should be, i.e. from the master node and increasing (no holes etc.) and exactly the same as on the master.

#230801 10:52:03 server id 2  end_log_pos 285 CRC32 0xe662a374 	Gtid list []
# at 285
#230801 10:52:03 server id 2  end_log_pos 329 CRC32 0x60015052 	Binlog checkpoint mysqld-bin.000003
# at 329
#230801 10:53:08 server id 1  end_log_pos 371 CRC32 0xad7f552b 	GTID 5-1-1 ddl
/*!100101 SET @@session.skip_parallel_replication=0*//*!*/;
/*!100001 SET @@session.gtid_domain_id=5*//*!*/;
/*!100001 SET @@session.server_id=1*//*!*/;
/*!100001 SET @@session.gtid_seq_no=1*//*!*/;
# at 371
#230801 10:53:08 server id 1  end_log_pos 460 CRC32 0xeaeff605 	Query	thread_id=11	exec_time=0	error_code=0
SET TIMESTAMP=1690876388/*!*/;
SET @@session.pseudo_thread_id=11/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1, @@session.check_constraint_checks=1, @@session.sql_if_exists=0, @@session.explicit_defaults_for_timestamp=0/*!*/;
SET @@session.sql_mode=1411383296/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C latin1 *//*!*/;
SET @@session.character_set_client=latin1,@@session.collation_connection=8,@@session.collation_server=8/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
create database test2
/*!*/;
# at 460
#230801 10:54:09 server id 1  end_log_pos 502 CRC32 0x67f448d5 	GTID 5-1-2 ddl
/*!100001 SET @@session.gtid_seq_no=2*//*!*/;
# at 502
#230801 10:54:09 server id 1  end_log_pos 628 CRC32 0x8e7e5bc0 	Query	thread_id=11	exec_time=0	error_code=0
SET TIMESTAMP=1690876449/*!*/;
create table test2.t2(a int not null primary key) engine=innodb
/*!*/;
DELIMITER ;
# End of log file

Comment by Valerii Kravchuk [ 2023-08-01 ]

As far as I understand, the idea is to run some statement on a node that is NOT replicated to the async slave; this is what sql_log_bin=0 is typically used for. It can be running mysql_upgrade, fixing some data inconsistency, some FLUSH command, and yes, maybe some DDL that is not filtered out by the current replication filters.

Bottom line is that "Galera cluster" as a whole should act as a master for async replication the same way, as much as possible, as a simple non-Galera instance.

Can you reproduce the problem with CREATE DATABASE as suggested here? I can imagine creating some local, "temporary" database for tools like Percona Toolkit, or for some scripts some Support engineers prefer to use. If it breaks GTID consistency among Galera nodes when somebody tried to take care not to write anything into the binary log, that may come as a surprise to a poor customer later...

That's what I have off the top of my head; I have not checked the other related bugs or all the customer reasoning in the recent issue.

Comment by Gordan Bobic [ 2023-08-01 ]

Here is an example. Say I have a large table with limited data retention of, say, 3 months. I want to handle data retention management using monthly partitions, so that I can just create a new partition each month and drop the expired partition instead of running expensive deletes that bog down the whole cluster.
But I have a need to keep this data on the downstream analytics server for 12 months.
The way I would normally do this is with an event that does the partition-discard part after 3 months with sql_log_bin=0, and a separate partition-discard event running independently on the analytics replica with discard after 12 months.
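A rough sketch of the kind of pruning event described above (table name, partition name, and schedule are illustrative, and the partition-rotation logic is elided; this also assumes the event definer has the privilege to set sql_log_bin):

```sql
-- On the cluster: monthly pruning kept out of this node's local binlog.
DELIMITER //
CREATE EVENT prune_metrics_partitions
  ON SCHEDULE EVERY 1 MONTH
DO
BEGIN
  SET SESSION sql_log_bin = 0;                   -- keep the DDL out of the binlog
  ALTER TABLE metrics DROP PARTITION p_expired;  -- illustrative partition name
END //
DELIMITER ;
```

On the analytics replica, an independent event with a 12-month window would do the equivalent pruning there.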

That means I have to run the analytics replica off the Galera writer node so that the upstream pruning does not propagate. That in itself is undesirable enough, but if the analytics server is down for maintenance and the Galera writer fails, I cannot just hang the analytics server off another node, because there is a risk that the full prune to 3 months' retention will propagate to the analytics replica and thus discard the 9 additional months of data that I need to keep.

The same example works exactly the same way if I were to use DELETEs for pruning that I don't want to propagate downstream; I still don't want the other 9 months of data to be deleted downstream when I specifically set sql_log_bin=0 for the session that does the deleting.

Inconsistent behaviour across Galera nodes results in a dangerous edge case.

Comment by Jan Lindström [ 2023-08-01 ]

Here is how I would drop and add a partition on a Galera cluster. Please remember that in a Galera cluster the database state needs to be exactly the same in normal operation. You can temporarily break this consistency if you do it carefully and know what you are doing.

CREATE TABLE data_table (tr_date DATE)
    ENGINE=INNODB
    PARTITION BY RANGE( TO_DAYS(tr_date) ) (
    PARTITION p20220401 VALUES LESS THAN (TO_DAYS('2022-02-01')),
    PARTITION p20220402 VALUES LESS THAN (TO_DAYS('2022-03-01')),
    PARTITION p20220403 VALUES LESS THAN (TO_DAYS('2022-04-01')),
    PARTITION p20220404 VALUES LESS THAN (TO_DAYS('2022-05-01')),
    PARTITION p20220405 VALUES LESS THAN (TO_DAYS('2022-06-01')),
    PARTITION p20220406 VALUES LESS THAN (TO_DAYS('2022-07-01')),
    PARTITION p20220407 VALUES LESS THAN (TO_DAYS('2022-08-01')),
    PARTITION p20220408 VALUES LESS THAN (TO_DAYS('2022-09-01')),
    PARTITION p20220409 VALUES LESS THAN (TO_DAYS('2022-10-01')),
    PARTITION p20220410 VALUES LESS THAN (TO_DAYS('2022-11-01')),
    PARTITION p20220411 VALUES LESS THAN (TO_DAYS('2022-12-01')),
    PARTITION p20220412 VALUES LESS THAN (TO_DAYS('2023-01-01')),
    PARTITION future VALUES LESS THAN MAXVALUE);
 
INSERT INTO data_table VALUES ('2022-01-01'),('2022-02-01'),
('2022-03-01'),('2022-04-01'),('2022-05-01'),('2022-06-01'),
('2022-07-01'),('2022-08-01'),
('2022-09-01'),('2022-10-01'),('2022-11-01'),('2022-12-01');
 
SET GLOBAL WSREP_OSU_METHOD='RSU';
ALTER TABLE data_table DROP PARTITION p20220412;
SHOW CREATE TABLE data_table;
ALTER TABLE data_table
REORGANIZE PARTITION future INTO (
        PARTITION p20220512 VALUES LESS THAN (TO_DAYS('2023-01-01')),
        PARTITION future  VALUES LESS THAN MAXVALUE);  
SHOW CREATE TABLE data_table;

In my example, neither the DROP PARTITION nor the REORGANIZE replicates to the other nodes. But to keep the Galera cluster databases consistent, you need to run them on all nodes.

Comment by Gordan Bobic [ 2023-08-01 ]

Except that isn't what is desired. We want all Galera nodes to remain identical and consistent with each other, and we want all async-replicated nodes to remain the same and honour sql_log_bin=0 regardless of which Galera node they are replicating from.

It is very difficult to see a situation where the state of a downstream async node depends purely on which Galera cluster node it replicates from as anything other than a bug.

Comment by Claudio Nanni [ 2023-08-01 ]

"Except that isn't what is desired. We want all Galera nodes to remain the same as each other and consistent"
RSU is part of normal Galera cluster operations, and it is actually usually recommended whenever possible (when the schema difference stays compatible for the duration of the rolling change).
Jan's solution (RSU) works for your use case.
You want to do something only on the Galera cluster, and as long as it works for the Galera cluster, it shouldn't matter to your async replica, which will not receive the change.

Generated at Thu Feb 08 10:23:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.