[MDEV-14153] Implicitly dropped temporary tables cause GTID drift in Galera cluster Created: 2017-10-26  Updated: 2021-12-14  Resolved: 2019-12-12

Status: Closed
Project: MariaDB Server
Component/s: Data Definition - Temporary, Galera
Affects Version/s: 10.2.9, 10.2.12
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Neil Skrypuch Assignee: Jan Lindström (Inactive)
Resolution: Duplicate Votes: 5
Labels: None
Environment:

CentOS 7.4


Issue Links:
Relates
relates to MDEV-20720 Galera: Replicate MariaDB GTID to oth... Closed

 Description   

I have a Galera cluster running with the usual required configuration as well as the following:

wsrep_gtid_mode = 1
log_slave_updates = 1

If I bootstrap the cluster with (rsync) SST, I can see that the initial GTID state is copied over from the donor, which works as expected. In general, the GTID state will stay synced across all masters as they execute queries, but it is possible to cause drift in normal use cases. For example, if we execute the following:
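The drift is easy to observe directly by comparing the GTID position on each node. A minimal check in the same style as the commands below, assuming the nodes are reachable as node1..node3 (hypothetical hostnames):

for h in node1 node2 node3; do
    echo -n "$h: "
    echo "SHOW VARIABLES LIKE 'gtid_binlog_pos';" | mysql -u root -h "$h" -N | awk '{print $2}'
done

On a healthy cluster every node should print the same position; after triggering the bug below, the node that executed the statement will be ahead of the others.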

echo 'CREATE TEMPORARY TABLE `t5` (`c1` INT, PRIMARY KEY(`c1`)) ENGINE=InnoDB;' | mysql -u root test
echo "SHOW VARIABLES LIKE 'gtid_binlog_pos';" | mysql -u root

On the master that executes these statements, the GTID increments once per execution. This happens because a "DROP TEMPORARY TABLE IF EXISTS `t5`" is written to the binary log when the connection is closed. However, that binary log entry exists only on that master; no other master in the Galera cluster ever sees it, so the GTID values drift apart by one every time this happens.

One very interesting observation is that if we explicitly drop the temporary table, then nothing ends up in the binlog. Indeed, you can run this all day and the GTID value won't budge:

echo 'CREATE TEMPORARY TABLE `t5` (`c1` INT, PRIMARY KEY(`c1`)) ENGINE=InnoDB; DROP TEMPORARY TABLE `t5`;' | mysql -u root test
echo "SHOW VARIABLES LIKE 'gtid_binlog_pos';" | mysql -u root

I'm not sure whether it's correct behaviour for the aforementioned DROP statement to magically appear in the binlog on an implicit table drop, but it's definitely incorrect for that statement to exist only in the original master's binlog without being replicated to the other Galera masters, as this causes GTID drift.

This GTID drift doesn't break the Galera cluster on its own, but it makes it impractical to run a traditional replication slave off the Galera cluster, as the slave's replication will break when it tries to switch to another master.
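For context, such an async slave would typically be attached with GTID-based replication; a sketch, with a hypothetical host name:

CHANGE MASTER TO MASTER_HOST='galera-node1', MASTER_USE_GTID=slave_pos;
START SLAVE;

When this slave is later repointed at another Galera node whose gtid_binlog_pos has drifted, the slave requests a GTID position that does not correspond to the same events on the new master, and replication stops with an error.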



 Comments   
Comment by Neil Skrypuch [ 2018-01-30 ]

I was hoping to see this fixed now that #10715 is resolved in 10.2.12, but after testing I can confirm that this is still an issue in 10.2.12 as originally described.

Comment by Arjen Lentz [ 2018-07-06 ]

I can confirm this behaviour occurring in 10.2.12, and have binlog examples which I'll post separately in a private comment.

The GTID is the same across all the replicas, but drifts relative to the committing node.
To me, this indicates a fundamental error: the GTID appears to be incremented independently somewhere in the wsrep layer, rather than passed through as-is from the committing node. That's very dangerous, and in this case clearly fatal; there may be other cases where it causes problems as well. It's the wrong way of doing this.

The objective of a GTID is to be global and consistent; in this case it is neither.
If we were to run an async slave off the cluster, it would break miserably.

Comment by Arjen Lentz [ 2018-07-06 ]

neiltembo
> I'm not sure if it's correct behaviour for the aforementioned query to magically appear
> in the binlog on an implicit table drop, but it's definitely not correct behaviour for the
> query to only exist in the original master's binlog and not have it replicated to any
> other Galera masters (as this causes GTID drift).

If the GTID were passed through the cluster, it could not drift.
Any individual transaction would just get whatever GTID it had on the committing node.
Any DDL statements that aren't sent through the cluster would not be able to affect this.

I conclude that GTID is not actually passed through the cluster, but updated "magically" somewhere in wsrep space, and that's wrong and very dangerous.
Anyhow, I've provided binlog snippets that illustrate this in action.
I trust the devs will look at it with urgency.

Comment by Jan Lindström (Inactive) [ 2019-12-12 ]

Will be fixed in MDEV-20720.
