[MDEV-16877] system_versioning_alter_history with galera corruption Created: 2018-08-01 Updated: 2023-06-06 Resolved: 2023-06-06 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Data Definition - Alter Table, Galera, Versioned Tables |
| Affects Version/s: | 10.3.8 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Vincent Milum Jr | Assignee: | Nikita Malyavin |
| Resolution: | Won't Fix | Votes: | 2 |
| Labels: | None | ||
| Environment: |
10.3.8-MariaDB-1:10.3.8+maria~stretch-log mariadb.org binary distribution |
||
| Issue Links: |
|
||||||||
| Description |
|
ALTER TABLE is not propagating over Galera cluster for tables with SYSTEM VERSIONING enabled. Even if the other nodes in the cluster have system_versioning_alter_history set to "KEEP" globally, they're not receiving the table schema alterations. Attempting to modify data in one of these tables after the schema mismatch causes a whole world of problems, worse than any other time I've seen inconsistent schemas in the past. Every node in the Galera cluster shuts down (this is to be expected). However, afterwards, it was near impossible to bring the cluster back online. All but one node in the cluster lost their "uuid" info. Their grastate.dat contained the following contents:
Attempting to modify the safe_to_bootstrap value to "1" and running galera_new_cluster or service mysql start would fail. Attempting to manually run mysqld_safe --wsrep_new_cluster would show the following on the console, followed by 100% CPU usage on all cores, while never actually getting anywhere, even after letting it sit several minutes (no disk activity at all btw)
Luckily, there was one single node in the cluster that still had proper "uuid" information and I was able to bootstrap the cluster from there, then one by one have every other node run a full SST to get their contents back (but this is extremely time consuming) |
| Comments |
| Comment by Elena Stepanova [ 2018-08-03 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Please provide the table structure, example of the ALTER statement which is not replicated, error logs from the node which performs the ALTER and the one that should receive the change bug doesn't, and configuration files for both. I can't reproduce it on a basic example:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vincent Milum Jr [ 2018-08-03 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The bug happens if system_versioning_alter_history is changed while MariaDB is running. The setting is unset in the cnf file, therefor defaulting to "ERROR" Installations are fresh Debian 9 VMs with official MariaDB binaries running with stock configuration with the following file added. Config on both nodes:
On each node, the following is executed:
Node 1:
Node 2:
Node 2 Log:
Because the Galera session is already started, I believe it is inheriting, and thus storing a copy of system_versioning_alter_history='error', so when the global variable is changed while the instance is running, the command execution will still fail for child nodes. Only having the system_versioning_alter_history='keep' in the cnf file followed by a cluster (rolling) restart seems to work properly.
If it is indeed an issue with the Galera session's instance of this particular variable, it might be wise to just force it to 'keep' always for that particular session. if it is set to 'error' globally or for a user session, the alter table statement will fail before even being propagated to other galera nodes, so there shoulnd't be any conflicts going on. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2018-09-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
This is expectedly true. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Christian Ciach [ 2021-08-25 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The worst thing about this issue is that an unprivileged user can destroy the whole cluster just by setting an otherwise harmless session variable. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vincent Milum Jr [ 2021-08-25 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
That's a very interesting thought I didn't even have. This should probably be escalated as a security issue then, since a minimally privileged user could essentially "crash" an entire MariaDB cluster taking the entire thing offline. And again, this EXACT same issue existed for non-Galera, traditional replication. That EXACT same fix can be applied here. There is little reason why this should literally be taking years to address!? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström [ 2023-06-06 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
10.3 is EOL soon. |