Status: Closed
Affects Version/s: 10.5.4, 10.5.2, 10.5.3, 10.5.5, 10.5.6, 10.5.7, 10.5.8, 10.5.9, 10.5.10, 10.6.0, 10.6.1
Environment: Operating System: Amazon Linux 2 AMI
Hardware: Amazon AWS t3a.medium
3-Node Galera Cluster
We have several production databases that we migrated in early 2021 from a single-node setup to a 3-node Galera Cluster for high availability. During the migration project we implemented a tool for managing the cluster. One of its features is the "cycle-db-cluster" functionality, which replaces the oldest node of the cluster with a completely new one. Since the project we have kept the cluster up to date by periodically replacing nodes with this automated process.
On 2021-05-11 the replace process failed because a new 10.5.10 node could not join our cluster, which consisted of 10.5.9 nodes. In mysqld.log the new node reported [ERROR] InnoDB: Not applying INSERT_REUSE_REDUNDANT due to corruption on [page id: space=0, page number=14408]. We repeated the process several times, also with MariaDB version 10.5.9 and with a new virtual server each time, but the issue persisted. Our mysqldumps worked just fine, so we scheduled a service break for 2021-05-12 and started a new cluster using MariaDB 10.5.10 and the data from the mysqldumps. An important detail is that we imported the mysqldumps into a 1-node cluster and added 2 nodes after the imports had finished, so joining definitely worked shortly after the import. We also replaced the first node of the cluster using the "cycle-db-cluster" functionality about 1 hour after the service break.
On 2021-05-20 we tried to replace a node in the cluster for the first time since the day we set it up. Again, joining a new node to the cluster failed, and mysqld.log on the joining node contained [ERROR] InnoDB: Not applying INSERT_REUSE_REDUNDANT due to corruption on [page id: space=0, page number=2366]. So now we have an almost new 3-node 10.5.10 cluster that new nodes cannot join. The cluster itself continues to perform just fine and mysqldumps succeed. To me it seems that the system detects corruption for some reason while the data inside the databases is actually totally OK.
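To compare the affected pages across join attempts, a small script along these lines could pull the page ids out of mysqld.log. This is only a sketch: the regular expression assumes the exact message format quoted above, and the function name find_corruption_reports is hypothetical, not part of our tooling or of MariaDB.

```python
import re
from typing import Iterable, List, Tuple

# Assumed message format, taken from the two log lines quoted in this report.
CORRUPTION_RE = re.compile(
    r"\[ERROR\] InnoDB: Not applying (\w+) due to corruption on "
    r"\[page id: space=(\d+), page number=(\d+)\]"
)


def find_corruption_reports(lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Return (record_type, space_id, page_number) for each matching log line."""
    reports = []
    for line in lines:
        match = CORRUPTION_RE.search(line)
        if match:
            record_type, space_id, page_number = match.groups()
            reports.append((record_type, int(space_id), int(page_number)))
    return reports
```

An open file handle can be passed directly, for example `find_corruption_reports(open("mysqld.log", errors="replace"))`, since the function only iterates over lines.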
Contains the log of the automated cycle-db-cluster process. It details how a new db server was set up from scratch, all the way to the attempt to start the mysqld process.
Database configuration files that we use for our nodes. These have a few variables that are replaced automatically during the node setup process.
mysqld.log from the node join failure on 2021-05-20.