Issue summary: 3 problems occur when a node fails while a Galera fragmented (streaming replication) transaction is running
Problem #1: SST is triggered to recover the failed node, but IST is expected
Problem #2: in some test runs, the failed node repeatedly crashes with signal 11 until the transaction on node 1 is committed
Problem #3: the local state of the donor node unexpectedly changes to "Donor/Desynced" after the failed node recovers

####################################################################################################
# Variables - set the IPs for your environment
export NODE1_IP=192.168.159.129
export NODE2_IP=192.168.159.130
export NODE3_IP=192.168.159.131

####################################################################################################
# Env setup
# Note: run on fresh installs of 3 VMs with Red Hat/CentOS 7

#### node 1
rm -rf /etc/my.cnf.d/galera.cnf
touch /var/log/mysqld.log ; chown mysql:mysql /var/log/mysqld.log
echo "[mysqld]" >> /etc/my.cnf.d/galera.cnf
echo "log_error=/var/log/mysqld.log" >> /etc/my.cnf.d/galera.cnf
echo "default_storage_engine=InnoDB" >> /etc/my.cnf.d/galera.cnf
echo "binlog_format=row" >> /etc/my.cnf.d/galera.cnf
echo "innodb_autoinc_lock_mode=2" >> /etc/my.cnf.d/galera.cnf
echo "" >> /etc/my.cnf.d/galera.cnf
echo "# Galera cluster configuration" >> /etc/my.cnf.d/galera.cnf
echo "wsrep_on=ON" >> /etc/my.cnf.d/galera.cnf
echo "wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so" >> /etc/my.cnf.d/galera.cnf
echo "wsrep_cluster_address=\"gcomm://${NODE1_IP},${NODE2_IP},${NODE3_IP}\" " >> /etc/my.cnf.d/galera.cnf
echo "wsrep_cluster_name=mariadb-galera-cluster" >> /etc/my.cnf.d/galera.cnf
echo "wsrep_sst_method=mariabackup" >> /etc/my.cnf.d/galera.cnf
echo "wsrep_sst_auth=\"mariabackup:mypassword123$\" " >> /etc/my.cnf.d/galera.cnf
echo "" >> /etc/my.cnf.d/galera.cnf
echo "# Cluster node configuration" >> /etc/my.cnf.d/galera.cnf
echo "wsrep_node_address=${NODE1_IP}" >> /etc/my.cnf.d/galera.cnf
echo "wsrep_node_name=galera-db-01" >> /etc/my.cnf.d/galera.cnf
systemctl stop firewalld
systemctl disable firewalld
systemctl stop mariadb
systemctl set-environment _WSREP_NEW_CLUSTER='--wsrep-new-cluster'
systemctl start mariadb
systemctl unset-environment _WSREP_NEW_CLUSTER
mysql
CREATE USER 'mariabackup'@'localhost' IDENTIFIED BY 'mypassword123$';
GRANT RELOAD, PROCESS, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'mariabackup'@'localhost';

#### node 2
rm -rf /etc/my.cnf.d/galera.cnf
touch /var/log/mysqld.log ; chown mysql:mysql /var/log/mysqld.log
echo "[mysqld]" >> /etc/my.cnf.d/galera.cnf
echo "log_error=/var/log/mysqld.log" >> /etc/my.cnf.d/galera.cnf
echo "default_storage_engine=InnoDB" >> /etc/my.cnf.d/galera.cnf
echo "binlog_format=row" >> /etc/my.cnf.d/galera.cnf
echo "innodb_autoinc_lock_mode=2" >> /etc/my.cnf.d/galera.cnf
echo "" >> /etc/my.cnf.d/galera.cnf
echo "# Galera cluster configuration" >> /etc/my.cnf.d/galera.cnf
echo "wsrep_on=ON" >> /etc/my.cnf.d/galera.cnf
echo "wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so" >> /etc/my.cnf.d/galera.cnf
echo "wsrep_cluster_address=\"gcomm://${NODE1_IP},${NODE2_IP},${NODE3_IP}\" " >> /etc/my.cnf.d/galera.cnf
echo "wsrep_cluster_name=mariadb-galera-cluster" >> /etc/my.cnf.d/galera.cnf
echo "wsrep_sst_method=mariabackup" >> /etc/my.cnf.d/galera.cnf
echo "wsrep_sst_auth=\"mariabackup:mypassword123$\" " >> /etc/my.cnf.d/galera.cnf
echo "" >> /etc/my.cnf.d/galera.cnf
echo "# Cluster node configuration" >> /etc/my.cnf.d/galera.cnf
echo "wsrep_node_address=${NODE2_IP}" >> /etc/my.cnf.d/galera.cnf
echo "wsrep_node_name=galera-db-02" >> /etc/my.cnf.d/galera.cnf
systemctl stop firewalld
systemctl disable firewalld
systemctl start mariadb

#### node 3 - arbitrator
systemctl stop firewalld
systemctl disable firewalld
rm -rf /etc/garbd.cnf
echo "group=\"mariadb-galera-cluster\" " >> /etc/garbd.cnf
echo "address=\"gcomm://${NODE1_IP},${NODE2_IP},${NODE3_IP}\" " >> /etc/garbd.cnf
echo "log=\"/var/log/garbd.log\" " >> /etc/garbd.cnf
garbd --cfg /etc/garbd.cnf
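Before starting the test case it helps to confirm that all three members have joined and that the local node is Synced. This is a minimal sketch, assuming the arbitrator on node 3 is counted in `wsrep_cluster_size` (so the expected size is 3) and that the status values arrive as the tab-separated output of `mysql -N -B`; the function name `cluster_ready` is hypothetical and not part of MariaDB or Galera.

```shell
#!/usr/bin/env bash
# Sketch: check that the cluster is ready before running the test case.
# Assumption: garbd on node 3 is counted as a member, so 3 members are expected.
# Input on stdin: "name<TAB>value" lines, as printed by `mysql -N -B`.
# Prints "ready" and returns 0, or prints what was seen and returns 1.
cluster_ready() {
  local expected_size="${1:-3}"
  awk -F'\t' -v want="$expected_size" '
    $1 == "wsrep_cluster_size"        { size = $2 }
    $1 == "wsrep_local_state_comment" { state = $2 }
    END {
      if (size == want && state == "Synced") { print "ready"; exit 0 }
      printf "not ready: size=%s state=%s\n", size, state
      exit 1
    }'
}

# Usage on node 1 or node 2 (not run here):
#   mysql -N -B -e "SHOW STATUS WHERE Variable_name IN ('wsrep_cluster_size','wsrep_local_state_comment')" | cluster_ready 3
```

Reading from stdin keeps the check separate from how the client is invoked, so the same function works with whatever credentials or socket options the environment needs.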
####################################################################################################
# Test Case

########## at node 1 - enable transaction fragmentation and run a fragmented transaction
# Note: autocommit is set to OFF to make the case easier to simulate
mysql
create database testdb1 ;
create table testdb1.t1 (c1 int , primary key (c1) ) ;
set SESSION autocommit=OFF ;
SET SESSION wsrep_trx_fragment_unit='bytes' ;
SET SESSION wsrep_trx_fragment_size=10485760 ;
DELIMITER |
BEGIN NOT ATOMIC
  DECLARE i INTEGER;
  SET i = 1;
  WHILE i <= 1000000 DO
    insert into testdb1.t1 values (i) ;
    SET i = i + 1;
  END WHILE;
end|
DELIMITER ;
select count(*) from testdb1.t1 ;
show status like 'wsrep%state%' ;
# do NOT commit

########## at node 2 - kill the DB instance process and let the DB instance restart and recover
kill -9 $(ps -ef | grep mariadbd | grep -v grep | awk '{ print $2 }')

Problem #1: SST is triggered to recover the failed node, but IST is expected
Problem #2: in some test runs, the failed node repeatedly crashes with signal 11 until the transaction on node 1 is committed

MariaDB [(none)]> show status like 'wsrep_local_state_comment' ;
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
1 row in set (0.000 sec)

########## at node 1 - local node state unexpectedly changed to "Donor/Desynced"
########## expected local state: "Synced"
Note: changes on node 1 are still replicated to node 2, and vice versa
Problem #3: the local state of the donor node unexpectedly changed to "Donor/Desynced" after the failed node recovered

mysql
MariaDB [(none)]> show status like 'wsrep%state%' ;
+---------------------------+--------------------------------------+
| Variable_name             | Value                                |
+---------------------------+--------------------------------------+
| wsrep_local_state_uuid    | 092fcc57-601e-11ec-8371-1eb4af1ceb2f |
| wsrep_local_state         | 2                                    |
| wsrep_local_state_comment | Donor/Desynced                       |
| wsrep_evs_state           | OPERATIONAL                          |
| wsrep_cluster_state_uuid  | 092fcc57-601e-11ec-8371-1eb4af1ceb2f |
+---------------------------+--------------------------------------+
5 rows in set (0.001 sec)

Dec 19 04:17:06 localhost mariadbd: 2021-12-19 4:17:06 11 [Note] WSREP: Desyncing and pausing the provider
Dec 19 04:17:06 localhost mariadbd: 2021-12-19 4:17:06 11 [Note] WSREP: pause
Dec 19 04:17:06 localhost mariadbd: 2021-12-19 4:17:06 11 [Note] WSREP: Provider paused at 092fcc57-601e-11ec-8371-1eb4af1ceb2f:100628 (61)
Dec 19 04:17:06 localhost mariadbd: 2021-12-19 4:17:06 11 [Note] WSREP: Provider paused at: 100628
Dec 19 04:17:07 localhost mariadbd: 2021-12-19 4:17:07 11 [Note] WSREP: Resuming and resyncing the provider
Dec 19 04:17:07 localhost mariadbd: 2021-12-19 4:17:07 11 [Note] WSREP: resume
Dec 19 04:17:07 localhost mariadbd: 2021-12-19 4:17:07 11 [Note] WSREP: resuming provider at 61
Dec 19 04:17:07 localhost mariadbd: 2021-12-19 4:17:07 11 [Note] WSREP: Provider resumed.
Dec 19 04:17:07 localhost mariadbd: 2021-12-19 4:17:07 0 [Note] WSREP: SST sent: 092fcc57-601e-11ec-8371-1eb4af1ceb2f:100627
Dec 19 04:17:07 localhost mariadbd: 2021-12-19 4:17:07 0 [Note] WSREP: Server status change donor -> joined
####################################################################################################
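Since Problem #2 only shows up in some runs, it can help to pull the donor-side markers out of the log mechanically and compare them across repeated runs. This is a sketch that only matches the message text seen in this report (`WSREP: SST sent:` and `WSREP: Server status change`); other MariaDB or Galera versions may word these lines differently, and the helper names `sst_sent_positions` and `status_changes` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: extract donor-side state transfer markers from a MariaDB/Galera log.
# Assumption: log lines use the wording seen in this report, e.g.
#   "... WSREP: SST sent: <uuid>:<seqno>"
#   "... WSREP: Server status change donor -> joined"
# Both helpers read the named files, or stdin when no file is given.
# -h suppresses file-name prefixes when several files are passed.

# Print "<uuid>:<seqno>" for every SST this node reports as sent.
sst_sent_positions() {
  grep -h -o 'SST sent: [0-9a-f-]*:[0-9-]*' "$@" | sed 's/^SST sent: //'
}

# Print every server status transition, e.g. "donor -> joined".
status_changes() {
  grep -h -o 'Server status change [a-z]* -> [a-z]*' "$@" | sed 's/^Server status change //'
}

# Usage on node 1 after node 2 has rejoined (not run here); point it at
# whichever file holds the server log on that host:
#   sst_sent_positions /var/log/mysqld.log
#   status_changes /var/log/mysqld.log
```

Applied to the excerpt above, the first helper reports the single SST position `092fcc57-601e-11ec-8371-1eb4af1ceb2f:100627` and the second reports `donor -> joined`, which is the transition that precedes the unexpected "Donor/Desynced" state in Problem #3.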