[MDEV-27528] Cluster nodes become unstable when a transaction is replicated through Async replication from Galera Primary node Created: 2022-01-17  Updated: 2022-02-01  Resolved: 2022-02-01

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.8
Fix Version/s: 10.8.1

Type: Bug Priority: Critical
Reporter: Ramesh Sivaraman Assignee: Andrei Elkin
Resolution: Fixed Votes: 0
Labels: None

Attachments: my.cnf     n1.cnf     n2.cnf
Issue Links:
Blocks
blocks MDEV-11675 Lag Free Alter On Slave Closed
Relates
relates to MDEV-11675 Lag Free Alter On Slave Closed

 Description   

Cluster nodes become unstable when a transaction is replicated through async replication from the Galera primary node.

Testcase

GALERA_BASE=/home/ramesh/rpl/mariadb-10.8.0-linux-x86_64_slave
MASTER_BASE=/home/ramesh/rpl/mariadb-10.8.0-linux-x86_64_master
DATADIR=/home/ramesh/rpl
 
rm -Rf $DATADIR/node* $DATADIR/data
 
$GALERA_BASE/scripts/mariadb-install-db --no-defaults --force --auth-root-authentication-method=normal  --basedir=$GALERA_BASE --datadir=$DATADIR/node1
$GALERA_BASE/scripts/mariadb-install-db --no-defaults --force --auth-root-authentication-method=normal  --basedir=$GALERA_BASE --datadir=$DATADIR/node2
$MASTER_BASE/scripts/mariadb-install-db --no-defaults --force --auth-root-authentication-method=normal  --basedir=$MASTER_BASE --datadir=$DATADIR/data
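
The attached n1.cnf and n2.cnf are not reproduced in this report. For readers without the attachments, a Galera node config along the following lines should match the topology used above. Every value here is an assumption inferred from the paths in this script, not the contents of the actual attachments; n2.cnf would differ only in datadir, socket, and ports.

```ini
# Hypothetical minimal n1.cnf sketch -- the real attachment may differ.
[mysqld]
basedir    = /home/ramesh/rpl/mariadb-10.8.0-linux-x86_64_slave
datadir    = /home/ramesh/rpl/node1
socket     = /home/ramesh/rpl/node1/node1_socket.sock
port       = 4041                      # assumed; any free port
server_id  = 2
log_bin    = binlog
log_slave_updates = ON                 # needed so async-replicated events reach the cluster
binlog_format     = ROW

wsrep_on              = ON
wsrep_provider        = /usr/lib/galera/libgalera_smm.so   # path varies by install
wsrep_cluster_address = gcomm://127.0.0.1:4567,127.0.0.1:4568
wsrep_sst_method      = rsync
```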
 
 
$GALERA_BASE/bin/mysqld --defaults-file=$DATADIR/n1.cnf --wsrep-new-cluster > $DATADIR/node1/node1.err 2>&1 & 
sleep 2
$GALERA_BASE/bin/mysqladmin  -uroot -S$DATADIR/node1/node1_socket.sock ping
 
$GALERA_BASE/bin/mysqld --defaults-file=$DATADIR/n2.cnf > $DATADIR/node2/node2.err 2>&1 &
 
 
$GALERA_BASE/bin/mysql -uroot -S$DATADIR/node1/node1_socket.sock -e 'CREATE DATABASE IF NOT EXISTS test;'
 
 
 
$MASTER_BASE/bin/mysqld --defaults-file=$DATADIR/my.cnf  > $DATADIR/data/mysql.err 2>&1 & 
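
The attached my.cnf for the standalone master is likewise not reproduced here. A sketch of what it would minimally contain, based on the port and socket used later in this report (all values are assumptions, not the actual attachment):

```ini
# Hypothetical minimal my.cnf sketch for the standalone master -- the real attachment may differ.
[mysqld]
basedir   = /home/ramesh/rpl/mariadb-10.8.0-linux-x86_64_master
datadir   = /home/ramesh/rpl/data
socket    = /home/ramesh/rpl/data/socket.sock
port      = 4040            # matches MASTER_PORT in the CHANGE MASTER command
server_id = 1
log_bin   = binlog
binlog_format = ROW         # assumed; ROW is typical when replicating into Galera
```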
 
 
-- connection master server
$MASTER_BASE/bin/mysql -uroot --socket=/home/ramesh/rpl/data/socket.sock
delete from mysql.user where user='';
create user repl@'%' identified by 'repl';
grant all on *.* to  repl@'%';
flush privileges;
 
-- connection galera async slave node1
$GALERA_BASE/bin/mysql -uroot -S$DATADIR/node1/node1_socket.sock
CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=4040, MASTER_USER='repl',
  MASTER_PASSWORD='repl', MASTER_USE_GTID=slave_pos;
START SLAVE;
SHOW SLAVE STATUS \G
 
-- connection master server
 
CREATE TABLE t1 (a INT) ENGINE=innodb;
ALTER TABLE t1 ADD COLUMN b int;
 
-- connection galera async slave node1
FLUSH TABLES WITH READ LOCK;
 
-- connection master server
INSERT INTO t1 VALUES(1,1);
 
-- connection galera async slave node1
UNLOCK TABLES;
 
Check Galera node2's wsrep state:
 
MariaDB [test]> select * from t1;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use
MariaDB [test]> show status like '%wsrep%st%';
+------------------------------+--------------------------------------+
| Variable_name                | Value                                |
+------------------------------+--------------------------------------+
| wsrep_local_state_uuid       | 00000000-0000-0000-0000-000000000000 |
| wsrep_last_committed         | -1                                   |
| wsrep_flow_control_requested | false                                |
| wsrep_cert_deps_distance     | 1                                    |
| wsrep_local_state            | 5                                    |
| wsrep_local_state_comment    | Inconsistent                         |
| wsrep_cluster_capabilities   |                                      |
| wsrep_cluster_conf_id        | 18446744073709551615                 |
| wsrep_cluster_size           | 0                                    |
| wsrep_cluster_state_uuid     | 2602eed2-7790-11ec-8bb1-0ac228ebd8fe |
| wsrep_cluster_status         | Disconnected                         |
+------------------------------+--------------------------------------+
11 rows in set (0.001 sec)
 
MariaDB [test]>

Error info

2022-01-17 17:05:58 0 [Note] WSREP: Member 1.0 (ramesh) synced with group.
2022-01-17 17:05:58 6 [ERROR] mysqld: Error writing file 'binlog' (errno: 1950 "Unknown error 1950")
2022-01-17 17:05:58 6 [ERROR] WSREP: Failed to apply write set: gtid: d49fdaa9-7787-11ec-a527-fbc118afef8b:15 server_id: d49f2223-7787-11ec-bc33-23776de67a0b client_id: 14 trx_id: 104 flags: 3 (start_transaction | commit)
2022-01-17 17:05:58 6 [Note] WSREP: Closing send monitor...
2022-01-17 17:05:58 6 [Note] WSREP: Closed send monitor.
2022-01-17 17:05:58 6 [Note] WSREP: gcomm: terminating thread
2022-01-17 17:05:58 6 [Note] WSREP: gcomm: joining thread
2022-01-17 17:05:58 6 [Note] WSREP: gcomm: closing backend
2022-01-17 17:05:59 6 [Note] WSREP: view(view_id(NON_PRIM,0a3c40bc-ae30,3) memb {
        e6188585-893a,0
} joined {
} left {
} partitioned {
        0a3c40bc-ae30,0
        d49f2223-bc33,0
})
2022-01-17 17:05:59 6 [Note] WSREP: PC protocol downgrade 1 -> 0
2022-01-17 17:05:59 6 [Note] WSREP: view((empty))
2022-01-17 17:05:59 6 [Note] WSREP: gcomm: closed
2022-01-17 17:05:59 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2022-01-17 17:05:59 0 [Note] WSREP: Flow-control interval: [16, 16]
2022-01-17 17:05:59 0 [Note] WSREP: Received NON-PRIMARY.
2022-01-17 17:05:59 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 15)
2022-01-17 17:05:59 0 [Note] WSREP: New SELF-LEAVE.
2022-01-17 17:05:59 0 [Note] WSREP: Flow-control interval: [0, 0]
2022-01-17 17:05:59 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2022-01-17 17:05:59 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 15)



 Comments   
Comment by Ramesh Sivaraman [ 2022-01-21 ]

The issue is also reproduced without FTWRL.
Testcase (run these SQL statements on the standalone master node and check Galera node2's status):

CREATE TABLE t1 (f1 INTEGER PRIMARY KEY AUTO_INCREMENT, f2 INTEGER);
ALTER TABLE t1 ADD COLUMN f3 INTEGER;
INSERT INTO t1 (f1, f2) VALUES (DEFAULT, 123);

Comment by Andrei Elkin [ 2022-01-21 ]

ramesh, I ran these three lines, and it's fine when I use the galera.galera_as_master test to bootstrap a cluster of two nodes and one slave.
Specifically on slave:

 
              Master_Log_File: mysqld-bin.000002
           Read_Master_Log_Pos: 1116
                Relay_Log_File: mysqld-relay-bin.000004
                 Relay_Log_Pos: 1348
         Relay_Master_Log_File: mysqld-bin.000002
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
 

19:48:46 [test]> select * from t1;
+----+------+------+
| f1 | f2   | f3   |
+----+------+------+
|  1 |  123 | NULL |
+----+------+------+

Could you please arrange your report into an MTR test? I suggest using the mentioned one as a template.
Thank you.

Comment by Ramesh Sivaraman [ 2022-01-24 ]

Elkin, I'm not so familiar with MTR test creation. The failure is in the Galera secondary nodes, not in the async replication node (Galera primary node). The failure scenario is as follows:
Master (standalone server) A -------> Galera cluster B (2-node cluster, where B_node1 is configured as an async slave to master A)

Once the SQL statements are executed on the master, Galera cluster node2 (B_node2) becomes unstable. The statements are replicated on async slave node1 (B_node1) without any issues.

Comment by Andrei Elkin [ 2022-01-26 ]

The latest commit is in bb-10.8-andrei.

Comment by Brandon Nesterenko [ 2022-01-27 ]

Patch Review

Comment by Andrei Elkin [ 2022-01-27 ]

New commit is pushed to bb-10.8-andrei.

Comment by Brandon Nesterenko [ 2022-01-27 ]

Reviewed latest patch

Comment by Brandon Nesterenko [ 2022-01-27 ]

Looks good/approved!

Comment by Andrei Elkin [ 2022-02-01 ]

Fixed as part of the MDEV-11675 commit.

Generated at Thu Feb 08 09:53:35 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.