[MDEV-28461] semisync-slave server recovery fails to rollback prepared transaction Created: 2022-05-03  Updated: 2023-11-27  Resolved: 2022-05-17

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.6.2, 10.7, 10.8, 10.9
Fix Version/s: 10.6.8, 10.7.4, 10.8.3, 10.9.1

Type: Bug Priority: Major
Reporter: Allen Lee (Inactive) Assignee: Andrei Elkin
Resolution: Fixed Votes: 1
Labels: None

Attachments: File maxscale.cnf     File primary-server.cnf     File rep_master_semi.7z     File rep_slave_semi.7z     File replica-server.cnf     File sysbench_insert_only_10.6.jmx    

 Description   
  • Test config
  • Primary(master) : 192.168.254.24
  • Primary(master) : 192.168.254.25
  • MaxScale : 192.168.254.26
  • test table schema and query.
  • max client sessions : 5000 concurrent user
  • workload generator and test project file : Jmeter [^sysbench_insert_only_10.6.jmx]

MariaDB [test]> show create table au_test\G
*************************** 1. row ***************************
       Table: au_test
Create Table: CREATE TABLE `au_test` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `b` varchar(10) NOT NULL COMMENT '@@hostname',
  `c` datetime DEFAULT NULL,
  `d` datetime DEFAULT NULL,
  `e` bigint(20) DEFAULT NULL,
  `f` varchar(500) DEFAULT NULL,
  `g` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=457948 DEFAULT CHARSET=latin1;
 
insert into au_test (b,c,d,e,f,g) values (@@hostname,now(), now(), last_insert_id(), (select variable_value from information_schema.session_status where variable_name = 'Rpl_semi_sync_slave_status'), cast(@@gtid_binlog_state as varchar(100)));

Once master joins as a master once again after `kill -9 $(pidof mariadbd)` against master node, slave status changes to `running` instead of 'slave, running'

Here is the output of `maxctrl list status`:

 
{"timestamp":"2022-05-03 14:26:04.907","command":"maxctrl list servers","status":true,"output":"
┌─────────┬────────────────┬──────┬─────────────┬─────────────────┬─────────────┐
│ Server  │ Address        │ Port │ Connections │ State           │ GTID        │
├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
│ primary │ 192.168.254.2433064195        │ Master, Running │ 0-1-7533546
├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
│ replica │ 192.168.254.2533064072        │ Slave, Running  │ 0-2-7496224
└─────────┴────────────────┴──────┴─────────────┴─────────────────┴─────────────┘
 
_this is the time when master server was forcefully killed:
----------------------------------------------------------------
[root@primary my.cnf.d]# date
Tue May  3 14:26:04 KST 2022
[root@primary my.cnf.d]# kill -9 $(pidof mariadbd)
----------------------------------------------------------------_
 
{"timestamp":"2022-05-03 14:26:07.547","command":"maxctrl list servers","status":true,"output":"
┌─────────┬────────────────┬──────┬─────────────┬────────────────┬─────────────┐
│ Server  │ Address        │ Port │ Connections │ State          │ GTID        │
├─────────┼────────────────┼──────┼─────────────┼────────────────┼─────────────┤
│ primary │ 192.168.254.2433060           │ Down           │ 0-1-7533546
├─────────┼────────────────┼──────┼─────────────┼────────────────┼─────────────┤
│ replica │ 192.168.254.2533064303        │ Slave, Running │ 0-2-7496224
└─────────┴────────────────┴──────┴─────────────┴────────────────┴─────────────┘
 
 
{"timestamp":"2022-05-03 14:26:10.969","command":"maxctrl list servers","status":true,"output":"
┌─────────┬────────────────┬──────┬─────────────┬────────────────┬─────────────┐
│ Server  │ Address        │ Port │ Connections │ State          │ GTID        │
├─────────┼────────────────┼──────┼─────────────┼────────────────┼─────────────┤
│ primary │ 192.168.254.2433060           │ Down           │ 0-1-7533546
├─────────┼────────────────┼──────┼─────────────┼────────────────┼─────────────┤
│ replica │ 192.168.254.2533064344        │ Slave, Running │ 0-2-7496224
└─────────┴────────────────┴──────┴─────────────┴────────────────┴─────────────┘
 
 
{"timestamp":"2022-05-03 14:26:13.606","command":"maxctrl list servers","status":true,"output":" => master was successfully rejoin as master again.
┌─────────┬────────────────┬──────┬─────────────┬─────────────────┬─────────────┐
│ Server  │ Address        │ Port │ Connections │ State           │ GTID        │
├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
│ primary │ 192.168.254.2433062275        │ Master, Running │ 0-1-7534828
├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
│ replica │ 192.168.254.2533064344        │ Slave, Running  │ 0-2-7496224
└─────────┴────────────────┴──────┴─────────────┴─────────────────┴─────────────┘
 
 
*{"timestamp":"2022-05-03 14:26:15.677","command":"maxctrl list servers","status":true,"output":"  => At this point  Slave got changed into Running state due to Duplicate key error.
┌─────────┬────────────────┬──────┬─────────────┬─────────────────┬─────────────┐
│ Server  │ Address        │ Port │ Connections │ State           │ GTID        │
├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
│ primary │ 192.168.254.2433063069        │ Master, Running │ 0-1-7534828
├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
│ replica │ 192.168.254.2533060           │ Running         │ 0-1-7496290
└─────────┴────────────────┴──────┴─────────────┴─────────────────┴─────────────┘
*
 
{"timestamp":"2022-05-03 14:26:26.720","command":"maxctrl list servers","status":true,"output":"
┌─────────┬────────────────┬──────┬─────────────┬─────────────────┬─────────────┐
│ Server  │ Address        │ Port │ Connections │ State           │ GTID        │
├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
│ primary │ 192.168.254.2433062927        │ Master, Running │ 0-1-7534828
├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
│ replica │ 192.168.254.2533060           │ Running         │ 0-1-7496290
└─────────┴────────────────┴──────┴─────────────┴─────────────────┴─────────────┘

I've run this test for quite some time, but still hitting this error quite a lot.

-slave server error log

2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 816
2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 1577
2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 643
2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 981
2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 812
2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 1398
2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 803
2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 857
2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 833
2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 894
2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 1788
2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 846
2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 879
2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 1118
2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 490
2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 1330
2022-05-03 14:25:50 0 [Note] Successfully truncated binlog file:./mariadb-bin.000007 from previous file size 1358 to pos:360 to remove transactions starting from GTID 0-2-7496224
2022-05-03 14:25:50 0 [Note] Server socket created on IP: '0.0.0.0'.
2022-05-03 14:25:50 0 [Warning] 'proxies_priv' entry '@% root@node1' ignored in --skip-name-resolve mode.
2022-05-03 14:25:50 0 [Note] Reading of all Master_info entries succeeded
2022-05-03 14:25:50 0 [Note] Added new Master_info '' to hash table
2022-05-03 14:25:50 0 [Note] /usr/sbin/mariadbd: ready for connections.
Version: '10.6.7-3-MariaDB-enterprise-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Enterprise Server
2022-05-03 14:25:53 7 [Note] Master connection name: ''  Master_info_file: 'master.info'  Relay_info_file: 'relay-log.info'
2022-05-03 14:25:53 7 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='192.168.254.24', master_port='3306', master_log_file='', master_log_pos='4'.
2022-05-03 14:25:53 7 [Note] Previous Using_Gtid=No. New Using_Gtid=Current_Pos
2022-05-03 14:25:53 10 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@192.168.254.24:3306' in log '' at position 4
2022-05-03 14:25:53 11 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './mariadb-relay-bin.000001' position: 4; GTID position '0-2-7496223'
2022-05-03 14:25:53 10 [Note] Slave I/O thread: connected to master 'repl@192.168.254.24:3306',replication starts at GTID position '0-2-7496223'
2022-05-03 14:26:04 10 [ERROR] Error reading packet from server: Lost connection to server during query (server_errno=2013)
2022-05-03 14:26:04 10 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000021' at position 41641390; GTID position '0-1-7534857'
2022-05-03 14:26:04 10 [ERROR] Slave I/O: error reconnecting to master 'repl@192.168.254.24:3306' - retry-time: 60  maximum-retries: 100000  message: Can't connect to server on '192.168.254.24' (111 "Connection refused"), Internal MariaDB error code: 2003
2022-05-03 14:26:13 11 [Warning] Timeout waiting for reply of binlog (file: mariadb-bin.000009, pos: 904), semi-sync up to file , position 0.
2022-05-03 14:26:13 11 [Note] Semi-sync replication switched OFF.
2022-05-03 14:26:13 11 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table test.au_test; Duplicate entry '417427' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mariadb-bin.000021, end_log_pos 22320629, Gtid 0-1-7496291, Internal MariaDB error code: 1062
2022-05-03 14:26:13 11 [Warning] Slave: Duplicate entry '417427' for key 'PRIMARY' Error_code: 1062
2022-05-03 14:26:13 11 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000021' position 22320161; GTID position '0-1-7496290'
2022-05-03 14:26:13 11 [Note] Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000021' at position 22320161; GTID position '0-1-7496290', master: 192.168.254.24:3306
2022-05-03 14:27:04 10 [Note] Slave: connected to master 'repl@192.168.254.24:3306',replication resumed in log 'mariadb-bin.000021' at position 41641390
2022-05-03 14:31:14 2854 [Warning] Aborted connection 2854 to db: 'test' user: 'allen' host: '192.168.254.26' (Got timeout reading communication packets)
2022-05-03 14:31:14 2876 [Warning] Aborted connection 2876 to db: 'test' user: 'allen' host: '192.168.254.26' (Got timeout reading communication packets)

nothing much in master's error log.

2022-05-03 14:26:10 0 [Note] Server socket created on IP: '0.0.0.0'.
2022-05-03 14:26:10 0 [Warning] 'proxies_priv' entry '@% root@node1' ignored in --skip-name-resolve mode.
2022-05-03 14:26:10 0 [Note] Reading of all Master_info entries succeeded
2022-05-03 14:26:10 0 [Note] Added new Master_info '' to hash table
2022-05-03 14:26:10 0 [Note] /usr/sbin/mariadbd: ready for connections.
Version: '10.6.7-3-MariaDB-enterprise-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Enterprise Server
2022-05-03 14:26:14 802 [Note] InnoDB: Number of pools: 2
2022-05-03 14:26:21 1851 [Note] InnoDB: Number of pools: 3
2022-05-03 14:26:32 354 [Warning] Timeout waiting for reply of binlog (file: mariadb-bin.000022, pos: 859), semi-sync up to file , position 0.
2022-05-03 14:26:32 354 [Note] Semi-sync replication switched OFF.
2022-05-03 14:27:04 1991 [Note] Start binlog_dump to slave_server(2), pos(mariadb-bin.000021, 41641390), using_gtid(1), gtid('0-1-7534857')
2022-05-03 14:27:04 1991 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(./mariadb-bin.000022, 4)
2022-05-03 14:27:04 1 [Note] Semi-sync replication switched ON with slave (server_id: 2) at (mariadb-bin.000022, 972296)

rep_master_semi.7z - master(192.168.254.24) server's semi_sync status and show slave status
rep_slave_semi.7z - slave(192.168.254.25) server's semi_sync status and show slave status



 Comments   
Comment by Andrei Elkin [ 2022-05-05 ]

allen.lee@mariadb.com, howdy.

I examined the error log files to find in the primary's mariadb.err some strange looking lines suggesting that
the restarted KILL-9 old master may still be executing some local transactions:

2022-04-25 19:21:05 0 [Note] Successfully truncated binlog file:./mariadb-bin.000008 from previous file size 47372610 to pos:47371145 to remove transactions starting from GTID 0-1-1437105
2022-04-25 19:21:05 0 [Note] Server socket created on IP: '0.0.0.0'.
2022-04-25 19:21:05 0 [Warning] 'proxies_priv' entry '@% root@node1' ignored in --skip-name-resolve mode.
2022-04-25 19:21:05 0 [Note] /usr/sbin/mariadbd: ready for connections.
Version: '10.6.7-3-MariaDB-enterprise-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Enterprise Server
2022-04-25 19:21:05 5 [Note] Master connection name: ''  Master_info_file: 'master.info'  Relay_info_file: 'relay-log.info'
2022-04-25 19:21:05 5 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='192.168.254.25', master_port='3306', master_log_file='', master_log_pos='4'.
2022-04-25 19:21:05 5 [Note] Previous Using_Gtid=No. New Using_Gtid=Current_Pos
2022-04-25 19:21:05 6 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@192.168.254.25:3306' in log '' at position 4
2022-04-25 19:21:05 6 [Note] Slave I/O thread: connected to master 'repl@192.168.254.25:3306',replication starts at GTID position '0-1-1437104'
2022-04-25 19:21:05 7 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './mariadb-relay-bin.000001' position: 4; GTID position '0-1-1437104'
 
2022-04-25 19:21:16 23 [Warning] Timeout waiting for reply of binlog (file: mariadb-bin.000010, pos: 876), semi-sync up to file , position 0.
 
2022-04-25 19:21:16 23 [Note] Semi-sync replication switched OFF.
2022-04-25 19:21:16 20 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table test.au_test; Duplicate entry '340208' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mariadb-bin.000001, end_log_pos 166317905, Gtid 0-2-1437108, Internal MariaDB error code: 1062
 
2022-04-25 19:21:16 20 [Warning] Slave: Duplicate entry '340208' for key 'PRIMARY' Error_code: 1062

With empty lines emphasized

[Warning] Timeout waiting for reply of binlog

can be only on a master-role server. It could have been that the timeout warning followed with that transaction commit which ensues the following duplicate error by the slave applier.
The warning's file: mariadb-bin.000010, pos: 876 simply suggests where to find out what was doing that transaction on the slave.

Could you please investigate that?

Cheers,

Andrei

Comment by Andrei Elkin [ 2022-05-06 ]

allen.lee@mariadb.com,

to '2022-05-03 14:0x', you must mean '14:2x', right?
Because this slave server session

2022-05-03 13:59:15 6 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@192.168.254.25:3306' in log '' at position 4
2022-05-03 13:59:15 7 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './mariadb-relay-bin.000001' position: 4; GTID position '0-1-5986869'
2022-05-03 13:59:15 6 [Note] Slave I/O thread: connected to master 'repl@192.168.254.25:3306',replication starts at GTID position '0-1-5986869'
2022-05-03 13:59:15 6 [ERROR] Error reading packet from server: Error: connecting slave requested to start from GTID 0-1-5986869, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions (server_errno=1236)
2022-05-03 13:59:15 6 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-1-5986869, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions', Internal MariaDB error code: 1236
2022-05-03 13:59:15 6 [Note] Slave I/O thread exiting, read up to log 'FIRST', position 4; GTID position 0-1-5986869, master 192.168.254.25:3306
2022-05-03 14:23:11 7 [Note] Error reading relay log event: slave SQL thread was killed
2022-05-03 14:23:11 7 [Note] Slave SQL thread exiting, replication stopped in log 'FIRST' at position 4; GTID position '0-2-5437670', master: 192.168.254.25:3306

does not any duplicate key error.
That session ends with

2022-05-03 14:23:41 0 [Note] /usr/sbin/mariadbd: Shutdown complete

The next one hits the dup error having the same timeout pattern:

 2022-05-03 14:23:42 6 [Note] Slave I/O thread: connected to master 'repl@192.168.254.25:3306',replication starts at GTID position '0-2-7451561'
2022-05-03 14:24:26 6 [ERROR] Error reading packet from server: Lost connection to server during query (server_errno=2013)
2022-05-03 14:24:26 6 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000003' at position 16955153; GTID position '0-2-7485404'
2022-05-03 14:24:26 6 [ERROR] Slave I/O: error reconnecting to master 'repl@192.168.254.25:3306' - retry-time: 60  maximum-retries: 100000  message: Can't connect to server on '192.168.254.25' (111 "Connection refused"), Internal MariaDB error code: 2003
 
2022-05-03 14:24:28 7 [Warning] Timeout waiting for reply of binlog (file: mariadb-bin.000021, pos: 859), semi-sync up to file , position 0.
 
2022-05-03 14:24:28 7 [Note] Semi-sync replication switched OFF.
2022-05-03 14:25:26 6 [Note] Slave: connected to master 'repl@192.168.254.25:3306',replication resumed in log 'mariadb-bin.000003' at position 16955153
2022-05-03 14:25:29 6 [ERROR] Error reading packet from server: Lost connection to server during query (server_errno=2013)
2022-05-03 14:25:29 6 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000006' at position 5516244; GTID position '0-2-7496274'
2022-05-03 14:25:29 6 [ERROR] Slave I/O: error reconnecting to master 'repl@192.168.254.25:3306' - retry-time: 60  maximum-retries: 100000  message: Can't connect to server on '192.168.254.25' (111 "Connection refused"), Internal MariaDB error code: 2003
2022-05-03 14:25:48 7 [Note] Error reading relay log event: slave SQL thread was killed
2022-05-03 14:25:48 7 [Note] Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000006' at position 5516244; GTID position '0-2-7496274', master: 192.168.254.25:3306
2022-05-03 14:25:48 6 [Note] Slave I/O thread killed during or after a reconnect done to recover from failed read

Comment by Andrei Elkin [ 2022-05-10 ]

There's a way to reproduce a dup key error without MXS and it's still at analysis.

Comment by Angelique Sklavounos (Inactive) [ 2022-05-12 ]

Looks okay (barring any new changes needed from the review).

Generated at Thu Feb 08 10:00:55 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.