[MDEV-31177] SHOW SLAVE STATUS Last_SQL_Errno Race Condition on Errored Slave Restart Created: 2023-05-03  Updated: 2023-11-01  Resolved: 2023-09-13

Status: Closed
Project: MariaDB Server
Component/s: Replication, Tests
Affects Version/s: 10.5, 10.6, 10.8, 10.9, 10.10, 10.11, 11.0
Fix Version/s: 10.4.32, 10.5.23, 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3

Type: Bug Priority: Critical
Reporter: Angelique Sklavounos (Inactive) Assignee: Brandon Nesterenko
Resolution: Fixed Votes: 0
Labels: None

Attachments: File mysqld.1.err.17     File mysqld.2.err.17    
Issue Links:
Problem/Incident
causes MDEV-32651 Lost Debug_sync signal in rpl_sql_thd... Closed

 Description   

From knielsen:

There is a small window between when the SQL thread reports itself running, and when it clears any error set by a previous error stop. Thus a race exists where include/rpl_end.inc can see the previous error being set, causing it to fail the test.

This can be observed by the following test failure with rpl.rpl_xa_prepare_gtid_fail:

https://buildbot.mariadb.org/#/builders/208/builds/12453

10.6 fe89df426

rpl.rpl_xa_prepare_gtid_fail 'innodb,row' w17 [ fail ]
        Test ended at 2023-05-02 10:11:23
 
CURRENT_TEST: rpl.rpl_xa_prepare_gtid_fail
mysqltest: In included file "./include/check_slave_param.inc": 
included from ./include/check_slave_no_error.inc at line 24:
included from ./include/rpl_end.inc at line 74:
included from /home/buildbot/amd64-ubuntu-1804-bigtest/build/mysql-test/suite/rpl/t/rpl_xa_prepare_gtid_fail.test at line 106:
At line 31: Wrong value for slave parameter Last_SQL_Errno
 
The result from queries just before the failure was:
< snip >
master-bin.000001	590	Table_map	1	636	table_id: 32 (test.t1)
master-bin.000001	636	Write_rows_v1	1	678	table_id: 32 flags: STMT_END_F
master-bin.000001	678	Xid	1	709	COMMIT /* xid=154 */
master-bin.000001	709	Gtid	1	754	XA START X'31',X'',1 GTID 0-1-100
master-bin.000001	754	Annotate_rows	1	807	update t1 set b=b+10 where a=1
master-bin.000001	807	Table_map	1	853	table_id: 32 (test.t1)
master-bin.000001	853	Update_rows_v1	1	905	table_id: 32 flags: STMT_END_F
master-bin.000001	905	Query	1	990	XA END X'31',X'',1
master-bin.000001	990	XA_prepare	1	1027	XA PREPARE X'31',X'',1
master-bin.000001	1027	Gtid	1	1070	GTID 0-1-101
master-bin.000001	1070	Query	1	1158	XA COMMIT X'31',X'',1
master-bin.000001	1158	Gtid	1	1200	GTID 0-1-102
master-bin.000001	1200	Query	1	1317	use `test`; DROP TABLE `t1` /* generated by server */
 
**** SHOW RELAYLOG EVENTS on server_1 ****
relaylog_name = 'No such row'
SHOW RELAYLOG EVENTS IN 'No such row';
Log_name	Pos	Event_type	Server_id	End_log_pos	Info
connection server_2;
Wrong value for Last_SQL_Errno. Expected '0', got '1942'

The slave error log shows an abnormal server shutdown with "initiated by: unknown":

2023-05-02 10:11:23 0 [Note] /home/buildbot/amd64-ubuntu-1804-bigtest/build/sql/mariadbd (initiated by: unknown): Normal shutdown
2023-05-02 10:11:23 24 [Note] Error reading relay log event: slave SQL thread was killed
2023-05-02 10:11:23 24 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 1317; GTID position '0-1-102', master: 127.0.0.1:16300
2023-05-02 10:11:23 23 [Note] Slave I/O thread killed during or after a reconnect done to recover from failed read
2023-05-02 10:11:23 23 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 1317; GTID position 0-1-102, master 127.0.0.1:16300
2023-05-02 10:11:23 0 [Note] InnoDB: FTS optimize thread exiting.
2023-05-02 10:11:23 0 [Note] InnoDB: Starting shutdown...



 Comments   
Comment by Brandon Nesterenko [ 2023-08-21 ]

Prioritized to critical because of the high frequency that this test failure is seen.

Comment by Kristian Nielsen [ 2023-08-24 ]

This appears to be a real (if small) bug in the code. There is a small window between when the SQL thread reports itself running, and when it clears any error set by a previous error stop. Thus a race exists where include/rpl_end.inc can see the previous error being set, causing it to fail the test.

The failure can be reproduced reliably with this small patch:

diff --git a/sql/slave.cc b/sql/slave.cc
index 08ac2e65779..9e9d9185ad1 100644
--- a/sql/slave.cc
+++ b/sql/slave.cc
@@ -5468,6 +5468,7 @@ pthread_handler_t handle_slave_sql(void *arg)
     now.
     But the master timestamp is reset by RESET SLAVE & CHANGE MASTER.
   */
+my_sleep(250000);
   rli->clear_error();
   rli->parallel.reset();
 

Comment by Brandon Nesterenko [ 2023-08-28 ]

Hey Andrei!

This is ready for review: PR-2741

Comment by Andrei Elkin [ 2023-08-29 ]

Review done on GH.

Comment by Brandon Nesterenko [ 2023-09-13 ]

Pushed into 10.4 as 1407f9996.

No merge conflicts or test failures observed by cherry-picking into 11.3.

Generated at Thu Feb 08 10:21:51 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.