[MDEV-31339] Applier got stuck for minutes Created: 2023-05-25  Updated: 2023-05-25  Resolved: 2023-05-25

Status: Closed
Project: MariaDB Server
Component/s: Galera, Storage Engine - InnoDB
Affects Version/s: 10.6
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Jan Lindström Assignee: Seppo Jaakola
Resolution: Not a Bug Votes: 0
Labels: None

Attachments: File bf_stuck.diff    

 Description   

With attached change and test case it seems that node_2 gets unresponsive when node_1 is executing DROP TABLE.

jan@jan-HP-ZBook-15u-G5:~/work/mariadb/10.6/mysql-test$ ./mtr galera.galera_bf_kill_newtrx --rr --mysqld=--wsrep-debug=1
Logging: ./mtr  galera.galera_bf_kill_newtrx --rr --mysqld=--wsrep-debug=1
VS config: 
vardir: /home/jan/work/mariadb/10.6/mysql-test/var
Checking leftover processes...
 - found old pid 35550 in 'mysqld.2.pid', killing it...
   process did not exist!
Removing old var directory...
Creating var directory '/home/jan/work/mariadb/10.6/mysql-test/var'...
Checking supported features...
MariaDB Version 10.6.14-MariaDB-debug
 - SSL connections supported
 - binaries are debug compiled
 - binaries built with wsrep patch
Collecting tests...
Installing system database...
 
==============================================================================
 
TEST                                      RESULT   TIME (ms) or COMMENT
--------------------------------------------------------------------------
 
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
connection node_2;
connection node_1;
CREATE TABLE t1(a int not null primary key, b int,c int,d int,e int) engine=innodb;
INSERT INTO t1 VALUES (1,1,0,0,0),(2,2,0,0,0),(3,3,0,0,0),(4,4,0,0,0),(5,5,0,0,0),(6,6,0,0,0),(7,7,0,0,0);
connection node_2;
BEGIN;
UPDATE t1 SET b = 100, c = 300 where a = 5;
connect node_2b, 127.0.0.1, root, , test, $NODE_MYPORT_2;
connection node_2b;
set session wsrep_sync_wait=0;
SET GLOBAL debug_dbug = 'd,sync.wsrep_before_thd_abort';
connection node_1;
SELECT * FROM t1;
a	b	c	d	e
1	1	0	0	0
2	2	0	0	0
3	3	0	0	0
4	4	0	0	0
5	5	0	0	0
6	6	0	0	0
7	7	0	0	0
UPDATE t1 SET b = 200, d=200 where a = 5;
connection node_2b;
SET DEBUG_SYNC='now WAIT_FOR sync.wsrep_before_thd_abort_reached';
connect node_2c, 127.0.0.1, root, , test, $NODE_MYPORT_2;
connection node_2c;
SET SESSION wsrep_sync_wait=0;
UPDATE t1 SET b = 400, e = 100 where a = 5;;
connection node_2b;
SELECT SLEEP(5);
SLEEP(5)
0
SET DEBUG_SYNC='now SIGNAL signal.wsrep_before_thd_abort_continue';
SET DEBUG_SYNC='RESET';
SET GLOBAL debug_dbug = '';
connection node_2c;
COMMIT;
connection node_1;
disconnect node_2b;
disconnect node_2c;
connection node_2;
COMMIT;
ERROR 40001: Deadlock found when trying to get lock; try restarting transaction
connection node_1;
COMMIT;
SELECT COUNT(*) AS EXCEPT_1 FROM t1 where (a = 5 and b = 400 and d = 200 and e = 100) OR
(a = 5 and b = 200 and d = 200 and e = 0);
EXCEPT_1
1
DROP TABLE t1;
galera.galera_bf_kill_newtrx 'innodb'    [ pass ]  52134
 
MTR's internal check of the test case 'galera.galera_bf_kill_newtrx' failed.
This means that the test case does not preserve the state that existed
before the test case was executed.  Most likely the test case did not
do a proper clean-up. It could also be caused by the previous test run
by this thread, if the server wasn't restarted.
This is the diff of the states of the servers before and after the
test case was executed:
mysqltest: Logging to '/home/jan/work/mariadb/10.6/mysql-test/var/tmp/check-mysqld_2.log'.
mysqltest: Results saved in '/home/jan/work/mariadb/10.6/mysql-test/var/tmp/check-mysqld_2.result'.
mysqltest: Connecting to server localhost:16001 (socket /home/jan/work/mariadb/10.6/mysql-test/var/tmp/mysqld.2.sock) as 'root', connection 'default', attempt 0 ...
mysqltest: ... Connected.
mysqltest: Start processing test commands from './include/check-testcase.test' ...
mysqltest: At line 19: query 'let $tmp= `SELECT '$tmp' = 'No such row'`' failed: ER_LOCK_WAIT_TIMEOUT (1205): Lock wait timeout exceeded; try restarting transaction
not ok
 
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 52.134 of 314 seconds executing testcases
 
Check of testcase failed for: galera.galera_bf_kill_newtrx



 Comments   
Comment by Jan Lindström [ 2023-05-25 ]
  • branch 10.6-bf-stuck-problem commit 19313490e
  • how to reproduce : ./mtr galera.galera_bf_kill_newtrx -rr --mysqld=-wsrep-debug=1
  • In my laptop it reproduces every time (if not add --repeat=n)
Comment by Marko Mäkelä [ 2023-05-25 ]

I can reproduce this after applying bf_stuck.diff to a recent 10.6 branch. It would appear to hang during the final DROP TABLE statement:

./mtr galera.galera_bf_kill_newtrx --rr --mysqld=--wsrep-debug=1
# wait about 1 minute
killall -ABRT mariadbd

10.6 7737f15f8740ab994d704200a58bf62aa2f3c875 with patch

EXCEPT_1
1
DROP TABLE t1;
galera.galera_bf_kill_newtrx 'innodb'    [ fail ]

I suppose that the bulk of the 52134-millisecond execution time in the Description can be attributed to the default innodb_lock_wait_timeout=50 (seconds).

Comment by Marko Mäkelä [ 2023-05-25 ]

I may have misunderstood something, but it seems to me that the test is setting and clearing a signal without ensuring that the signal was received by the blocked thread (connection node_2c) in between. With the following change to the test, the test passes for me in 7 seconds, with or without rr:

@@ -55,7 +55,6 @@
 --connection node_2b
 SELECT SLEEP(5);
 SET DEBUG_SYNC='now SIGNAL signal.wsrep_before_thd_abort_continue';
-SET DEBUG_SYNC='RESET';
 SET GLOBAL debug_dbug = '';
 
 --connection node_2c

Comment by Seppo Jaakola [ 2023-05-25 ]

It looks like the BF abort is partial here, victim is rolled back in server side, but victim's locks in innodb are not released. Note that rollback for the victim is carried out by background rollbacker thread, in this scenario

Comment by Jan Lindström [ 2023-05-25 ]

Test case was not exactly correct and that left BF thread stuck on DBUG_EXECUTE_IF, after adding wait_condition between BF release signal and RESET no problems seen anymore.

Generated at Thu Feb 08 10:23:05 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.