[MDEV-13876] galera.MW-328A failed in buildbot with wrong result or timeout Created: 2017-09-22  Updated: 2018-09-14  Resolved: 2018-09-14

Status: Closed
Project: MariaDB Server
Component/s: Galera, Tests
Affects Version/s: 10.1, 10.2
Fix Version/s: 10.2.18, 10.3.10, 10.1.37

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: HTML File diff    
Issue Links:
Blocks
blocks MDEV-13549 Galera 3 test failures Closed
Duplicate
duplicates MDEV-14145 galera.MW-328B failed in buildbot wit... Closed
Relates
relates to MDEV-13549 Galera 3 test failures Closed

 Description   

http://buildbot.askmonty.org/buildbot/builders/kvm-deb-xenial-amd64/builds/2004

galera.MW-328A 'xtradb'                  w1 [ fail ]
        Test ended at 2017-09-18 09:06:01
 
CURRENT_TEST: galera.MW-328A
--- /usr/share/mysql/mysql-test/suite/galera/r/MW-328A.result	2017-09-17 07:21:02.000000000 -0400
+++ /dev/shm/var/1/log/MW-328A.reject	2017-09-18 09:06:01.106345871 -0400
@@ -14,7 +14,7 @@
 have_successes
 1
 have_deadlocks
-1
+0
 Got one of the listed errors
 DROP PROCEDURE proc_update;
 DROP TABLE t1, t2;
 
mysqltest: Result content mismatch

It can also time out:
http://buildbot.askmonty.org/buildbot/builders/kvm-deb-stretch-x86/builds/1429/steps/mtr/logs/stdio

galera.MW-328A 'innodb'                  w1 [ fail ]  timeout after 900 seconds
        Test ended at 2017-10-26 03:20:30
 
Test case timeout after 900 seconds
 
== /dev/shm/var/1/log/MW-328A.log == 
CREATE TABLE t1 (f1 INTEGER AUTO_INCREMENT PRIMARY KEY, f2 CHAR(20) DEFAULT 'abc') ENGINE=InnoDB;
INSERT INTO t1 (f1) VALUES (1);
CREATE TABLE t2 (f1 CHAR(20)) ENGINE=InnoDB;
CREATE PROCEDURE proc_update ()
BEGIN
DECLARE CONTINUE HANDLER FOR SQLEXCEPTION BEGIN END;
SET SESSION wsrep_sync_wait = 0;
WHILE 1 DO
UPDATE t1 SET f2 = LEFT(MD5(RAND()), 4);
END WHILE;
END|
connect node_1X, 127.0.0.1, root, , test, $NODE_MYPORT_1;
connection node_1X;
CALL proc_update();;
connection node_2;
SET SESSION wsrep_retry_autocommit = 0;



 Comments   
Comment by Michael Widenius [ 2017-12-06 ]

The above happens from time to time. When applying the included patch, which is unrelated and mainly decreases time that the LOCK_thd_kill mutex is hold, this happens much more frequently.
diff

Comment by Elena Stepanova [ 2017-12-11 ]

Still fails in the similar fashion on the latest 10.1:

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
galera.MW-328A 'xtradb'                  [ fail ]
        Test ended at 2017-12-11 17:23:40
 
CURRENT_TEST: galera.MW-328A
--- /data/bld/10.1/mysql-test/suite/galera/r/MW-328A.result	2017-05-26 19:41:52.000000000 +0300
+++ /data/bld/10.1/mysql-test/suite/galera/r/MW-328A.reject	2017-12-11 17:23:39.725358461 +0200
@@ -12,7 +12,7 @@
 CALL proc_update();;
 SET SESSION wsrep_retry_autocommit = 0;
 have_successes
-1
+0
 have_deadlocks
 1
 Got one of the listed errors
 
mysqltest: Result content mismatch

Comment by Julius Goryavsky [ 2017-12-20 ]

I made a new experimental version of patch which should fix the problems with the MW-328A test. This is a very difficult bug, because it is heavily tied to the speed of the computer and rarely reproduced in my environment. From my point of view, it caused by the memory write order inversion between the updates of mysys_var->current_mutex and mysys_var->current_cond. In the main branch of the code (the ::awake() method in the sql_class.cc file) there are additional checks that should protect us from problems caused by the zero current_mutex value while we have nonzero current_cond. However, this protection method is somewhat heuristic and requires a loop with sleep() call, which can lead to a sudden freezing of the server and hinders the work in real time. Also, we have many other places in the code, where broadcast made without additional checks. Also, the update of the mysys_var->abort variable (which is checked in the mysys routines) sometimes occurs before the capture of mutex, sometimes after. In the second case, we too may have problems with write order iversion / inconsistent memory read on some architectures. However, the nulling of mysys_var->current_mutex and mysys_var->current_cond is already protected with mutex. I propose to get rid of all potential pitfalls by adding protection across writing to these variables using the same mutex. In addition, I found a lot of places where there is no check for thd->killed flag, which can lead to loss of the broadcast and may cause server hangup for a long time. I tried everywhere to follow the test for thd->killed after entern_cond().

My pull request for this patch is here: https://github.com/MariaDB/server/pull/520

Comment by Seppo Jaakola [ 2017-12-20 ]

MW-328A.test shows sporadic failures also in the upstream MySQL version. The test has been fixed to be deterministic with commits tagged with MW-418, and this work should be merged to MariaDB side as well.

The patch by Julius may need a separate mtr test for verification.

Generated at Thu Feb 08 08:09:03 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.