Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-13876

galera.MW-328A failed in buildbot with wrong result or timeout

Details

    Description

      http://buildbot.askmonty.org/buildbot/builders/kvm-deb-xenial-amd64/builds/2004

      galera.MW-328A 'xtradb'                  w1 [ fail ]
              Test ended at 2017-09-18 09:06:01
       
      CURRENT_TEST: galera.MW-328A
      --- /usr/share/mysql/mysql-test/suite/galera/r/MW-328A.result	2017-09-17 07:21:02.000000000 -0400
      +++ /dev/shm/var/1/log/MW-328A.reject	2017-09-18 09:06:01.106345871 -0400
      @@ -14,7 +14,7 @@
       have_successes
       1
       have_deadlocks
      -1
      +0
       Got one of the listed errors
       DROP PROCEDURE proc_update;
       DROP TABLE t1, t2;
       
      mysqltest: Result content mismatch
      

      It can also time out:
      http://buildbot.askmonty.org/buildbot/builders/kvm-deb-stretch-x86/builds/1429/steps/mtr/logs/stdio

      galera.MW-328A 'innodb'                  w1 [ fail ]  timeout after 900 seconds
              Test ended at 2017-10-26 03:20:30
       
      Test case timeout after 900 seconds
       
      == /dev/shm/var/1/log/MW-328A.log == 
      CREATE TABLE t1 (f1 INTEGER AUTO_INCREMENT PRIMARY KEY, f2 CHAR(20) DEFAULT 'abc') ENGINE=InnoDB;
      INSERT INTO t1 (f1) VALUES (1);
      CREATE TABLE t2 (f1 CHAR(20)) ENGINE=InnoDB;
      CREATE PROCEDURE proc_update ()
      BEGIN
      DECLARE CONTINUE HANDLER FOR SQLEXCEPTION BEGIN END;
      SET SESSION wsrep_sync_wait = 0;
      WHILE 1 DO
      UPDATE t1 SET f2 = LEFT(MD5(RAND()), 4);
      END WHILE;
      END|
      connect node_1X, 127.0.0.1, root, , test, $NODE_MYPORT_1;
      connection node_1X;
      CALL proc_update();;
      connection node_2;
      SET SESSION wsrep_retry_autocommit = 0;
      

      Attachments

        Issue Links

          Activity

            The above happens from time to time. When applying the included patch, which is unrelated and mainly decreases time that the LOCK_thd_kill mutex is hold, this happens much more frequently.
            diff

            monty Michael Widenius added a comment - The above happens from time to time. When applying the included patch, which is unrelated and mainly decreases time that the LOCK_thd_kill mutex is hold, this happens much more frequently. diff
            elenst Elena Stepanova added a comment - - edited

            Still fails in the similar fashion on the latest 10.1:

            worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
            galera.MW-328A 'xtradb'                  [ fail ]
                    Test ended at 2017-12-11 17:23:40
             
            CURRENT_TEST: galera.MW-328A
            --- /data/bld/10.1/mysql-test/suite/galera/r/MW-328A.result	2017-05-26 19:41:52.000000000 +0300
            +++ /data/bld/10.1/mysql-test/suite/galera/r/MW-328A.reject	2017-12-11 17:23:39.725358461 +0200
            @@ -12,7 +12,7 @@
             CALL proc_update();;
             SET SESSION wsrep_retry_autocommit = 0;
             have_successes
            -1
            +0
             have_deadlocks
             1
             Got one of the listed errors
             
            mysqltest: Result content mismatch
            

            elenst Elena Stepanova added a comment - - edited Still fails in the similar fashion on the latest 10.1: worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019 galera.MW-328A 'xtradb' [ fail ] Test ended at 2017-12-11 17:23:40   CURRENT_TEST: galera.MW-328A --- /data/bld/10.1/mysql-test/suite/galera/r/MW-328A.result 2017-05-26 19:41:52.000000000 +0300 +++ /data/bld/10.1/mysql-test/suite/galera/r/MW-328A.reject 2017-12-11 17:23:39.725358461 +0200 @@ -12,7 +12,7 @@ CALL proc_update();; SET SESSION wsrep_retry_autocommit = 0; have_successes -1 +0 have_deadlocks 1 Got one of the listed errors   mysqltest: Result content mismatch

            I made a new experimental version of patch which should fix the problems with the MW-328A test. This is a very difficult bug, because it is heavily tied to the speed of the computer and rarely reproduced in my environment. From my point of view, it caused by the memory write order inversion between the updates of mysys_var->current_mutex and mysys_var->current_cond. In the main branch of the code (the ::awake() method in the sql_class.cc file) there are additional checks that should protect us from problems caused by the zero current_mutex value while we have nonzero current_cond. However, this protection method is somewhat heuristic and requires a loop with sleep() call, which can lead to a sudden freezing of the server and hinders the work in real time. Also, we have many other places in the code, where broadcast made without additional checks. Also, the update of the mysys_var->abort variable (which is checked in the mysys routines) sometimes occurs before the capture of mutex, sometimes after. In the second case, we too may have problems with write order iversion / inconsistent memory read on some architectures. However, the nulling of mysys_var->current_mutex and mysys_var->current_cond is already protected with mutex. I propose to get rid of all potential pitfalls by adding protection across writing to these variables using the same mutex. In addition, I found a lot of places where there is no check for thd->killed flag, which can lead to loss of the broadcast and may cause server hangup for a long time. I tried everywhere to follow the test for thd->killed after entern_cond().

            My pull request for this patch is here: https://github.com/MariaDB/server/pull/520

            sysprg Julius Goryavsky added a comment - I made a new experimental version of patch which should fix the problems with the MW-328A test. This is a very difficult bug, because it is heavily tied to the speed of the computer and rarely reproduced in my environment. From my point of view, it caused by the memory write order inversion between the updates of mysys_var->current_mutex and mysys_var->current_cond. In the main branch of the code (the ::awake() method in the sql_class.cc file) there are additional checks that should protect us from problems caused by the zero current_mutex value while we have nonzero current_cond. However, this protection method is somewhat heuristic and requires a loop with sleep() call, which can lead to a sudden freezing of the server and hinders the work in real time. Also, we have many other places in the code, where broadcast made without additional checks. Also, the update of the mysys_var->abort variable (which is checked in the mysys routines) sometimes occurs before the capture of mutex, sometimes after. In the second case, we too may have problems with write order iversion / inconsistent memory read on some architectures. However, the nulling of mysys_var->current_mutex and mysys_var->current_cond is already protected with mutex. I propose to get rid of all potential pitfalls by adding protection across writing to these variables using the same mutex. In addition, I found a lot of places where there is no check for thd->killed flag, which can lead to loss of the broadcast and may cause server hangup for a long time. I tried everywhere to follow the test for thd->killed after entern_cond(). My pull request for this patch is here: https://github.com/MariaDB/server/pull/520
            seppo Seppo Jaakola added a comment -

            MW-328A.test shows sporadic failures also in the upstream MySQL version. The test has been fixed to be deterministic with commits tagged with MW-418, and this work should be merged to MariaDB side as well.

            The patch by Julius may need a separate mtr test for verification.

            seppo Seppo Jaakola added a comment - MW-328A.test shows sporadic failures also in the upstream MySQL version. The test has been fixed to be deterministic with commits tagged with MW-418, and this work should be merged to MariaDB side as well. The patch by Julius may need a separate mtr test for verification.

            People

              jplindst Jan Lindström (Inactive)
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.