MariaDB Server / MDEV-28725

Sequence of Galera tests fails on Debian 11 (uring builds) with timeout or long semaphore wait

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 10.8(EOL)
    • Fix Version/s: 10.7.5, 10.6.9, 10.8.4, 10.9.2
    • Component/s: Galera, Tests
    • Labels: None
    • Environment: Debian 11, build with liburing

    Description

      The sequence of tests below fails fairly reliably (although not 100% deterministically) on the last test, either with a timeout or with innodb_fatal_semaphore_wait_threshold being exceeded. It may not be the shortest sequence, but I couldn't reduce it any further in a reasonable time.
      In CI and on the CI VMs, it only happens on Debian 11, on both x86_64 and aarch64. It is also reproducible on other Debian 11 machines.
      It appears to be related to the use of liburing: I cannot reproduce it on a build without liburing, but it happens reliably on a build with liburing that otherwise uses the same build options and the same 10.8 revision.
      Reproducible both in shm and on disk.

      Normally the last test takes only a few seconds, so you don't have to wait very long to know that you have hit the problem.

      10.8 0e0a3580

      perl mysql-test-run.pl --noreorder  galera.galera_strict_require_innodb galera.galera_strict_require_primary_key galera.galera_suspend_slave galera.galera_toi_alter_auto_increment galera.galera_toi_ddl_sequential galera.galera_toi_drop_database galera.galera_toi_ftwrl galera.galera_toi_lock_exclusive galera.galera_toi_lock_shared galera.galera_transaction_read_only galera.galera_truncate galera.galera_var_auto_inc_control_off galera.galera_var_certify_nonPK_off galera.galera_var_desync_on --testcase-timeout=2
       
      galera.galera_var_desync_on 'innodb'     [ fail ]  timeout after 120 seconds
              Test ended at 2022-06-02 01:29:22
       
      Test case timeout after 120 seconds
       
      == /mnt8t/bld/10.8-uring/mysql-test/var/log/galera_var_desync_on.log == 
      INSERT INTO t1 VALUES (9);
      INSERT INTO t1 VALUES (10);
      connection node_2;
      SET SESSION wsrep_sync_wait = 0;
      SELECT COUNT(*) = 1 FROM t1;
      COUNT(*) = 1
      1
      UNLOCK TABLES;
      SET SESSION wsrep_sync_wait = 1;
      SELECT COUNT(*) = 10 FROM t1;
      COUNT(*) = 10
      1
      connection node_1;
      INSERT INTO t1 VALUES (11);
      connection node_2;
      SELECT COUNT(*) = 11 FROM t1;
      COUNT(*) = 11
      1
      CALL mtr.add_suppression("Protocol violation");
      DROP TABLE t1;
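
Because the failure is not fully deterministic, it can help to repeat the whole sequence until it fails. The following is a minimal sketch, not part of the original report: the helper names `build_mtr_command` and `run_until_failure` are hypothetical, and it assumes that mysql-test-run.pl is started from the mysql-test directory of the build and exits with a non-zero status when a test fails.

```python
import subprocess

# Test sequence from the report, in the reported order (run with --noreorder).
TESTS = [
    "galera.galera_strict_require_innodb",
    "galera.galera_strict_require_primary_key",
    "galera.galera_suspend_slave",
    "galera.galera_toi_alter_auto_increment",
    "galera.galera_toi_ddl_sequential",
    "galera.galera_toi_drop_database",
    "galera.galera_toi_ftwrl",
    "galera.galera_toi_lock_exclusive",
    "galera.galera_toi_lock_shared",
    "galera.galera_transaction_read_only",
    "galera.galera_truncate",
    "galera.galera_var_auto_inc_control_off",
    "galera.galera_var_certify_nonPK_off",
    "galera.galera_var_desync_on",
]


def build_mtr_command(tests=TESTS, timeout_minutes=2):
    """Build the argument list for the command line shown in the report."""
    return [
        "perl",
        "mysql-test-run.pl",
        "--noreorder",
        *tests,
        f"--testcase-timeout={timeout_minutes}",
    ]


def run_until_failure(cmd, max_runs=5, runner=subprocess.run):
    """Run cmd up to max_runs times.

    Returns the 1-based number of the first run with a non-zero exit
    status, or None if every run succeeded. The runner parameter exists
    only so that the loop can be exercised without a real build.
    """
    for attempt in range(1, max_runs + 1):
        result = runner(cmd)
        if result.returncode != 0:
            return attempt
    return None
```

Calling `run_until_failure(build_mtr_command())` from the mysql-test directory would then stop at the first failing run. With `--testcase-timeout=2`, a hang shows up after 120 seconds, as in the log above.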
      

          Activity

            marko Marko Mäkelä added a comment - Is this report a duplicate of MDEV-28665? I have now merged the fix to 10.7 and 10.8.

            marko Marko Mäkelä added a comment -

            For the 10.8 branch, the first recent failure of this kind was
            http://buildbot.askmonty.org/buildbot/builders/kvm-rpm-fedora36-amd64/builds/27/steps/mtr-galera/logs/stdio
            Among the stack traces, we can observe that no thread is executing aio_uring::thread_routine. That is a sure sign of MDEV-28665.

            Between that and the merge of the MDEV-28665 fix, every single mtr-galera run of 10.8 failed, either with an InnoDB hang with stack trace output (always missing aio_uring::thread_routine) or, twice, without an InnoDB crash but with the following:

            CURRENT_TEST: galera.galera_var_auto_inc_control_off
            mysqltest: At line 56: query 'SHOW CREATE TABLE t1' failed: ER_LOCK_WAIT_TIMEOUT (1205): Lock wait timeout exceeded; try restarting transaction

            With the MDEV-28665 fix merged, all tests passed:
            http://buildbot.askmonty.org/buildbot/builders/kvm-rpm-fedora36-amd64/builds/150/steps/mtr-galera/logs/stdio

            10.8 600751e7693dcf6236d3e6b64fa24d19fd57f088

            The servers were restarted 122 times
            Spent 1337.407 of 1457 seconds executing testcases

            Completed: All 350 tests were successful.

            149 tests were skipped, 53 by the test itself.

            I think that we can conclude that this report is a duplicate of MDEV-28665.
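
The check described in the comment (no thread executing aio_uring::thread_routine) can be applied mechanically to a saved stack trace dump. This is only a sketch under assumptions: the function names are hypothetical, and it assumes gdb-style "Thread N (...)" headers between the per-thread traces, which may not match the exact format of the buildbot logs.

```python
import re

# Symbol named in the comment; its absence from every thread is the indicator.
URING_FRAME = "aio_uring::thread_routine"

# Assumption: each per-thread trace starts with a gdb-style "Thread N ..." line.
THREAD_HEADER = re.compile(r"^Thread \d+ ", re.MULTILINE)


def threads_running_uring(trace_text):
    """Count the thread sections whose trace mentions URING_FRAME."""
    starts = [m.start() for m in THREAD_HEADER.finditer(trace_text)]
    if not starts:
        # No recognizable headers: treat the whole text as one section.
        sections = [trace_text]
    else:
        ends = starts[1:] + [len(trace_text)]
        sections = [trace_text[s:e] for s, e in zip(starts, ends)]
    return sum(URING_FRAME in section for section in sections)


def uring_thread_missing(trace_text):
    """True if the dump is non-empty and no thread is in URING_FRAME."""
    return bool(trace_text.strip()) and threads_running_uring(trace_text) == 0
```

A `True` result from `uring_thread_missing` only means that the symbol was not found in the dump. Whether a given hang really is MDEV-28665 still needs the kind of manual review described in the comment.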

            People

              Assignee: marko Marko Mäkelä
              Reporter: elenst Elena Stepanova
              Votes: 0
              Watchers: 3
