MariaDB Server / MDEV-22543

Galera SST donation fails, FLUSH TABLES WITH READ LOCK times out

Details

    Description

      SST donation fails occasionally under heavy load due to FLUSH TABLES WITH READ LOCK timing out after one second of waiting:

      2020-05-13 10:18:44 139871772944128 [Note] WSREP: sst_donor_thread signaled with 0
      2020-05-13 10:18:45 139871226160896 [Note] WSREP: Flushing tables for SST...
      2020-05-13 10:18:46 139871226160896 [Warning] WSREP: Error executing 'FLUSH TABLES WITH READ LOCK': 1205 (Lock wait timeout exceeded; try restarting transaction)
      2020-05-13 10:18:46 139871226160896 [ERROR] WSREP: Failed to flush and lock tables
      

      The following MTR test demonstrates the issue: it issues an UPDATE on the donor node and stops its execution at a sync point after some MDL locks have been taken. When node_2 then tries to join via SST, the lock wait timeout occurs on the donor.

      # The test verifies that the FLUSH TABLES WITH READ LOCK does not
      # time out if it needs to wait for another MDL lock for short duration
      # during SST donation.
       
      --source include/galera_cluster.inc
       
      --let $node_1 = node_1
      --let $node_2 = node_2
      --source include/auto_increment_offset_save.inc
       
      --let $galera_connection_name = node_1_ctrl
      --let $galera_server_number = 1
      --source include/galera_connect.inc
       
      #
      # Run UPDATE on node_1 and make it block before table locks are taken.
      # This should block FTWRL.
      #
      --connection node_1
      CREATE TABLE t1 (f1 INT PRIMARY KEY, f2 INT);
      INSERT INTO t1 VALUES (1, 1);
      SET DEBUG_SYNC = "before_lock_tables_takes_lock SIGNAL sync_point_reached WAIT_FOR sync_point_continue";
      --send UPDATE t1 SET f2 = 2 WHERE f1 = 1
       
      --connection node_1_ctrl
      SET DEBUG_SYNC = "now WAIT_FOR sync_point_reached";
       
      #
      # Restart node_2, force SST.
      #
      --connection node_2
      --source include/shutdown_mysqld.inc
      --remove_file $MYSQLTEST_VARDIR/mysqld.2/data/grastate.dat
      # Restart without waiting. The UPDATE should block FTWRL on node_1,
      # so the SST cannot be completed and node_2 cannot join before
      # UPDATE connection is signalled to continue.
      --exec echo "restart:$start_mysqld_params" > $_expect_file_name
      # If the bug is present, FTWRL times out on node_1 in a couple of
      # seconds and node_2 fails to join.
      --sleep 10
       
      --connection node_1_ctrl
      SET DEBUG_SYNC = "now SIGNAL sync_point_continue";
       
      --connection node_1
      --reap
      SET DEBUG_SYNC = "RESET";
       
      --connection node_2
      --enable_reconnect
      --source include/wait_until_connected_again.inc
       
      --connection node_1
      DROP TABLE t1;
       
      --source include/auto_increment_offset_restore.inc
      

      The early timeout is apparently caused by the following check in MDL_context::acquire_lock:

          /* Check if the client is gone while we were waiting. */
          if (! thd_is_connected(m_owner->get_thd()))
          {
            /*
             * The client is disconnected. Don't wait forever:
             * assume it's the same as a wait timeout, this
             * ensures all error handling is correct.
             */
            wait_status= MDL_wait::TIMEOUT;
            break;
          }
      

      The call to thd_is_connected() always returns false for the SST donor THD, so if the lock wait lasts longer than one second, the wait is abandoned with a timeout.
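
      The effect described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the server code: it assumes the wait proceeds in one-second slices with the connection check after each slice, as the one-second timeout in the log suggests. The names wait_for_lock, WaitStatus, lock_free_after and timeout_seconds are hypothetical and do not exist in MariaDB.

```cpp
#include <cassert>
#include <functional>

// Simplified, hypothetical model of a sliced lock wait. None of these
// names come from the MariaDB source; they exist only for illustration.
enum class WaitStatus { Granted, Timeout };

// Waits for a lock in one-second slices, for at most timeout_seconds.
// lock_free_after: seconds until the conflicting lock is released.
// is_connected:    stands in for thd_is_connected() on the waiting THD.
WaitStatus wait_for_lock(int lock_free_after, int timeout_seconds,
                         const std::function<bool()>& is_connected)
{
  assert(timeout_seconds >= 0);
  for (int elapsed = 1; elapsed <= timeout_seconds; ++elapsed)
  {
    // One slice has passed. Was the lock released during it?
    if (lock_free_after <= elapsed)
      return WaitStatus::Granted;

    // Mirrors the check quoted above: a waiter that looks disconnected
    // is treated as timed out, regardless of the time still remaining.
    if (!is_connected())
      return WaitStatus::Timeout;
  }
  // The full timeout elapsed without the lock being granted.
  return WaitStatus::Timeout;
}
```

      In this model, a waiter whose is_connected callback always returns false gives up after the first slice whenever the conflicting lock is held for more than one second, while a connected waiter with the same (made-up) 50-second timeout would still obtain the lock after a 10-second wait such as the one in the test above.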

            People

              jplindst Jan Lindström (Inactive)
              teemu.ollakka Teemu Ollakka