Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-32096

Parallel replication lags because innobase_kill_query() may fail to interrupt a lock wait

    XMLWordPrintable

Details

    Description

      MDEV-24671 introduced a race condition in the function innobase_kill_query(), which is responsible for interrupting a lock wait for the target of a KILL QUERY or KILL CONNECTION statement.

      This can severely affect optimistic (and aggressive) parallel replication. If the race is triggered, conflicts are not resolved correctly and parallel replication will be blocked until --innodb-lock-wait-timeout. This will be seen in SHOW PROCESSLIST as one worker being in the "killed" state and some other worker stuck in a query.

      A user reported a hang of parallel replication due to this, and knielsen spotted the data race: If the target transaction starts a lock wait roughly at the same time as innobase_kill_query() is invoked, then trx->lock.wait_lock could be read as nullptr and the lock wait would not be interrupted. Therefore, we need to acquire lock_sys.wait_mutex before checking if a lock wait needs to be aborted.

      Attached mdev32096_testcase.patch is an (ugly) ./mtr testcase that triggers the problem.

      Attachments

        Issue Links

          Activity

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.