MDEV-28043: Race condition between mtr_t::commit() and checkpoint

    Description

      When MDEV-27774 replaced log_sys.mutex with log_sys.latch, it introduced a race condition in mtr_t::do_write():

          if (!ex)
          {
            log_sys.latch.rd_unlock();
            log_sys.latch.wr_lock(SRW_LOCK_CALL);
            if (UNIV_LIKELY(!m_user_space->max_lsn))
              name_write();
            std::pair<lsn_t,mtr_t::page_flush_ahead> p{finish_write(len, true)};
            log_sys.latch.wr_unlock();
            log_sys.latch.rd_lock(SRW_LOCK_CALL);
            return p;
          }
      

      It is not safe to release the exclusive log_sys.latch between finish_write() and ReleaseBlocks. Because we have no portable operation that would downgrade the latch from exclusive to shared, we must retain that exclusive latch until the end of the critical section in mtr_t::commit().
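The locking rule above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the InnoDB code: `ToyRedo`, `commit()`, `flush_one()`, `checkpoint()` and `stress_holds_invariant()` are hypothetical names, and `std::shared_mutex` stands in for log_sys.latch. Like the real latch, `std::shared_mutex` offers no operation to downgrade from exclusive to shared, so the sketch simply keeps the exclusive lock from LSN assignment until the block is in the flush list.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <set>
#include <shared_mutex>
#include <thread>
#include <vector>

using lsn_t = std::uint64_t;

// Hypothetical stand-in for log_sys plus buf_pool.flush_list (toy model only).
struct ToyRedo {
  std::shared_mutex latch;          // plays the role of log_sys.latch
  lsn_t lsn = 1;                    // current end of the redo log
  lsn_t last_checkpoint_lsn = 0;    // latest checkpoint
  std::multiset<lsn_t> flush_list;  // start LSN of each dirty block
  bool invariant_broken = false;

  // Commit: the exclusive latch covers BOTH the LSN assignment and the
  // flush-list insertion, so a checkpoint cannot run in between.
  void commit(lsn_t len) {
    std::unique_lock<std::shared_mutex> ex(latch);
    lsn_t start = lsn;
    lsn += len;
    flush_list.insert(start);
  }  // the latch is released only here

  // A page write completed: drop the oldest dirty block, if any.
  void flush_one() {
    std::unique_lock<std::shared_mutex> ex(latch);
    if (!flush_list.empty()) flush_list.erase(flush_list.begin());
  }

  // Checkpoint: the oldest dirty LSN, or the log end if nothing is dirty.
  void checkpoint() {
    std::unique_lock<std::shared_mutex> ex(latch);
    lsn_t oldest = flush_list.empty() ? lsn : *flush_list.begin();
    if (oldest < last_checkpoint_lsn)
      invariant_broken = true;  // the checkpoint would have to move backwards
    else
      last_checkpoint_lsn = oldest;
  }
};

// Runs committers and a checkpointer concurrently; returns true if the
// checkpoint LSN never had to move backwards.
bool stress_holds_invariant() {
  ToyRedo redo;
  std::vector<std::thread> workers;
  for (int t = 0; t < 4; t++)
    workers.emplace_back([&redo] {
      for (int i = 0; i < 10000; i++) {
        redo.commit(8);
        redo.flush_one();
      }
    });
  std::thread checkpointer([&redo] {
    for (int i = 0; i < 10000; i++) redo.checkpoint();
  });
  for (auto &w : workers) w.join();
  checkpointer.join();
  return !redo.invariant_broken;
}
```

Calling `stress_holds_invariant()` is expected to return true in this model. Because every block enters the flush list before the latch is released, the checkpointer can never observe an empty or too-new flush list while an older block is still pending.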

      I debugged an rr replay trace of this:

      ssh pluto
      rr replay /data/results/1647008467/TBR-1420/dev/shm/rqg/1647008467/53/1/rr/latest-trace
      

      continue
      watch -l log_sys.last_checkpoint_lsn.m._M_i
      watch -l buf_pool.flush_list.count
      reverse-continue
      reverse-continue
      reverse-continue
      thread apply 24 backtrace
      

      Working backwards from the end of the trace, Thread 3 hits an assertion failure:

      mysqld: /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/buf/buf0flu.cc:1877: bool log_checkpoint_low(lsn_t, lsn_t): Assertion `oldest_lsn > log_sys.last_checkpoint_lsn' failed.
      

      Before that, Thread 24 inserted an unexpectedly old block into buf_pool.flush_list, and before that, Thread 3 advanced the checkpoint LSN to a value that was too new.
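The order of events in the trace can be replayed deterministically in a toy model. This is a simplified sketch, not the server code: `ToyLog`, `reserve()`, `checkpoint()`, `replay_race()` and `replay_fixed()` are hypothetical names, the LSN values are made up, and the model assumes that a checkpoint with an empty flush list uses the current end of the log as the oldest LSN.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

using lsn_t = std::uint64_t;

// Toy model with hypothetical names; not the InnoDB data structures.
struct ToyLog {
  lsn_t lsn = 100;                  // current end of the redo log
  lsn_t last_checkpoint_lsn = 50;   // most recent checkpoint
  std::multiset<lsn_t> flush_list;  // start LSN of each dirty block

  // Reserve len bytes of log and return the start LSN of the record.
  lsn_t reserve(lsn_t len) {
    lsn_t start = lsn;
    lsn += len;
    return start;
  }

  // Assumption: with no dirty blocks, the oldest needed LSN is the log end.
  lsn_t oldest_lsn() const {
    return flush_list.empty() ? lsn : *flush_list.begin();
  }

  // Returns false where the toy equivalent of the assertion
  // oldest_lsn > last_checkpoint_lsn would fail.
  bool checkpoint() {
    lsn_t oldest = oldest_lsn();
    if (oldest <= last_checkpoint_lsn) return false;
    last_checkpoint_lsn = oldest;
    return true;
  }
};

// Replays the interleaving from the trace; returns true if the second
// checkpoint sees an oldest LSN behind the checkpoint LSN.
bool replay_race() {
  ToyLog log;
  lsn_t start = log.reserve(20);  // "Thread 24": log written, latch released
  bool first = log.checkpoint();  // "Thread 3": list empty, checkpoint = 120
  log.flush_list.insert(start);   // "Thread 24": block with LSN 100 arrives late
  bool second = log.checkpoint(); // "Thread 3": oldest 100 <= checkpoint 120
  return first && !second;
}

// Same steps, but the insertion happens before any checkpoint can run.
bool replay_fixed() {
  ToyLog log;
  lsn_t start = log.reserve(20);
  log.flush_list.insert(start);   // still inside the same critical section
  return log.checkpoint() && log.last_checkpoint_lsn == start;
}
```

In this model `replay_race()` returns true, mirroring the failed assertion, while `replay_fixed()` returns true because the checkpoint stops at the start LSN of the pending block.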

          Activity

            mleich Matthias Leich added a comment - origin/bb-10.8-MDEV-28043 d8dd388f5b549000fcd2af0b576bb24154914368 2022-03-14T14:26:09+02:00 performed well in RQG testing.

            People

              marko Marko Mäkelä