[MDEV-28043] Race condition between mtr_t::commit() and checkpoint
Created: 2022-03-11  Updated: 2022-03-17  Resolved: 2022-03-15

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: N/A
Fix Version/s: 10.9.0, 10.8.3

Type: Bug
Priority: Blocker
Reporter: Marko Mäkelä
Assignee: Marko Mäkelä
Resolution: Fixed
Votes: 0
Labels: corruption, race, recovery, rr-profile-analyzed
Environment:

GNU/Linux with mmap() based redo log on /dev/shm


Issue Links:
Problem/Incident
is caused by MDEV-27774 Reduce scalability bottlenecks in mtr... Closed

 Description   

When MDEV-27774 replaced log_sys.mutex with log_sys.latch, it introduced a race condition in mtr_t::do_write():

    if (!ex)
    {
      log_sys.latch.rd_unlock();
      log_sys.latch.wr_lock(SRW_LOCK_CALL);
      if (UNIV_LIKELY(!m_user_space->max_lsn))
        name_write();
      std::pair<lsn_t,mtr_t::page_flush_ahead> p{finish_write(len, true)};
      log_sys.latch.wr_unlock();
      log_sys.latch.rd_lock(SRW_LOCK_CALL);
      return p;
    }

It is not safe to release the exclusive log_sys.latch between finish_write() and ReleaseBlocks. Because we have no portable operation that would downgrade the latch from exclusive to shared, we must retain that exclusive latch until the end of the critical section in mtr_t::commit().
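The window can be illustrated with a small single-threaded model that replays the interleaving step by step. This is only a sketch under stated assumptions, not InnoDB code: ToyLatch, ToyLog, try_checkpoint(), commit() and replay() are hypothetical names, the latch merely tracks its state instead of blocking, and the shared latch that the real mtr_t::commit() holds on entry is left out. The callback stands for another thread that attempts a checkpoint at the point between finish_write() and the insertion into buf_pool.flush_list.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <set>

using lsn_t = std::uint64_t;

// Hypothetical toy latch: tracks state only, so that one thread can replay
// an interleaving deterministically. A real latch would block instead.
struct ToyLatch {
  bool exclusive = false;
  int shared = 0;
  bool try_wr_lock() {
    if (exclusive || shared) return false;
    exclusive = true;
    return true;
  }
  void wr_unlock() { exclusive = false; }
  bool try_rd_lock() {
    if (exclusive) return false;
    ++shared;
    return true;
  }
  void rd_unlock() { --shared; }
};

// Hypothetical stand-in for log_sys plus buf_pool.flush_list.
struct ToyLog {
  ToyLatch latch;                   // plays the role of log_sys.latch
  lsn_t lsn = 100;                  // current end of the log
  lsn_t last_checkpoint_lsn = 50;
  std::multiset<lsn_t> flush_list;  // oldest modification LSN per dirty page

  // Checkpoint attempt by "another thread". It needs the exclusive latch and
  // gives up if the latch is busy (a real thread would wait instead).
  bool try_checkpoint() {
    if (!latch.try_wr_lock()) return false;
    // With no dirty pages, the checkpoint may advance to the end of the log.
    last_checkpoint_lsn = flush_list.empty() ? lsn : *flush_list.begin();
    latch.wr_unlock();
    return true;
  }

  // Commit a mini-transaction of len bytes. Returns true if the oldest dirty
  // page is still at or after the checkpoint LSN once the commit finishes.
  bool commit(lsn_t len, bool keep_exclusive,
              const std::function<void()>& other_thread) {
    const bool got_wr = latch.try_wr_lock();
    assert(got_wr);
    (void) got_wr;
    const lsn_t start_lsn = lsn;
    lsn += len;                     // corresponds to finish_write()
    if (keep_exclusive) {
      other_thread();               // latch still held: checkpoint is kept out
    } else {
      latch.wr_unlock();
      other_thread();               // the window: nobody holds the latch
      const bool got_rd = latch.try_rd_lock();
      assert(got_rd);
      (void) got_rd;
    }
    flush_list.insert(start_lsn);   // corresponds to ReleaseBlocks
    if (keep_exclusive)
      latch.wr_unlock();
    else
      latch.rd_unlock();
    return *flush_list.begin() >= last_checkpoint_lsn;
  }
};

// Replays the interleaving in which a checkpoint is attempted right after the
// log write. Returns whether the flush list stayed consistent.
bool replay(bool keep_exclusive) {
  ToyLog log;
  return log.commit(10, keep_exclusive, [&log] { log.try_checkpoint(); });
}
```

In this model, replay(false) returns false: the checkpoint sees an empty flush list, advances to the new end of the log, and the page is then inserted with an older LSN. replay(true) returns true, because the checkpoint attempt cannot acquire the latch until the page is already in the flush list.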

I debugged an rr replay trace of this:

ssh pluto
rr replay /data/results/1647008467/TBR-1420/dev/shm/rqg/1647008467/53/1/rr/latest-trace

continue
watch -l log_sys.last_checkpoint_lsn.m._M_i
watch -l buf_pool.flush_list.count
reverse-continue
reverse-continue
reverse-continue
thread apply 24 backtrace

Working backwards from the end of the trace, we have Thread 3 hitting an assertion failure:

mysqld: /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/buf/buf0flu.cc:1877: bool log_checkpoint_low(lsn_t, lsn_t): Assertion `oldest_lsn > log_sys.last_checkpoint_lsn' failed.

Before that, Thread 24 inserted an unexpectedly old block into buf_pool.flush_list, and before that, Thread 3 advanced the checkpoint LSN to a value that was too new.



 Comments   
Comment by Matthias Leich [ 2022-03-14 ]

origin/bb-10.8-MDEV-28043 d8dd388f5b549000fcd2af0b576bb24154914368 2022-03-14T14:26:09+02:00
performed well in RQG testing.
