Details
Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
N/A
Environment: GNU/Linux with mmap() based redo log on /dev/shm
Description
When MDEV-27774 replaced log_sys.mutex with log_sys.latch, it introduced a race condition in mtr_t::do_write():
  if (!ex)
  {
    log_sys.latch.rd_unlock();
    log_sys.latch.wr_lock(SRW_LOCK_CALL);
    if (UNIV_LIKELY(!m_user_space->max_lsn))
      name_write();
    std::pair<lsn_t,mtr_t::page_flush_ahead> p{finish_write(len, true)};
    log_sys.latch.wr_unlock();
    log_sys.latch.rd_lock(SRW_LOCK_CALL);
    return p;
  }
It is not safe to release the exclusive log_sys.latch between finish_write() and ReleaseBlocks. Because we have no portable operation that would downgrade the latch from exclusive to shared, we must retain that exclusive latch until the end of the critical section in mtr_t::commit().
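The latch scope described above can be illustrated with a minimal sketch. It assumes std::shared_mutex as a stand-in for log_sys.latch (it likewise has no exclusive-to-shared downgrade), and every name in it (toy_log, commit_exclusive, toy_checkpoint, run_toy) is hypothetical. It is a toy model, not the actual fix: the exclusive latch is simply held from the log write until the block is in the flush list.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

using lsn_t = std::uint64_t;

// Hypothetical toy model; these are not InnoDB types.
struct toy_log
{
  std::shared_mutex latch;        // stands in for log_sys.latch
  lsn_t lsn = 1;                  // end of the log
  lsn_t last_checkpoint_lsn = 0;
  std::vector<lsn_t> flush_list;  // start LSNs of dirty blocks (toy: guarded by latch)
  bool consistent = true;         // cleared if a too-old block is ever seen
};

// Keep the exclusive latch across both steps of the critical section.
void commit_exclusive(toy_log &log, lsn_t len)
{
  std::unique_lock<std::shared_mutex> ex(log.latch);
  const lsn_t start_lsn = log.lsn;      // plays the role of finish_write()
  log.lsn += len;
  log.flush_list.push_back(start_lsn);  // plays the role of ReleaseBlocks
}                                       // latch is released only here

// Toy checkpoint: verify the flush list, "flush" it, advance the checkpoint.
void toy_checkpoint(toy_log &log)
{
  std::unique_lock<std::shared_mutex> ex(log.latch);
  for (lsn_t l : log.flush_list)
    if (l < log.last_checkpoint_lsn)
      log.consistent = false;
  log.flush_list.clear();               // pretend all blocks were written out
  log.last_checkpoint_lsn = log.lsn;
}

// Run several committing threads against one checkpointing thread.
bool run_toy(unsigned writers, unsigned rounds)
{
  toy_log log;
  std::vector<std::thread> threads;
  for (unsigned i = 0; i < writers; i++)
    threads.emplace_back([&log, rounds] {
      for (unsigned j = 0; j < rounds; j++)
        commit_exclusive(log, 10);
    });
  threads.emplace_back([&log, rounds] {
    for (unsigned j = 0; j < rounds; j++)
      toy_checkpoint(log);
  });
  for (auto &t : threads)
    t.join();
  return log.consistent;
}
```

If commit_exclusive() released the latch between updating lsn and the push_back(), toy_checkpoint() could run in that gap and later observe a block older than last_checkpoint_lsn.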
I debugged an rr replay trace of this:
  ssh pluto
  rr replay /data/results/1647008467/TBR-1420/dev/shm/rqg/1647008467/53/1/rr/latest-trace
  continue
  watch -l log_sys.last_checkpoint_lsn.m._M_i
  watch -l buf_pool.flush_list.count
  reverse-continue
  reverse-continue
  reverse-continue
  thread apply 24 backtrace
At the end of the trace, we have Thread 3 hitting an assertion failure:
mysqld: /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/buf/buf0flu.cc:1877: bool log_checkpoint_low(lsn_t, lsn_t): Assertion `oldest_lsn > log_sys.last_checkpoint_lsn' failed.
Before that, we had Thread 24 inserting an unexpectedly old block into buf_pool.flush_list, and before that, Thread 3 updating the checkpoint LSN to a value that was too new.
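The order of events in the trace can be replayed deterministically in a single-threaded toy model. This is only a sketch: toy_state, toy_oldest_lsn, toy_checkpoint_low, replay_race and replay_fixed are hypothetical names, the LSN values are made up, and the model assumes that a checkpoint uses the end of the log as its target when the flush list is empty.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using lsn_t = std::uint64_t;

// Hypothetical toy state; not InnoDB data structures.
struct toy_state
{
  lsn_t lsn = 100;                // current end of the log (made-up value)
  lsn_t last_checkpoint_lsn = 0;
  std::vector<lsn_t> flush_list;  // oldest-modification LSNs of dirty blocks
};

// Checkpoint target: the oldest dirty block, or (model assumption)
// the end of the log when nothing is dirty.
lsn_t toy_oldest_lsn(const toy_state &s)
{
  if (s.flush_list.empty())
    return s.lsn;
  return *std::min_element(s.flush_list.begin(), s.flush_list.end());
}

// Returns false where the real code would hit the assertion
// oldest_lsn > log_sys.last_checkpoint_lsn.
bool toy_checkpoint_low(toy_state &s)
{
  const lsn_t oldest = toy_oldest_lsn(s);
  if (oldest == s.last_checkpoint_lsn)
    return true;                  // nothing to do
  if (oldest < s.last_checkpoint_lsn)
    return false;                 // an unexpectedly old block was found
  s.last_checkpoint_lsn = oldest;
  return true;
}

// Racy order: log write, checkpoint in the gap, late flush-list insert.
bool replay_race()
{
  toy_state s;
  const lsn_t start = s.lsn;      // writer: log written, latch released
  s.lsn += 10;
  const bool first = toy_checkpoint_low(s);   // checkpoint moves to 110
  s.flush_list.push_back(start);  // writer: block with LSN 100 arrives late
  const bool second = toy_checkpoint_low(s);  // 100 < 110: failure
  return first && second;
}

// Safe order: the insert happens before any checkpoint can run.
bool replay_fixed()
{
  toy_state s;
  const lsn_t start = s.lsn;
  s.lsn += 10;
  s.flush_list.push_back(start);
  const bool first = toy_checkpoint_low(s);   // checkpoint moves to 100
  return first && toy_checkpoint_low(s);      // nothing to do
}
```

In this model replay_race() returns false, mirroring the assertion failure in Thread 3, while replay_fixed() returns true.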
Issue Links
- is caused by MDEV-27774 Reduce scalability bottlenecks in mtr_t::commit() (Closed)