[MDEV-24142] rw_lock_t has unnecessarily complex wait logic Created: 2020-11-05  Updated: 2023-09-01  Resolved: 2020-12-03

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.2, 10.3, 10.4, 10.5, 10.6
Fix Version/s: 10.6.0

Type: Bug Priority: Major
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: performance

Issue Links:
Blocks
blocks MDEV-14659 Innodb scalibility issue found in Mar... Closed
blocks MDEV-18976 Implement a CHECKSUM redo log record ... Closed
blocks MDEV-21452 Use condition variables and normal mu... Closed
is blocked by MDEV-24167 InnoDB unnecessarily uses complex rw-... Closed
is blocked by MDEV-24240 Pessimistic operations on SPATIAL IND... Open
Problem/Incident
causes MDEV-24642 Assertion `r->emplace(pthread_self())... Closed
causes MDEV-24643 Assertion `(l & (WRITER | UPDATER)) =... Closed
causes MDEV-24834 Assertion `mtr->memo_contains_flagged... Closed
causes MDEV-25512 Deadlock between sux_lock::u_x_upgrad... Closed
causes MDEV-29336 mysqld: storage/innobase/include/sux_... Closed
Relates
relates to MDEV-7109 Add support for INFORMATION_SCHEMA.IN... Closed
relates to MDEV-7399 Add support for INFORMATION_SCHEMA.IN... Closed
relates to MDEV-24410 Deadlock between freeing and allocati... Closed
relates to MDEV-24786 Assertion `!have_x()' failed in sux_l... Closed
relates to MDEV-25267 Reported latching order violation in ... Open
relates to MDEV-26467 Unnecessary compare-and-swap loop in ... Closed
relates to MDEV-27058 Buffer page descriptors are too large Closed
relates to MDEV-15798 Mutex leak on accessing INFORMATION_S... Closed
relates to MDEV-21330 Lock monitor doesn't print a semaphor... Closed
relates to MDEV-24630 MY_RELAX_CPU assembly instruction upg... Closed
relates to MDEV-25404 read-only performance regression in 10.6 Closed
relates to MDEV-32065 Always check whether lock is free at ... Open

 Description   

The InnoDB custom implementation of a read-write latch (rw_lock_t) encapsulates two os_event_t. Each os_event_t in turn encapsulates a mutex and a condition variable; a single mutex would suffice for both. There are also some data fields to control the waiting, although rw_lock_t::lock_word alone is sufficient for that.

As part of MDEV-21452, we would replace all os_event_t with condition variables. Simplifying the read-write latch implementation is a prerequisite for that.
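
The shape of this redundancy can be sketched roughly as follows (illustrative C++ only, not the actual InnoDB declarations; the field names are made up):

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

struct event_sketch {          // roughly what each os_event_t carries
  std::mutex mutex;            // its own private mutex
  std::condition_variable cond;
  bool is_set = false;         // redundant state: lock_word already encodes this
  int64_t signal_count = 0;    // generation counter to avoid missed wakeups
};

// rw_lock_t embeds two of these (one waited on by S/SX requests, one by X
// requests), i.e. two mutexes and two copies of redundant state, where one
// mutex shared by two condition variables would suffice.
struct rw_lock_sketch {
  std::atomic<int32_t> lock_word{0};
  event_sketch event;          // waited on by S and SX lock requests
  event_sketch wait_ex_event;  // waited on by X lock requests
};
```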



 Comments   
Comment by Marko Mäkelä [ 2020-11-06 ]

I suspect that the redundancy of os_event_t was masking some problems related to SX-latches. The use of volatile as poor man’s std::atomic might have worked only thanks to the redundant state in os_event_t. The test innodb.innodb_wl6326_big in CMAKE_BUILD_TYPE=RelWithDebInfo seems to be a good indicator for the problems on AMD64.

I think that more of rw_lock_t must be rewritten to make this work. I started working on folding 3 fields into one std::atomic<int64_t>. The changes are so large that extensive stress testing on multiple platforms (at least AMD64, ARMv8, POWER) will be needed.

Comment by Marko Mäkelä [ 2020-11-09 ]

It turns out that the test innodb.innodb_wl6326_big is not a good indicator for anything, because it reliably fails on my system for the RelWithDebInfo build even without these modifications.

I think that it would be cleanest to implement the logic with one std::atomic<uint32_t>, a mutex, two condition variables, and a shared field that identifies the X or SX latch holder. If we remove the use of recursive X and SX latches, it should be possible to allocate only 1 bit for each special mode, and 1 bit to indicate that a request is waiting.
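
A rough sketch of that encoding (my reading of the proposal, with assumed bit positions; the committed design may differ):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumed bit layout: 1 bit per special mode, 1 bit for "a request is
// waiting", and the remaining low bits count the shared (S) holders.
constexpr uint32_t WRITER  = 1U << 31;     // X latch held or being acquired
constexpr uint32_t UPDATER = 1U << 30;     // SX latch held
constexpr uint32_t WAITERS = 1U << 29;     // some thread is waiting
constexpr uint32_t S_MASK  = WAITERS - 1;  // low bits: shared-holder count

std::atomic<uint32_t> lock_word{0};

// An S latch attempt only conflicts with X; SX remains compatible with S.
inline bool s_try_lock()
{
  uint32_t l = lock_word.load(std::memory_order_relaxed);
  do
  {
    if (l & WRITER)
      return false;  // X held (or being waited for)
  }
  while (!lock_word.compare_exchange_weak(l, l + 1,
                                          std::memory_order_acquire,
                                          std::memory_order_relaxed));
  return true;
}
```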

Comment by Marko Mäkelä [ 2020-11-09 ]

After MDEV-24167, we can do the following:

  • Remove rw_lock_list.
  • Remove support for recursive X or SX latch requests.
  • Implement the spinning, waiting and wakeup around a single std::atomic<uint32_t> field, to have clean semantics.
Comment by Marko Mäkelä [ 2020-11-09 ]

I think that we could use a single std::atomic<uint32_t> lock_word and mysql_rwlock_t as follows:

void s_lock()
{
  // Register as a reader; if an X holder was present, queue up behind it
  // on the rwlock and release it again as soon as we are admitted.
  if (X & lock_word.fetch_add(1, std::memory_order_acquire))
  {
    mysql_rwlock_rdlock(&lock);
    mysql_rwlock_unlock(&lock);
  }
}
void s_unlock() { lock_word.fetch_sub(1, std::memory_order_release); }
void sx_lock()
{
  uint32_t l= lock_word.fetch_or(SX, std::memory_order_acquire);
  if (l & (SX | X))
    mysql_rwlock_wrlock(&lock); // wait for the conflicting SX or X holder
  else
    mysql_rwlock_rdlock(&lock); // only block later X requests
  // an sx_unlock() might have cleared the flag meanwhile; set it again
  l= lock_word.fetch_or(SX, std::memory_order_acquire);
  ut_ad(!(l & X));
}
void sx_unlock()
{
  lock_word.fetch_and(~SX, std::memory_order_release);
  mysql_rwlock_unlock(&lock);
}
void x_lock()
{
  lock_word.fetch_or(X, std::memory_order_acquire);
  mysql_rwlock_wrlock(&lock); // block all later S, SX, X requests
  // wait for any pre-existing S holders to drain
  while (~X & lock_word.fetch_or(X, std::memory_order_acquire))
    ;// TODO: wait somehow (sleep?) for s_unlock()
}
void x_unlock()
{
  lock_word.fetch_and(~X, std::memory_order_release);
  mysql_rwlock_unlock(&lock);
}

Keeping the count_os_wait instrumentation would complicate this a bit. I was considering a global counter for buf_block_t::lock, but it seems that it could become a serious source of false sharing. So, maybe we should use an atomic 32-bit lock word and a 32-bit non-atomic (race-prone) counter, which would shrink the object size to 4+4+64=72 bytes on my Debian GNU/Linux AMD64 system.
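
The race-prone counter idea can be sketched as follows (illustrative only; counted_lock_sketch and note_wait are made-up names):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// A plain uint32_t wait counter next to the atomic lock word. Concurrent
// increments may occasionally be lost, which is acceptable for a statistic,
// and it avoids both the cost of an atomic read-modify-write and a shared
// global counter that would cause false sharing between unrelated latches.
struct counted_lock_sketch {
  std::atomic<uint32_t> lock_word{0};
  uint32_t waits = 0;  // intentionally non-atomic: race-prone but cheap

  void note_wait() { waits++; }  // may lose updates under contention
};
```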

Comment by Marko Mäkelä [ 2020-11-16 ]

An even better idea could be to replace the implementation with the following:

  • Microsoft Windows: SRWLOCK (or two, to emulate SX lock mode)
  • Linux: std::atomic<uint32_t> and futex
  • Anything else: std::atomic<uint32_t> and mysql_rwlock_t

Removing recursive X or SX latch requests is tricky. The biggest challenges have been around BLOB operations, where the individual BLOB pages are being initialized or freed in a separate mini-transaction from the main one that is holding an X latch on the page that contains the BLOB pointer.

Comment by Marko Mäkelä [ 2020-11-18 ]

Removing recursive X or SX latch requests is indeed very tricky; it could be risky and would make the code harder to debug, because the modification history of a page would be split between two mini-transactions. We only need recursion (as well as the SX mode) for buf_block_t::lock and dict_index_t::lock. For that usage, it could be simplest to add an additional std::atomic<uint32_t> for counting recursive locks. I think that it should be practical to prohibit a thread from acquiring an SX lock while already holding the lock in X mode, or vice versa.
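
The recursion-counting idea might look roughly like this (an assumed shape with hypothetical names, not the committed sux_lock code; the underlying non-recursive latch is elided):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

struct recursive_x_sketch {
  std::atomic<std::thread::id> writer{};  // current X holder, if any
  uint32_t recursive = 0;                 // protected by holding the latch

  void x_lock()
  {
    const auto self = std::this_thread::get_id();
    if (writer.load(std::memory_order_relaxed) == self)
      recursive++;                        // re-entrant acquisition, no wait
    else
    {
      // ...acquire the underlying non-recursive latch here, then:
      writer.store(self, std::memory_order_relaxed);
      recursive = 1;
    }
  }

  void x_unlock()
  {
    assert(writer.load(std::memory_order_relaxed) ==
           std::this_thread::get_id());
    if (!--recursive)
    {
      writer.store(std::thread::id{}, std::memory_order_relaxed);
      // ...release the underlying latch here
    }
  }
};
```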

Comment by Marko Mäkelä [ 2020-11-26 ]

I have a promising solution based on two srw_lock (MDEV-24167), a std::atomic<os_thread_id_t> indicating the current write lock holder, and a uint32_t recursive field that is protected by the write lock. What is still missing is some debug code to keep track of S lock holders, PERFORMANCE_SCHEMA instrumentation, and some LatchDebug instrumentation if we still consider it useful. (It has never really worked for index pages, as demonstrated by MDEV-14637. It might have worked for other types of buffer pages, but the last benefit of using that must have been years ago.)

This and MDEV-21452 would allow us to remove os_event_t.

Comment by Marko Mäkelä [ 2020-11-27 ]

With the approach based on two srw_lock there is a potential deadlock if s_lock() is acquired after u_lock(). To solve this, I implemented srw_lock unconditionally on top of rw_lock and its std::atomic<uint32_t>, and in the end implemented an update lock in rw_lock and srw_lock. The only additional ‘service’ of sux_lock is that it allows recursive U and X locks, as required by InnoDB dict_index_t::lock and buf_block_t::lock. Compared to the old rw_lock_t, if a U lock is upgraded to an X lock, then all pre-existing U locks in the mtr_t::m_memo must be changed to X locks. This was necessary because a u_unlock() after an x_unlock() is not allowed.
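
The memo editing on upgrade can be illustrated as follows (hypothetical names and a much simplified entry type; the real mtr_t memo stores richer entries):

```cpp
#include <cassert>
#include <vector>

enum memo_type { MEMO_S, MEMO_U, MEMO_X };

struct memo_entry {
  const void *latch;  // the latched object
  memo_type type;     // the mode it will be released in
};

// After a U->X upgrade of 'latch', every earlier U entry for it must be
// retagged as X, because releasing in U mode after an X release is not
// allowed.
void on_u_x_upgrade(std::vector<memo_entry> &memo, const void *latch)
{
  for (auto &e : memo)
    if (e.latch == latch && e.type == MEMO_U)
      e.type = MEMO_X;
}
```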

Comment by Marko Mäkelä [ 2020-12-03 ]

I pushed this in multiple commits, to keep some mechanical changes separate so that the changes are easier to follow. The commits in the git log order (newest to oldest) are as follows:

ba2d45dc540 MDEV-24142: Remove INFORMATION_SCHEMA.INNODB_MUTEXES
9702be2c73c MDEV-24142: Remove __FILE__,__LINE__ related to buf_block_t::lock
ac028ec5d85 MDEV-24142: Remove the LatchDebug interface to rw-locks
06efef4be39 MDEV-24308: Windows improvements
03ca6495df3 MDEV-24142: Replace InnoDB rw_lock_t with sux_lock
d46b42489a6 MDEV-24142 preparation: Add srw_mutex and srw_lock::u_lock()

There are some user-visible changes:

  • INFORMATION_SCHEMA.INNODB_MUTEXES (MDEV-7399) was removed. Starting with MariaDB 10.2.2, it reported the number of waits in rw-locks and no information on mutexes.
  • SHOW ENGINE INNODB MUTEX will not report any information about rw-locks.
  • PERFORMANCE_SCHEMA will distinguish lock waits. The try_ events will be logged for wait-free rw-lock acquisitions. Actual high-level ‘try’ operations in the code will not be logged in PERFORMANCE_SCHEMA at all.
  • The SEMAPHORES section of SHOW ENGINE INNODB STATUS will only list information about TTASEventMutex (if building with cmake -DMUTEXTYPE=event), subject to removal in MDEV-21452.

The last commit removes my reimplementation of the SHOW ENGINE INNODB MUTEX and INFORMATION_SCHEMA.INNODB_MUTEXES output that did not rely on the rw_lock_list, which we removed. The reimplemented output was better than the original, because the mutex name was `index_name`(`database_name`.`schema_name`) rather than some obscure constant for all dict_index_t::lock instances. The reason why we removed the output and the related sux_lock::waits was that the bookkeeping incurred a significant overhead. This was also the reason why I decided to distinguish immediately granted latch requests from blocking ones in PERFORMANCE_SCHEMA, so that information about waits would remain available via that interface.

Comment by Marko Mäkelä [ 2020-12-03 ]

I made one more tweak to reduce sizeof(block_lock) from 24 to 16 bytes on 64-bit systems. The sizeof(index_lock) still incurs alignment losses of 4+4 bytes unless cmake -DPLUGIN_PERFSCHEMA=NO is used. With PERFORMANCE_SCHEMA support enabled, I have sizeof(dict_index_t)==400.

Thanks to MDEV-15053 removing buf_block_t::mutex earlier, we now have sizeof(buf_page_t)==112 and sizeof(buf_block_t)==184 on my AMD64 GNU/Linux system. I remember that I got sizeof(buf_page_t) down to exactly 64 bytes on 32-bit systems, back when I implemented ROW_FORMAT=COMPRESSED for the InnoDB Plugin for MySQL 5.1. Since then, we replaced some bit fields with std::atomic. Maybe we should try using ‘manual’ std::atomic::fetch_or() and friends to shrink it further.
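
The ‘manual’ bit-packing idea can be sketched like this (illustrative flag names, not actual buf_page_t fields):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Several independently updated flags share one std::atomic<uint32_t>,
// maintained with fetch_or()/fetch_and() instead of each flag occupying
// its own std::atomic (or padded byte), which shrinks the object.
constexpr uint32_t IO_FIX   = 1U << 0;
constexpr uint32_t OLD_FLAG = 1U << 1;

struct flags_sketch {
  std::atomic<uint32_t> bits{0};

  void set_io_fix()   { bits.fetch_or(IO_FIX, std::memory_order_relaxed); }
  void clear_io_fix() { bits.fetch_and(~IO_FIX, std::memory_order_relaxed); }
  bool is_old() const
  { return bits.load(std::memory_order_relaxed) & OLD_FLAG; }
};
```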

For reference, MariaDB Server 10.2 with the same build options has sizeof(buf_page_t)==136, sizeof(buf_block_t)==368, sizeof(rw_lock_t)==112, and sizeof(dict_index_t)==496. We have exactly halved sizeof(buf_block_t) thanks to MDEV-15053 and MDEV-24142.

Comment by Marko Mäkelä [ 2021-04-15 ]

As noted in MDEV-25404, we will need separate wait queues for shared and exclusive requests. The sizeof(sux_lock) and thus sizeof(buf_block_t) and sizeof(dict_index_t) will increase.

Comment by Marko Mäkelä [ 2021-04-20 ]

MDEV-25404 on Linux, OpenBSD and Microsoft Windows doubled sizeof(ssux_lock_low) from 4 to 8, or from 1 to 2 futex-backed 32-bit words, so that we can minimize the system calls and have a separate queue for waiting exclusive requests that will be notified when the last shared request is released.

Unless cmake -DPLUGIN_PERFSCHEMA=NO is used, ssux_lock will extend ssux_lock_low with a pointer for instrumentation; it is used by dict_index_t::lock.

The sux_lock extends ssux_lock_low or ssux_lock with a 4-byte recursion count and a thread identifier of the U or X holder.

On other systems (or if SUX_LOCK_GENERIC is defined), the layout of all these remains unchanged even with MDEV-25404: instead of 2*4 bytes in ssux_lock_low, we have a 4-byte lock word, 1 mutex and 2 condition variables. I estimate that to be somewhere between 100 and 200 bytes.
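
For a rough feel of the size difference, here is an illustrative comparison (member names are made up; the exact sizes are platform-dependent):

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

struct ssux_futex_sketch {          // Linux/OpenBSD/Windows layout:
  std::atomic<uint32_t> readers{0}; // futex word for shared requests
  std::atomic<uint32_t> writer{0};  // futex word for exclusive waiters
};                                  // exactly 2*4 = 8 bytes

struct ssux_generic_sketch {        // SUX_LOCK_GENERIC fallback layout:
  std::atomic<uint32_t> lock_word{0};
  std::mutex mutex;                 // 1 mutex
  std::condition_variable readers_cond, writers_cond;  // 2 condvars
};                                  // typically on the order of 100+ bytes
```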

Generated at Thu Feb 08 09:27:45 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.