[MDEV-24142] rw_lock_t has unnecessarily complex wait logic Created: 2020-11-05 Updated: 2023-09-01 Resolved: 2020-12-03 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.2, 10.3, 10.4, 10.5, 10.6 |
| Fix Version/s: | 10.6.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | performance | ||
| Issue Links: |
|
| Description |
|
The InnoDB custom implementation of a read-write latch (rw_lock_t) encapsulates two os_event_t. Each os_event_t encapsulates a mutex and a condition variable; one mutex would suffice. There are also some data fields to control the waiting, although rw_lock_t::lock_word alone is sufficient for that. As part of |
| Comments |
| Comment by Marko Mäkelä [ 2020-11-06 ] |
|
I suspect that the redundancy of os_event_t was masking some problems related to SX-latches. The use of volatile as poor man’s std::atomic might have worked only thanks to the redundant state in os_event_t. The test innodb.innodb_wl6326_big in CMAKE_BUILD_TYPE=RelWithDebInfo seems to be a good indicator for the problems on AMD64. I think that more of rw_lock_t must be rewritten to make this work. I started working on folding 3 fields into one std::atomic<int64_t>. The changes are so large that extensive stress testing on multiple platforms (at least AMD64, ARMv8, POWER) will be needed. |
| Comment by Marko Mäkelä [ 2020-11-09 ] |
|
It turns out that the test innodb.innodb_wl6326_big is not a good indicator for anything, because it reliably fails on my system for the RelWithDebInfo build even without these modifications. I think that it would be cleanest to implement the logic with one std::atomic<uint32_t>, a mutex, two condition variables, and a shared field that identifies the X or SX latch holder. If we remove the use of recursive X and SX latches, it should be possible to allocate only 1 bit for each special mode, and 1 bit to indicate that a request is waiting. |
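The shape of this design, an atomic lock word for the fast path plus a mutex and condition variable for sleeping, can be sketched in miniature. The following is a hypothetical illustration, not the MariaDB code: it supports only S and X modes with a single condition variable, whereas the proposal above adds an SX mode, a second condition variable, and a holder field.

```cpp
// Hypothetical sketch: a minimal S/X latch built from one std::atomic<uint32_t>,
// a mutex and a condition variable. Bit 31 marks an exclusive holder; the low
// bits count shared holders. Not the InnoDB implementation.
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

class mini_sx_latch {
  static constexpr uint32_t X = 1u << 31;  // exclusive-holder flag
  std::atomic<uint32_t> word{0};           // X flag | shared-holder count
  std::mutex m;                            // protects sleeping only, not the state
  std::condition_variable cv;
public:
  void s_lock() {
    for (;;) {
      uint32_t w = word.load(std::memory_order_relaxed);
      if (!(w & X) &&
          word.compare_exchange_weak(w, w + 1, std::memory_order_acquire))
        return;                            // fast path: no exclusive holder
      std::unique_lock<std::mutex> g(m);   // slow path: sleep until X is released
      cv.wait(g, [&] { return !(word.load() & X); });
    }
  }
  void s_unlock() {
    if (word.fetch_sub(1, std::memory_order_release) == 1) {
      std::lock_guard<std::mutex> g(m);    // last reader: wake a waiting writer
      cv.notify_all();
    }
  }
  void x_lock() {
    for (;;) {
      uint32_t w = 0;                      // expect: completely unlocked
      if (word.compare_exchange_weak(w, X, std::memory_order_acquire))
        return;
      std::unique_lock<std::mutex> g(m);
      cv.wait(g, [&] { return word.load() == 0; });
    }
  }
  void x_unlock() {
    word.store(0, std::memory_order_release);
    std::lock_guard<std::mutex> g(m);      // taking m before notify closes the
    cv.notify_all();                       // lost-wakeup window with waiters
  }
};
```

Taking the mutex in the unlock paths before notifying is what prevents a waiter from checking the lock word and going to sleep just as the holder releases it.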
| Comment by Marko Mäkelä [ 2020-11-09 ] |
|
After |
| Comment by Marko Mäkelä [ 2020-11-09 ] |
|
I think that we could use a single std::atomic<uint32_t> lock_word and mysql_rwlock_t as follows:
Keeping the count_os_wait instrumentation would complicate it a bit. I was thinking of using a global counter for buf_block_t::lock, but it seems that it could become a serious source of false sharing. So, maybe we should use an atomic 32-bit lock word and a 32-bit non-atomic (race-prone) counter, which would shrink the object size to 4+4+64=72 bytes on my Debian GNU/Linux AMD64 system. |
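The 4+4+64=72-byte estimate above can be illustrated with a hypothetical layout sketch. Here pthread_rwlock_t stands in for mysql_rwlock_t, and the 64-byte figure is specific to the quoted Debian/AMD64 system; only the two 4-byte fields are portable.

```cpp
// Hypothetical layout sketch (not the MariaDB code): an atomic 32-bit lock
// word plus a non-atomic, race-prone 32-bit wait counter placed next to the
// underlying OS rwlock, matching the 4+4+64 estimate quoted above.
#include <atomic>
#include <cstdint>
#include <pthread.h>

struct sketch_rw_lock {
  std::atomic<uint32_t> lock_word{0}; // 4 bytes: the actual latch state
  uint32_t count_os_wait = 0;         // 4 bytes: approximate wait statistics;
                                      // updates may race, losing some counts
  pthread_rwlock_t base;              // 64 bytes on the system quoted above
                                      // (layout illustration only)
};

static_assert(sizeof(std::atomic<uint32_t>) == 4,
              "lock word fits in 4 bytes on mainstream platforms");
```

Tolerating lost updates on count_os_wait is the trade-off that avoids both an atomic read-modify-write per wait and the false sharing a global counter would cause.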
| Comment by Marko Mäkelä [ 2020-11-16 ] |
|
An even better idea could be to replace the implementation with the following:
Removing recursive X or SX latch requests is tricky. The biggest challenges have been around BLOB operations, where the individual BLOB pages are being initialized or freed in a separate mini-transaction from the main one that is holding an X latch on the page that contains the BLOB pointer. |
| Comment by Marko Mäkelä [ 2020-11-18 ] |
|
Removing recursive X or SX latch requests is indeed very tricky; it could be risky and make the code harder to debug, because the modification history of a page would be split between two mini-transactions. We only need those (as well as the SX mode) for buf_block_t::lock and dict_index_t::lock. For that usage, it could be simplest to add an additional std::atomic<uint32_t> for counting recursive locks. I think that it should be practical to prohibit a thread from acquiring an SX latch while already holding the latch in X mode, or vice versa. |
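The recursion-counting idea could look roughly like the following hypothetical sketch (names are illustrative, not the InnoDB ones): a non-recursive exclusive lock is wrapped with a holder thread id and a recursion depth, so a repeated acquisition by the current holder only bumps the counter.

```cpp
// Hypothetical sketch: layering a recursion count and holder id on top of a
// plain non-recursive exclusive lock, as suggested above. std::mutex stands
// in for the underlying latch; not the MariaDB code.
#include <atomic>
#include <cstdint>
#include <mutex>
#include <thread>

class recursive_x_lock {
  std::mutex lock;                        // the underlying non-recursive latch
  std::atomic<std::thread::id> writer{};  // current X holder, default id if none
  uint32_t recursive = 0;                 // depth; only the holder touches it
public:
  void x_lock() {
    std::thread::id self = std::this_thread::get_id();
    if (writer.load(std::memory_order_relaxed) == self)
      ++recursive;                        // re-entrant acquisition: just count
    else {
      lock.lock();                        // first acquisition: take the latch
      writer.store(self, std::memory_order_relaxed);
      recursive = 1;
    }
  }
  void x_unlock() {
    if (--recursive == 0) {               // innermost unlock releases the latch
      writer.store(std::thread::id(), std::memory_order_relaxed);
      lock.unlock();
    }
  }
  uint32_t depth() const { return recursive; }
};
```

Because only the holder ever writes the recursion depth, it needs no atomicity; only the holder id must be read safely by other threads.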
| Comment by Marko Mäkelä [ 2020-11-26 ] |
|
I have a promising solution based on two srw_lock ( This and |
| Comment by Marko Mäkelä [ 2020-11-27 ] |
|
With the approach based on two srw_lock there is a potential deadlock if s_lock() is acquired after u_lock(). To solve this, I implemented srw_lock unconditionally based on rw_lock and its std::atomic<uint32_t>, and in the end implemented an update lock in rw_lock and srw_lock. The only additional ‘service’ of sux_lock is that it allows recursive U and X locks, as required by InnoDB dict_index_t::lock and buf_block_t::lock. Compared to the old rw_lock_t, if a U lock is upgraded to an X lock, then all pre-existing U locks in mtr_t::m_memo must be changed to X locks. This was necessary because a u_unlock() after an x_unlock() is not allowed. |
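The update-lock semantics being described can be sketched as follows (a hypothetical toy, not the srw_lock/sux_lock code): a U holder excludes other U/X requests but admits S holders, and an upgrade to X merely waits for the remaining readers to drain. After the upgrade only x_unlock() is valid, which is why pre-existing U entries in the memo must be relabeled as X.

```cpp
// Hypothetical sketch of update-lock (U) semantics on one 32-bit word:
// bit 31 = U/X intent, low bits = shared-reader count. Not the MariaDB code.
#include <atomic>
#include <cstdint>

class mini_u_lock {
  static constexpr uint32_t U = 1u << 31; // update/exclusive-intent flag
  std::atomic<uint32_t> word{0};          // U flag | shared-reader count
public:
  bool u_try_lock() {                     // excludes other U/X, admits S
    uint32_t w = word.load(std::memory_order_relaxed);
    return !(w & U) &&
           word.compare_exchange_strong(w, w | U, std::memory_order_acquire);
  }
  void u_x_upgrade() {                    // wait for existing readers to drain;
    while (word.load(std::memory_order_acquire) != U) { /* busy-wait */ }
  }                                       // no new U/X can sneak in meanwhile
  void x_unlock() { word.store(0, std::memory_order_release); }
  bool s_try_lock() {                     // readers coexist with a U holder
    uint32_t w = word.load(std::memory_order_relaxed);
    return word.compare_exchange_strong(w, w + 1, std::memory_order_acquire);
  }
  void s_unlock() { word.fetch_sub(1, std::memory_order_release); }
};
```

Because the U holder already excludes competing writers, the upgrade cannot deadlock against another upgrader; it only has to outwait the readers.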
| Comment by Marko Mäkelä [ 2020-12-03 ] |
|
I pushed this in multiple commits, to keep some mechanical changes separate so that the changes are easier to follow. The commits in git log order (newest to oldest) are as follows:

ba2d45dc540 MDEV-24142: Remove INFORMATION_SCHEMA.INNODB_MUTEXES

There are some user-visible changes:
The last commit removes my reimplementation of the SHOW ENGINE INNODB MUTEX and INFORMATION_SCHEMA.INNODB_MUTEXES output, which did not rely on the rw_lock_list that we removed. The reimplemented output was better than the original, because the mutex name was `index_name`(`database_name`.`schema_name`) rather than some obscure constant for all dict_index_t::lock instances. The reason why we removed the output and the related sux_lock::waits was that the bookkeeping incurred a significant overhead. This was also the reason why I decided to distinguish immediately granted latch requests from blocking ones in PERFORMANCE_SCHEMA, so that the information about waits would be available via that interface. |
| Comment by Marko Mäkelä [ 2020-12-03 ] |
|
I made one more tweak to reduce sizeof(block_lock) from 24 to 16 bytes on 64-bit systems. The sizeof(index_lock) still incurs alignment losses of 4+4 bytes unless cmake -DPLUGIN_PERFSCHEMA=NO is used. With PERFORMANCE_SCHEMA support enabled, I have sizeof(dict_index_t)==400. Thanks to

For reference, the MariaDB Server 10.2 with the same build options has sizeof(buf_page_t)==136, sizeof(buf_block_t)==368, sizeof(rw_lock_t)==112, sizeof(dict_index_t)==496. We have exactly halved sizeof(buf_block_t) thanks to |
| Comment by Marko Mäkelä [ 2021-04-15 ] |
|
As noted in |
| Comment by Marko Mäkelä [ 2021-04-20 ] |
|
Unless cmake -DPLUGIN_PERFSCHEMA=NO is used, ssux_lock extends ssux_lock_low with a pointer for instrumentation; it is used by dict_index_t::lock. The sux_lock extends ssux_lock_low or ssux_lock with a 4-byte recursion count and a thread identifier of the U or X holder. On other systems (or if SUX_LOCK_GENERIC is defined), the layout of all these remains unchanged even with |