lock_sys is one of three major InnoDB scalability bottlenecks. The scalability issues are especially obvious under the sysbench OLTP update index and update non-index benchmarks.
It is not yet clear how exactly it should be optimised.
Issue Links

blocks
  MDEV-21452 Use condition variables and normal mutexes instead of InnoDB os_event and mutex (Closed)
causes
  MDEV-24861 Assertion `trx->rsegs.m_redo.rseg' failed in innodb_prepare_commit_versioned (Closed)
  MDEV-35708 lock_rec_get_prev() returns only the first record lock (Closed)
includes
  MDEV-24731 Excessive mutex contention in DeadlockChecker::check_and_resolve() (Closed)
is blocked by
  MDEV-24671 Assertion failure in lock_wait_table_reserve_slot() (Closed)
relates to
  MDEV-11392 AliSQL: [perf] Issue#31 OPTIMIZE CHECK/GRANT OF INNODB TABLE LOCK
Marko Mäkelä added a comment:

As noted in MDEV-20483, the apparent reason for table_locks to exist is that it is a cache of trx_locks that can only be accessed by the thread that is executing the transaction. This allows callers of lock_table_has() to avoid acquiring trx_t::mutex. Maybe we should simply omit table locks from trx_locks and keep them in table_locks only?

Maybe we could store table_locks in a lock-free hash table, so that they can be traversed by diagnostic printouts.

Similarly, maybe we could extend MDEV-16406 (Refactor the InnoDB record locks) by using a lock-free hash table that maps (trx_id, space_id, page_number, heap_number), or a subset of it such as (space_id, page_number), to a bitmap.
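To make the mapping idea concrete, here is a minimal sketch, not MariaDB code, of an insert-only, lock-free open-addressing table keyed by (space_id, page_number), with one bit per heap number. The names LockBitmapTable and page_key are invented for illustration, and the sketch assumes heap numbers below 64 and a table that never fills up:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

struct LockBitmapTable {
  static constexpr size_t N = 1 << 16;   // power-of-two capacity
  std::atomic<uint64_t> key[N]{};        // packed (space_id,page_number); 0 = empty slot
  std::atomic<uint64_t> bitmap[N]{};     // one bit per heap number (assumes heap_no < 64)

  static uint64_t page_key(uint32_t space_id, uint32_t page_number) {
    // Simplification: assumes (space 0, page 0) is never used as a key,
    // since 0 denotes an empty slot.
    return (uint64_t{space_id} << 32) | page_number;
  }

  // Set the bit for heap_no in the page's bitmap, inserting the key if needed.
  void lock(uint32_t space_id, uint32_t page_number, unsigned heap_no) {
    const uint64_t k = page_key(space_id, page_number);
    size_t i = (k * 0x9e3779b97f4a7c15ULL) & (N - 1);   // Fibonacci hash
    for (;;) {
      uint64_t expected = 0;
      // Claim an empty slot, or discover that the key is already present;
      // the CAS makes the insert lock-free.
      if (key[i].compare_exchange_strong(expected, k) || expected == k) {
        bitmap[i].fetch_or(uint64_t{1} << heap_no);     // publish the lock bit
        return;
      }
      i = (i + 1) & (N - 1);                            // linear probing on collision
    }
  }
};

LockBitmapTable table;   // static storage: zero-initialized, all slots empty
```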
zhai weixiang added a comment:

We partitioned the record lock hash, which works fine. The difficult part is how to detect deadlocks across multiple partitions. A lock-free lock system is a good idea; AFAIK, AWS Aurora did a similar job to make it lock-free. There is a paper describing this: https://ts.data61.csiro.au/publications/nicta_full_text/6465.pdf
Marko Mäkelä added a comment (edited):

MySQL 8.0.21 implemented a sharded lock_sys mutex. Edit: they seem to have replaced the mutex with a partitioned rw-latch, using page_id_t as the key. For deadlock detection, they appear to employ a global rw-latch.

I think that it might be helpful to first implement MDEV-16232 to avoid the creation of explicit record locks during non-conflicting DELETE and UPDATE. Table IX lock creation would still cause contention on lock_sys.
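The shard selection in such a partitioned design might look like the following sketch (hypothetical names, not the actual WL#10314 code): the latch protecting a record lock queue is chosen by hashing the two components of the page identifier into a fixed array:

```cpp
#include <cstdint>
#include <mutex>

constexpr size_t N_SHARDS = 256;          // one possible shard count, as in WL#10314
std::mutex rec_lock_shard[N_SHARDS];

// Pick the shard for a given page; any reasonable mix of the
// page_id_t components (space id, page number) works here.
std::mutex &shard_for(uint32_t space_id, uint32_t page_number) {
  uint64_t id = (uint64_t{space_id} << 32) | page_number;
  return rec_lock_shard[(id * 0x9e3779b97f4a7c15ULL >> 56) & (N_SHARDS - 1)];
}
```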
Marko Mäkelä added a comment:

In MDEV-21452 I was trying to use trx_t::mutex instead of lock_sys.mutex with a condition variable, to resume lock_wait_suspend_thread(). That did not quite work: we would get a race condition, most likely due to the wait ending prematurely. The change mostly consisted of adding trx->mutex acquisition around lock_reset_lock_and_trx_wait() calls.

The code around suspending and resuming threads is rather convoluted. In particular, que_thr_stop_for_mysql() and lock_wait_suspend_thread() are separated from each other, and trx->mutex is acquired and released multiple times while a lock wait is being registered. Also, there are multiple state fields related to lock waits, both in que_thr_t and trx_t.

I think that performance would improve somewhat if we cleaned this up.
Marko Mäkelä added a comment:

I reviewed the MySQL 8.0.21 WL#10314 again.

It replaces lock_sys.mutex with an rw-lock. Many operations, such as deadlock detection, will be protected by the exclusive global latch.

To improve scalability, it introduces 256+256 mutexes, indexed by page_id_t or dict_table_t::id. Each shard is protected by a combination of the shared global rw-lock and the shard mutex.

Because implementing MDEV-16232 would require much more effort than this, I think that we must consider this approach for MariaDB.

One complication is that lock_wait_suspend_thread() in MDEV-21452 will use a condition variable in combination with lock_sys.mutex, which would be replaced by the rw-latch above. It might be easiest to combine the condition variable with lock_sys.wait_mutex instead of lock_sys.mutex. Only on Microsoft Windows could the native SRWLOCK be combined with CONDITION_VARIABLE.
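The latching protocol described above can be summarized in a hedged sketch (hypothetical names, not the actual MySQL or MariaDB code): most operations take the global latch in shared mode plus one shard mutex, while operations that need to see everything, such as deadlock detection, take the global latch exclusively, which conflicts with every shared holder and therefore with every shard:

```cpp
#include <mutex>
#include <shared_mutex>

std::shared_mutex lock_sys_latch;     // global rw-latch replacing lock_sys.mutex
std::mutex shard_mutex[256 + 256];    // 256 page shards + 256 table shards

void lock_one_queue(size_t shard) {
  std::shared_lock<std::shared_mutex> global(lock_sys_latch); // shared: scalable
  std::lock_guard<std::mutex> local(shard_mutex[shard]);      // serialize one queue
  // ... inspect or modify the lock queue belonging to this shard ...
}

void deadlock_detection_pass() {
  std::unique_lock<std::shared_mutex> global(lock_sys_latch); // exclusive: sees all
  // ... traverse the entire wait-for graph; no shard mutex needed ...
}
```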
Marko Mäkelä added a comment:

I replaced lock_wait_suspend_thread() with a simple lock_wait() that uses pthread_cond_timedwait() to wait for the lock to be granted. If that call fails with an error, we know that a timeout has occurred. The wait may be interrupted by ha_kill_query() or deadlock detection, which will simply invoke pthread_cond_signal().

There is no need for a separate lock_wait_timeout_task that would wake up once per second. Also, the relative latching order of lock_sys.wait_mutex and lock_sys.mutex (which will be replaced with lock_sys.latch) will be swapped. Hopefully this necessary refactoring will provide some additional performance benefit.
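The core of such a wait might look like the following sketch, assuming a simplified transaction object with its own mutex, condition variable, and wait_lock pointer (hypothetical structure; the real code lives in the InnoDB lock subsystem):

```cpp
#include <pthread.h>
#include <cerrno>
#include <ctime>

struct trx_sim {
  pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
  pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
  void *wait_lock = nullptr;   // set while waiting; cleared when the lock is granted
};

// Returns true if the lock was granted, false on lock wait timeout.
bool lock_wait(trx_sim *trx, unsigned timeout_sec) {
  timespec abstime;
  clock_gettime(CLOCK_REALTIME, &abstime);   // pthread_cond_timedwait's default clock
  abstime.tv_sec += timeout_sec;

  pthread_mutex_lock(&trx->mutex);
  int err = 0;
  // The lock grantor, ha_kill_query(), or the deadlock detector clears
  // wait_lock (or marks the trx) and signals trx->cond to end the wait.
  while (trx->wait_lock && err != ETIMEDOUT)
    err = pthread_cond_timedwait(&trx->cond, &trx->mutex, &abstime);
  bool granted = !trx->wait_lock;
  pthread_mutex_unlock(&trx->mutex);
  return granted;   // ETIMEDOUT here corresponds to innodb_lock_wait_timeout
}
```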
Marko Mäkelä added a comment:

Also, srv_slot_t can be removed and the locality of reference improved by storing trx->lock.wait_lock and trx->lock.cond at adjacent addresses.
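A sketch of this data layout idea (hypothetical names): instead of a detached slot array, the wait state lives inside the transaction object itself, so the fields touched together during a wait share cache lines:

```cpp
#include <condition_variable>

struct lock_t;                      // opaque here

struct trx_lock_sim {
  lock_t *wait_lock = nullptr;      // the lock being waited for ...
  std::condition_variable cond;     // ... and the wakeup channel, adjacent to it
  // Waking a waiter touches both fields with one cache miss instead of
  // chasing a pointer into a separate srv_slot_t array.
};
```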
Marko Mäkelä added a comment (edited):

zhaiwx1987, I adapted the MDEV-11392 idea from MySQL Bug #72948, but I introduced a single counter dict_table_t::n_lock_x_or_s. There is actually quite a bit of room for improvement in lock_sys, in addition to what was done in MySQL 8.0.21 WL#10314.
Marko Mäkelä added a comment (edited):

The lock_wait() refactoring was causing some assertion failures in the start/stop que_thr_t bookkeeping. I think that it is simplest to remove that bookkeeping along with some unnecessary data members and enum values. Edit: This was done in MDEV-24671. As an added bonus, innodb_lock_wait_timeout is enforced in a more timely fashion (no extra 1-second delay).

It turns out that the partitioned lock_sys.mutex will not work efficiently with the old DeadlockChecker. It must be refactored, similar to what was done in Oracle Bug #29882690 in MySQL 8.0.18.
Marko Mäkelä added a comment:

As a minimal change, I moved the DeadlockChecker::search() invocation to lock_wait(). A separate deadlock checker thread or task might still be useful. For that, I do not think that there is a need to introduce any blocking_trx data member. In our code, it should be safe to follow the chain trx->lock.wait_lock->trx while holding lock_sys.wait_mutex and possibly also trx->mutex.
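A minimal sketch of following that chain (hypothetical, simplified types; the real detector also chooses a victim, handles table locks, and so on): under a mutex standing in for lock_sys.wait_mutex, step from each waiting transaction to the holder of the lock it waits for, and report a deadlock if the chain loops back:

```cpp
#include <mutex>

struct trx_sim2;
struct lock_sim { trx_sim2 *holder; };      // the blocking lock's owner
struct trx_sim2 { lock_sim *wait_lock; };   // null when not waiting

std::mutex wait_mutex;                      // stands in for lock_sys.wait_mutex

bool in_deadlock_cycle(trx_sim2 *trx) {
  std::lock_guard<std::mutex> g(wait_mutex);  // stabilizes all wait_lock fields
  trx_sim2 *t = trx;
  // The depth bound guards against cycles that do not pass through trx.
  for (unsigned depth = 0; t && t->wait_lock && depth < 200; depth++) {
    t = t->wait_lock->holder;               // step to the blocking transaction
    if (t == trx)
      return true;                          // the chain loops back: deadlock
  }
  return false;
}
```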
Marko Mäkelä added a comment:

We replaced lock_sys.mutex with lock_sys.latch (MDEV-24167), which is 4 or 8 bytes on Linux, Microsoft Windows, and OpenBSD. On other systems, a native rw-lock or a mutex and two condition variables will be used.

The entire world of transactional locks can be stopped by acquiring lock_sys.latch in exclusive mode.

Scalability is achieved by making most users acquire a combination of a shared lock_sys.latch and a lock-specific dict_table_t::lock_mutex or lock_sys_t::hash_latch that is embedded in each cache line of lock_sys.rec_hash, lock_sys.prdt_hash, or lock_sys.prdt_page_hash. The lock_sys_t::hash_latch is always 4 or 8 bytes. On systems other than Linux, OpenBSD, and Microsoft Windows, lock_sys_t::hash_latch::release() will always acquire a mutex and signal a condition variable. This is a known scalability bottleneck that could be improved further on such systems, for example by splitting the mutex and condition variable. (If such systems supported a lightweight mutex of at most sizeof(void*), we could happily use that.)

Until MDEV-24738 has been fixed, the deadlock detector will remain a significant bottleneck, because each lock_wait() acquires lock_sys.latch in exclusive mode. This bottleneck can be avoided by setting innodb_deadlock_detect=OFF.
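A hedged sketch of the hash_latch embedding idea (simplified to a spinlock; the real latch blocks via a futex or, on other systems, via the mutex plus condition variable described above): a latch that occupies a single machine word can live directly in each cache line of the lock hash array, so acquiring a cell's latch and reading the cell costs one cache miss:

```cpp
#include <atomic>
#include <cstdint>

struct alignas(64) hash_cell {        // one cache line of the hash array
  std::atomic<uint32_t> latch{0};     // word-sized latch, as in lock_sys
  void *first_lock = nullptr;         // head of this cell's lock list
  // ... the remaining bytes of the line hold further hash pointers ...

  void acquire() {
    uint32_t zero = 0;
    // Spin until the latch word flips from 0 to 1; real code would
    // fall back to a futex wait instead of spinning indefinitely.
    while (!latch.compare_exchange_weak(zero, 1, std::memory_order_acquire))
      zero = 0;
  }
  void release() { latch.store(0, std::memory_order_release); }
};
```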