[MDEV-20612] Improve InnoDB lock_sys scalability Created: 2019-09-17  Updated: 2021-08-25  Resolved: 2021-02-12

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: 10.6.0

Type: Task Priority: Major
Reporter: Sergey Vojtovich Assignee: Marko Mäkelä
Resolution: Fixed Votes: 2
Labels: performance

Issue Links:
Blocks
blocks MDEV-21452 Use condition variables and normal mu... Closed
is blocked by MDEV-24671 Assertion failure in lock_wait_table_... Closed
PartOf
includes MDEV-24731 Excessive mutex contention in Deadloc... Closed
Problem/Incident
causes MDEV-24861 Assertion `trx->rsegs.m_redo.rseg' fa... Closed
Relates
relates to MDEV-11392 AliSQL: [perf] Issue#31 OPTIMIZE CHEC... Closed
relates to MDEV-16406 Refactor the InnoDB record locks Open
relates to MDEV-24813 Locking full table scan fails to use ... Confirmed
relates to MDEV-24948 thd_need_wait_reports() hurts perform... Open
relates to MDEV-24952 Simplify the locking and access of lo... Closed
relates to MDEV-26476 InnoDB is missing futex support on so... Closed
relates to MDEV-18706 ER_LOCK_DEADLOCK on concurrent read a... Stalled
relates to MDEV-21175 Remove dict_table_t::n_foreign_key_ch... Closed
relates to MDEV-24915 Galera conflict resolution is unneces... Closed
relates to MDEV-25010 Assertion `!lock_sys_t::get_first(rec... Closed

 Description   

lock_sys is one of three major InnoDB scalability bottlenecks. Scalability issues are especially obvious under sysbench OLTP update index/non-index benchmarks.

It is not yet clear how exactly it should be optimised.



 Comments   
Comment by Marko Mäkelä [ 2019-09-17 ]

As noted in MDEV-20483, the apparent reason for table_locks to exist is that it is a cache of trx_locks that can only be accessed by the thread that is executing the transaction. This allows callers of lock_table_has() to avoid accessing trx_t::mutex. Maybe we should simply omit table locks from trx_locks, and keep them in table_locks only?

Maybe we could store table_locks in a lock-free hash table, so that they can be traversed by diagnostic printouts.

Similarly, maybe we can extend MDEV-16406 (Refactor the InnoDB record locks) by using a lock-free hash table that maps (trx_id,space_id,page_number,heap_number), or a subset of it such as (space_id,page_number), to a bitmap.

Comment by zhai weixiang [ 2019-10-09 ]

We partitioned the record lock hash, which works fine. The difficult part is how to detect deadlocks across multiple partitions. A lock-free lock system is a good idea; AFAIK, AWS Aurora did a similar job to make it lock-free. There's a paper describing this: https://ts.data61.csiro.au/publications/nicta_full_text/6465.pdf

Comment by Marko Mäkelä [ 2020-10-05 ]

MySQL 8.0.21 implemented a sharded lock_sys mutex.

Edit: they seem to have replaced the mutex with a partitioned rw-latch, using page_id_t as the key. For deadlock detection, they appear to employ a global rw-latch.

I think that it might be helpful to first implement MDEV-16232 to avoid the creation of explicit record locks during non-conflicting DELETE and UPDATE. Table IX lock creation would still cause contention on lock_sys.

Comment by Marko Mäkelä [ 2020-11-05 ]

In MDEV-21452 I was trying to use trx_t::mutex instead of lock_sys.mutex with a condition variable, to resume lock_wait_suspend_thread(). That did not quite work: we would hit a race condition, most likely due to the wait ending prematurely. The change mostly consisted of adding trx->mutex acquisition around lock_reset_lock_and_trx_wait() calls.

The code around suspending and resuming threads is rather convoluted. In particular, que_thr_stop_for_mysql() and lock_wait_suspend_thread() are separated from each other, and the trx->mutex is being acquired and released multiple times while a lock wait is being registered. Also, there are multiple state fields related to lock waits, both in que_thr_t and trx_t.

I think that our performance would improve somewhat if we cleaned this up.

Comment by Marko Mäkelä [ 2020-12-16 ]

I reviewed the MySQL 8.0.21 WL#10314 again.

  • It replaces lock_sys.mutex with an rw-lock. Many operations, such as deadlock detection, will be protected by the exclusive global lock.
  • To improve scalability, it introduces 256+256 mutexes, indexed by page_id_t or dict_table_t::id. Each shard is protected by a combination of shared global rw-lock and the mutex.

Because implementing MDEV-16232 would require much more effort than this, I think that we must consider this approach for MariaDB.

One complication is that lock_wait_suspend_thread() in MDEV-21452 will use a condition variable in combination with lock_sys.mutex, which would be replaced by rw-latch above. It might be easiest to combine the condition variable with lock_sys.wait_mutex instead of lock_sys.mutex. Only on Microsoft Windows, the native SRWLOCK could be combined with CONDITION_VARIABLE.

Comment by Marko Mäkelä [ 2021-01-08 ]

I replaced lock_wait_suspend_thread() with a simple lock_wait() that will use pthread_cond_timedwait() to wait for the lock to be granted. If that call fails with an error, we know that a timeout has occurred. The wait may be interrupted by ha_kill_query() or deadlock detection, which will simply invoke pthread_cond_signal().

There is no need for a separate lock_wait_timeout_task, which would wake up once per second. Also, the relative latching order of lock_sys.wait_mutex and lock_sys.mutex (which will be replaced with lock_sys.latch) will be swapped. Hopefully this necessary refactoring will provide some additional performance benefit.

Comment by Marko Mäkelä [ 2021-01-09 ]

Also srv_slot_t can be removed and the locality of reference improved by storing trx->lock.wait_lock and trx->lock.cond in adjacent addresses.

Comment by Marko Mäkelä [ 2021-01-15 ]

zhaiwx1987, I adapted the MDEV-11392 idea from MySQL Bug #72948, but I introduced a single counter dict_table_t::n_lock_x_or_s. There is actually quite a bit of room for improvement in lock_sys, in addition to what was done in MySQL 8.0.21 WL#10314.

Comment by Marko Mäkelä [ 2021-01-19 ]

The lock_wait() refactoring was causing some assertion failures in the start/stop que_thr_t bookkeeping. I think that it is simplest to remove that bookkeeping along with removing some unnecessary data members or enum values. Edit: This was done in MDEV-24671. As an added bonus, innodb_lock_wait_timeout is enforced more timely (no extra 1-second delay).

It turns out that the partitioned lock_sys.mutex will not work efficiently with the old DeadlockChecker. It must be refactored, similar to what was done in Oracle Bug #29882690 in MySQL 8.0.18.

Comment by Marko Mäkelä [ 2021-01-28 ]

As a minimal change, I moved the DeadlockChecker::search() invocation to lock_wait(). A separate deadlock checker thread or task might still be useful. For that, I do not think that there is a need to introduce any blocking_trx data member. In our code, it should be safe to follow the chain of trx->lock.wait_lock->trx while holding lock_sys.wait_mutex and possibly also trx->mutex.

Comment by Marko Mäkelä [ 2021-02-12 ]

We replaced lock_sys.mutex with a lock_sys.latch (MDEV-24167) that is 4 or 8 bytes on Linux, Microsoft Windows or OpenBSD. On other systems, a native rw-lock or a mutex and two condition variables will be used.

The entire world of transactional locks can be stopped by acquiring lock_sys.latch in exclusive mode.

Scalability is achieved by making most users acquire a combination of a shared lock_sys.latch and a lock-specific dict_table_t::lock_mutex or lock_sys_t::hash_latch that is embedded in each cache line of lock_sys.rec_hash, lock_sys.prdt_hash, or lock_sys.prdt_page_hash. The lock_sys_t::hash_latch is always 4 or 8 bytes. On systems other than Linux, OpenBSD, and Microsoft Windows, lock_sys_t::hash_latch::release() will always acquire a mutex and signal a condition variable. This is a known scalability bottleneck on such systems and could be improved further by splitting the mutex and condition variable. (If such systems supported a lightweight mutex of at most sizeof(void*), we could happily use that.)

Until MDEV-24738 has been fixed, the deadlock detector will remain a significant bottleneck, because each lock_wait() would acquire lock_sys.latch in exclusive mode. This bottleneck can be avoided by setting innodb_deadlock_detect=OFF.

Generated at Thu Feb 08 09:00:50 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.