lock_sys is one of three major InnoDB scalability bottlenecks. The scalability issues are especially obvious under the sysbench OLTP update index and update non-index benchmarks.
It is not yet clear how exactly it should be optimised.
Issue Links

blocks
  MDEV-21452 Use condition variables and normal mutexes instead of InnoDB os_event and mutex (Closed)
causes
  MDEV-24861 Assertion `trx->rsegs.m_redo.rseg' failed in innodb_prepare_commit_versioned (Closed)
  MDEV-35708 lock_rec_get_prev() returns only the first record lock (Closed)
includes
  MDEV-24731 Excessive mutex contention in DeadlockChecker::check_and_resolve() (Closed)
is blocked by
  MDEV-24671 Assertion failure in lock_wait_table_reserve_slot() (Closed)
relates to
  MDEV-11392 AliSQL: [perf] Issue#31 OPTIMIZE CHECK/GRANT OF INNODB TABLE LOCK
Marko Mäkelä added a comment:

As noted in MDEV-20483, the apparent reason for table_locks to exist is that it is a cache of trx_locks that can only be accessed by the thread that is executing the transaction. This allows callers of lock_table_has() to avoid acquiring trx_t::mutex. Maybe we should simply omit table locks from trx_locks and keep them in table_locks only?

Maybe we could store table_locks in a lock-free hash table, so that they can be traversed by diagnostic printouts.

Similarly, maybe we could extend MDEV-16406 (Refactor the InnoDB record locks) by using a lock-free hash table that maps (trx_id, space_id, page_number, heap_number), or a subset of it such as (space_id, page_number), to a bitmap.
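To make the mapping idea concrete, here is a minimal sketch, not MariaDB code, of an insert-only, lock-free open-addressing table keyed by (space_id, page_number), with one bit per heap number. The names LockBitmapTable and page_key are invented for illustration, and the sketch assumes heap numbers below 64 and a table that never fills up:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

struct LockBitmapTable {
  static constexpr size_t N = 1 << 16;   // power-of-two capacity
  std::atomic<uint64_t> key[N]{};        // packed (space_id,page_number); 0 = empty slot
  std::atomic<uint64_t> bitmap[N]{};     // one bit per heap number (assumes heap_no < 64)

  static uint64_t page_key(uint32_t space_id, uint32_t page_number) {
    // Simplification: assumes (space 0, page 0) is never used as a key,
    // since 0 denotes an empty slot.
    return (uint64_t{space_id} << 32) | page_number;
  }

  // Set the bit for heap_no in the page's bitmap, inserting the key if needed.
  void lock(uint32_t space_id, uint32_t page_number, unsigned heap_no) {
    const uint64_t k = page_key(space_id, page_number);
    size_t i = (k * 0x9e3779b97f4a7c15ULL) & (N - 1);   // Fibonacci hash
    for (;;) {
      uint64_t expected = 0;
      // Claim an empty slot, or discover that the key is already present;
      // the CAS makes the insert lock-free.
      if (key[i].compare_exchange_strong(expected, k) || expected == k) {
        bitmap[i].fetch_or(uint64_t{1} << heap_no);     // publish the lock bit
        return;
      }
      i = (i + 1) & (N - 1);                            // linear probing on collision
    }
  }
};

LockBitmapTable table;   // static storage: zero-initialized, all slots empty
```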
zhai weixiang added a comment:

We partitioned the record lock hash, which works fine. The difficult part is how to detect deadlocks across multiple partitions. A lock-free lock system is a good idea; AFAIK, AWS Aurora did a similar job to make it lock-free. There is a paper describing this: https://ts.data61.csiro.au/publications/nicta_full_text/6465.pdf
Marko Mäkelä added a comment (edited):

MySQL 8.0.21 implemented a sharded lock_sys mutex. Edit: they seem to have replaced the mutex with a partitioned rw-latch, using page_id_t as the key. For deadlock detection, they appear to employ a global rw-latch.

I think that it might be helpful to first implement MDEV-16232 to avoid the creation of explicit record locks during non-conflicting DELETE and UPDATE. Table IX lock creation would still cause contention on lock_sys.
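The shard selection in such a partitioned design might look like the following sketch (hypothetical names, not the actual WL#10314 code): the latch protecting a record lock queue is chosen by hashing the two components of the page identifier into a fixed array:

```cpp
#include <cstdint>
#include <mutex>

constexpr size_t N_SHARDS = 256;          // one possible shard count, as in WL#10314
std::mutex rec_lock_shard[N_SHARDS];

// Pick the shard for a given page; any reasonable mix of the
// page_id_t components (space id, page number) works here.
std::mutex &shard_for(uint32_t space_id, uint32_t page_number) {
  uint64_t id = (uint64_t{space_id} << 32) | page_number;
  return rec_lock_shard[(id * 0x9e3779b97f4a7c15ULL >> 56) & (N_SHARDS - 1)];
}
```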
Marko Mäkelä added a comment:

In MDEV-21452 I was trying to use trx_t::mutex instead of lock_sys.mutex with a condition variable, to resume lock_wait_suspend_thread(). That did not quite work: we would get a race condition, most likely due to the wait ending prematurely. The change mostly consisted of adding trx->mutex acquisition around lock_reset_lock_and_trx_wait() calls.

The code around suspending and resuming threads is rather convoluted. In particular, que_thr_stop_for_mysql() and lock_wait_suspend_thread() are separated from each other, and trx->mutex is acquired and released multiple times while a lock wait is being registered. Also, there are multiple state fields related to lock waits, both in que_thr_t and trx_t.

I think that performance would improve somewhat if we cleaned this up.
Marko Mäkelä added a comment:

I reviewed the MySQL 8.0.21 WL#10314 again.

It replaces lock_sys.mutex with an rw-lock. Many operations, such as deadlock detection, will be protected by the exclusive global latch.

To improve scalability, it introduces 256+256 mutexes, indexed by page_id_t or dict_table_t::id. Each shard is protected by a combination of the shared global rw-lock and the shard mutex.

Because implementing MDEV-16232 would require much more effort than this, I think that we must consider this approach for MariaDB.

One complication is that lock_wait_suspend_thread() in MDEV-21452 will use a condition variable in combination with lock_sys.mutex, which would be replaced by the rw-latch above. It might be easiest to combine the condition variable with lock_sys.wait_mutex instead of lock_sys.mutex. Only on Microsoft Windows could the native SRWLOCK be combined with CONDITION_VARIABLE.
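The latching protocol described above can be summarized in a hedged sketch (hypothetical names, not the actual MySQL or MariaDB code): most operations take the global latch in shared mode plus one shard mutex, while operations that need to see everything, such as deadlock detection, take the global latch exclusively, which conflicts with every shared holder and therefore with every shard:

```cpp
#include <mutex>
#include <shared_mutex>

std::shared_mutex lock_sys_latch;     // global rw-latch replacing lock_sys.mutex
std::mutex shard_mutex[256 + 256];    // 256 page shards + 256 table shards

void lock_one_queue(size_t shard) {
  std::shared_lock<std::shared_mutex> global(lock_sys_latch); // shared: scalable
  std::lock_guard<std::mutex> local(shard_mutex[shard]);      // serialize one queue
  // ... inspect or modify the lock queue belonging to this shard ...
}

void deadlock_detection_pass() {
  std::unique_lock<std::shared_mutex> global(lock_sys_latch); // exclusive: sees all
  // ... traverse the entire wait-for graph; no shard mutex needed ...
}
```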
Marko Mäkelä added a comment:

I replaced lock_wait_suspend_thread() with a simple lock_wait() that uses pthread_cond_timedwait() to wait for the lock to be granted. If that call fails with an error, we know that a timeout has occurred. The wait may be interrupted by ha_kill_query() or deadlock detection, which will simply invoke pthread_cond_signal().

There is no need for a separate lock_wait_timeout_task that would wake up once per second. Also, the relative latching order of lock_sys.wait_mutex and lock_sys.mutex (which will be replaced with lock_sys.latch) will be swapped. Hopefully this necessary refactoring will provide some additional performance benefit.
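The core of such a wait might look like the following sketch, assuming a simplified transaction object with its own mutex, condition variable, and wait_lock pointer (hypothetical structure; the real code lives in the InnoDB lock subsystem):

```cpp
#include <pthread.h>
#include <cerrno>
#include <ctime>

struct trx_sim {
  pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
  pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
  void *wait_lock = nullptr;   // set while waiting; cleared when the lock is granted
};

// Returns true if the lock was granted, false on lock wait timeout.
bool lock_wait(trx_sim *trx, unsigned timeout_sec) {
  timespec abstime;
  clock_gettime(CLOCK_REALTIME, &abstime);   // pthread_cond_timedwait's default clock
  abstime.tv_sec += timeout_sec;

  pthread_mutex_lock(&trx->mutex);
  int err = 0;
  // The lock grantor, ha_kill_query(), or the deadlock detector clears
  // wait_lock (or marks the trx) and signals trx->cond to end the wait.
  while (trx->wait_lock && err != ETIMEDOUT)
    err = pthread_cond_timedwait(&trx->cond, &trx->mutex, &abstime);
  bool granted = !trx->wait_lock;
  pthread_mutex_unlock(&trx->mutex);
  return granted;   // ETIMEDOUT here corresponds to innodb_lock_wait_timeout
}
```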
Marko Mäkelä added a comment:

Also, srv_slot_t can be removed and the locality of reference improved by storing trx->lock.wait_lock and trx->lock.cond at adjacent addresses.
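A sketch of this data layout idea (hypothetical names): instead of a detached slot array, the wait state lives inside the transaction object itself, so the fields touched together during a wait share cache lines:

```cpp
#include <condition_variable>

struct lock_t;                      // opaque here

struct trx_lock_sim {
  lock_t *wait_lock = nullptr;      // the lock being waited for ...
  std::condition_variable cond;     // ... and the wakeup channel, adjacent to it
  // Waking a waiter touches both fields with one cache miss instead of
  // chasing a pointer into a separate srv_slot_t array.
};
```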
Marko Mäkelä added a comment (edited):

zhaiwx1987, I adapted the MDEV-11392 idea from MySQL Bug #72948, but I introduced a single counter dict_table_t::n_lock_x_or_s. There is actually quite a bit of room for improvement in lock_sys, in addition to what was done in MySQL 8.0.21 WL#10314.
Marko Mäkelä added a comment (edited):

The lock_wait() refactoring was causing some assertion failures in the start/stop que_thr_t bookkeeping. I think that it is simplest to remove that bookkeeping along with some unnecessary data members and enum values. Edit: This was done in MDEV-24671. As an added bonus, innodb_lock_wait_timeout is enforced in a more timely fashion (no extra 1-second delay).

It turns out that the partitioned lock_sys.mutex will not work efficiently with the old DeadlockChecker. It must be refactored, similar to what was done in Oracle Bug #29882690 in MySQL 8.0.18.
Marko Mäkelä added a comment:

As a minimal change, I moved the DeadlockChecker::search() invocation to lock_wait(). A separate deadlock checker thread or task might still be useful. For that, I do not think that there is a need to introduce any blocking_trx data member. In our code, it should be safe to follow the chain trx->lock.wait_lock->trx while holding lock_sys.wait_mutex and possibly also trx->mutex.
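A minimal sketch of following that chain (hypothetical, simplified types; the real detector also chooses a victim, handles table locks, and so on): under a mutex standing in for lock_sys.wait_mutex, step from each waiting transaction to the holder of the lock it waits for, and report a deadlock if the chain loops back:

```cpp
#include <mutex>

struct trx_sim2;
struct lock_sim { trx_sim2 *holder; };      // the blocking lock's owner
struct trx_sim2 { lock_sim *wait_lock; };   // null when not waiting

std::mutex wait_mutex;                      // stands in for lock_sys.wait_mutex

bool in_deadlock_cycle(trx_sim2 *trx) {
  std::lock_guard<std::mutex> g(wait_mutex);  // stabilizes all wait_lock fields
  trx_sim2 *t = trx;
  // The depth bound guards against cycles that do not pass through trx.
  for (unsigned depth = 0; t && t->wait_lock && depth < 200; depth++) {
    t = t->wait_lock->holder;               // step to the blocking transaction
    if (t == trx)
      return true;                          // the chain loops back: deadlock
  }
  return false;
}
```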
Marko Mäkelä added a comment:

We replaced lock_sys.mutex with lock_sys.latch (MDEV-24167), which is 4 or 8 bytes on Linux, Microsoft Windows, and OpenBSD. On other systems, a native rw-lock or a mutex and two condition variables will be used.

The entire world of transactional locks can be stopped by acquiring lock_sys.latch in exclusive mode.

Scalability is achieved by making most users acquire a combination of a shared lock_sys.latch and a lock-specific dict_table_t::lock_mutex or lock_sys_t::hash_latch that is embedded in each cache line of lock_sys.rec_hash, lock_sys.prdt_hash, or lock_sys.prdt_page_hash. The lock_sys_t::hash_latch is always 4 or 8 bytes. On systems other than Linux, OpenBSD, and Microsoft Windows, lock_sys_t::hash_latch::release() will always acquire a mutex and signal a condition variable. This is a known scalability bottleneck that could be improved further on such systems, for example by splitting the mutex and condition variable. (If such systems supported a lightweight mutex of at most sizeof(void*), we could happily use that.)

Until MDEV-24738 has been fixed, the deadlock detector will remain a significant bottleneck, because each lock_wait() acquires lock_sys.latch in exclusive mode. This bottleneck can be avoided by setting innodb_deadlock_detect=OFF.
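A hedged sketch of the hash_latch embedding idea (simplified to a spinlock; the real latch blocks via a futex or, on other systems, via the mutex plus condition variable described above): a latch that occupies a single machine word can live directly in each cache line of the lock hash array, so acquiring a cell's latch and reading the cell costs one cache miss:

```cpp
#include <atomic>
#include <cstdint>

struct alignas(64) hash_cell {        // one cache line of the hash array
  std::atomic<uint32_t> latch{0};     // word-sized latch, as in lock_sys
  void *first_lock = nullptr;         // head of this cell's lock list
  // ... the remaining bytes of the line hold further hash pointers ...

  void acquire() {
    uint32_t zero = 0;
    // Spin until the latch word flips from 0 to 1; real code would
    // fall back to a futex wait instead of spinning indefinitely.
    while (!latch.compare_exchange_weak(zero, 1, std::memory_order_acquire))
      zero = 0;
  }
  void release() { latch.store(0, std::memory_order_release); }
};
```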