Details

    Description

      lock_sys is one of three major InnoDB scalability bottlenecks. Scalability issues are especially obvious under sysbench OLTP update index/non-index benchmarks.

      It is not yet clear how exactly lock_sys should be optimised.


          Activity

            svoj Sergey Vojtovich created issue -

            As noted in MDEV-20483, the apparent reason for table_locks to exist is that it is a cache of trx_locks that can only be accessed by the thread that is executing the transaction. This allows callers of lock_table_has() to avoid accessing trx_t::mutex. Maybe we should simply omit table locks from trx_locks, and keep them in table_locks only?

            Maybe we could store table_locks in a lock-free hash table, so that they can be traversed by diagnostic printouts.

            Similarly, maybe we can extend MDEV-16406 (Refactor the InnoDB record locks) by using a lock-free hash table that maps (trx_id, space_id, page_number, heap_number), or a subset of it such as (space_id, page_number), to a bitmap.
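            A hypothetical sketch of such a mapping, using a plain std::unordered_map in place of a real lock-free table; the key layout, the fixed bitmap capacity, and all names (page_key, set_lock, is_locked) are assumptions for illustration only:

```cpp
#include <bitset>
#include <cstdint>
#include <unordered_map>

// Hypothetical key identifying one index page: (space_id, page_number).
struct page_key {
  uint32_t space_id;
  uint32_t page_number;
  bool operator==(const page_key& o) const {
    return space_id == o.space_id && page_number == o.page_number;
  }
};

struct page_key_hash {
  size_t operator()(const page_key& k) const {
    return static_cast<size_t>(
        ((uint64_t{k.space_id} << 32) | k.page_number) * 0x9e3779b97f4a7c15ULL);
  }
};

// One bit per record heap number on the page (fixed capacity for the sketch).
using heap_bitmap = std::bitset<1024>;

// Plain map standing in for the proposed lock-free hash table.
std::unordered_map<page_key, heap_bitmap, page_key_hash> rec_locks;

void set_lock(uint32_t space, uint32_t page, unsigned heap_no) {
  rec_locks[{space, page}].set(heap_no);
}

bool is_locked(uint32_t space, uint32_t page, unsigned heap_no) {
  auto it = rec_locks.find({space, page});
  return it != rec_locks.end() && it->second.test(heap_no);
}
```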

            marko Marko Mäkelä added a comment -
            zhaiwx1987 zhai weixiang added a comment -

            We partitioned the record lock hash, which works fine. The difficult part is how to detect deadlocks with multiple partitions. A lock-free lock system is a good idea; AFAIK, AWS Aurora did a similar job to make it lock-free. There's a paper describing this: https://ts.data61.csiro.au/publications/nicta_full_text/6465.pdf

            serg Sergei Golubchik made changes -
            Assignee Sergey Vojtovich [ svoj ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.5 [ 23123 ]
            marko Marko Mäkelä added a comment - - edited

            MySQL 8.0.21 implemented a sharded lock_sys mutex.

            Edit: they seem to have replaced the mutex with a partitioned rw-latch, using page_id_t as the key. For deadlock detection, they appear to employ a global rw-latch.

            I think that it might be helpful to first implement MDEV-16232 to avoid the creation of explicit record locks during non-conflicting DELETE and UPDATE. Table IX lock creation would still cause contention on lock_sys.


            In MDEV-21452 I was trying to use trx_t::mutex instead of lock_sys.mutex with a condition variable, to resume lock_wait_suspend_thread(). That did not quite work: we would get some race condition, most likely due to the wait ending prematurely. The change was mostly adding trx->mutex acquisition around lock_reset_lock_and_trx_wait() calls.

            The code around suspending and resuming threads is rather convoluted. In particular, que_thr_stop_for_mysql() and lock_wait_suspend_thread() are separated from each other, and the trx->mutex is being acquired and released multiple times while a lock wait is being registered. Also, there are multiple state fields related to lock waits, both in que_thr_t and trx_t.

            I think that performance would improve somewhat if we cleaned this up.

            marko Marko Mäkelä added a comment -

            I reviewed the MySQL 8.0.21 WL#10314 again.

            • It replaces lock_sys.mutex with an rw-lock. Many operations, such as deadlock detection, will be protected by the exclusive global lock.
            • To improve scalability, it introduces 256+256 mutexes, indexed by page_id_t or dict_table_t::id. Each shard is protected by a combination of shared global rw-lock and the mutex.
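            The shard selection in the second bullet might look roughly like this; the shard count and all names here are assumptions for illustration, not the actual WL#10314 values:

```cpp
#include <cstdint>
#include <functional>

// Assumed number of record-lock shard mutexes.
constexpr unsigned N_REC_SHARDS = 256;

// Simplified page identifier, mirroring the page_id_t key mentioned above.
struct page_id_t {
  uint32_t space_id;
  uint32_t page_no;
};

// Map a page to the index of the mutex that protects its record locks.
unsigned rec_lock_shard(const page_id_t& id) {
  uint64_t key = (uint64_t{id.space_id} << 32) | id.page_no;
  return std::hash<uint64_t>{}(key) % N_REC_SHARDS;
}
```

            Every operation on a record lock for a given page would then acquire the shared global rw-lock plus the one mutex indexed here.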

            Because implementing MDEV-16232 would require much more effort than this, I think that we must consider this approach for MariaDB.

            One complication is that lock_wait_suspend_thread() in MDEV-21452 will use a condition variable in combination with lock_sys.mutex, which would be replaced by the rw-latch above. It might be easiest to combine the condition variable with lock_sys.wait_mutex instead of lock_sys.mutex. Only on Microsoft Windows could the native SRWLOCK be combined with CONDITION_VARIABLE.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Assignee Sergey Vojtovich [ svoj ] Marko Mäkelä [ marko ]
            Labels performance
            marko Marko Mäkelä made changes -
            Fix Version/s 10.6 [ 24028 ]
            marko Marko Mäkelä made changes -
            Status Open [ 1 ] In Progress [ 3 ]

            I replaced lock_wait_suspend_thread() with a simple lock_wait() that uses pthread_cond_timedwait() to wait for the lock to be granted. If that call fails with an error, we know that a timeout has occurred. The wait may be interrupted by ha_kill_query() or deadlock detection, which will simply invoke pthread_cond_signal().

            There is no need for a separate lock_wait_timeout_task, which would wake up once per second. Also, the relative latching order of lock_sys.wait_mutex and lock_sys.mutex (which will be replaced with lock_sys.latch) will be swapped. Hopefully this necessary refactoring will provide some additional performance benefit.
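            A minimal sketch of this waiting scheme, using std::condition_variable in place of the raw pthread calls; the struct and function names are illustrative, not the real InnoDB ones:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Per-transaction wait state (illustrative names).
struct trx_lock_wait {
  std::mutex mtx;                // stands in for lock_sys.wait_mutex
  std::condition_variable cond;  // stands in for the per-trx condition
  bool granted = false;
};

enum class wait_result { GRANTED, TIMEOUT };

// Sleep until the lock is granted or the timeout expires. wait_for()
// returning false mirrors pthread_cond_timedwait() failing with ETIMEDOUT.
wait_result lock_wait(trx_lock_wait& w, std::chrono::milliseconds timeout) {
  std::unique_lock<std::mutex> guard(w.mtx);
  if (!w.cond.wait_for(guard, timeout, [&] { return w.granted; }))
    return wait_result::TIMEOUT;
  return wait_result::GRANTED;
}

// Called when the lock is granted, or by kill/deadlock handling;
// notify_one() is the pthread_cond_signal() counterpart.
void grant(trx_lock_wait& w) {
  {
    std::lock_guard<std::mutex> guard(w.mtx);
    w.granted = true;
  }
  w.cond.notify_one();
}
```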

            marko Marko Mäkelä added a comment -

            Also srv_slot_t can be removed and the locality of reference improved by storing trx->lock.wait_lock and trx->lock.cond in adjacent addresses.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä added a comment - - edited

            zhaiwx1987, I adapted the MDEV-11392 idea from MySQL Bug #72948, but I introduced a single counter dict_table_t::n_lock_x_or_s. There is actually quite a bit of room for improvement in lock_sys, in addition to what was done in MySQL 8.0.21 WL#10314.
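            The counter idea might be sketched like this; the surrounding struct and method names are invented for illustration, and only the n_lock_x_or_s member name follows the comment:

```cpp
#include <atomic>
#include <cstdint>

// Sketch: count X or S table locks so the common case (none exist)
// can be checked without taking any latch.
struct dict_table_sketch {
  std::atomic<uint32_t> n_lock_x_or_s{0};

  void x_or_s_acquired() { n_lock_x_or_s.fetch_add(1, std::memory_order_relaxed); }
  void x_or_s_released() { n_lock_x_or_s.fetch_sub(1, std::memory_order_relaxed); }

  // Fast path: an IX/IS request only needs a full conflict check
  // when this returns true.
  bool may_conflict() const {
    return n_lock_x_or_s.load(std::memory_order_relaxed) != 0;
  }
};
```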

            marko Marko Mäkelä added a comment - - edited

            The lock_wait() refactoring was causing some assertion failures in the start/stop que_thr_t bookkeeping. I think that it is simplest to remove that bookkeeping along with some unnecessary data members and enum values. Edit: This was done in MDEV-24671. As an added bonus, innodb_lock_wait_timeout is enforced more promptly (no extra 1-second delay).

            It turns out that the partitioned lock_sys.mutex will not work efficiently with the old DeadlockChecker. It must be refactored, similar to what was done in Oracle Bug #29882690 in MySQL 8.0.18.


            As a minimal change, I moved the DeadlockChecker::search() invocation to lock_wait(). A separate deadlock checker thread or task might still be useful. For that, I do not think that there is a need to introduce any blocking_trx data member. In our code, it should be safe to follow the chain of trx->lock.wait_lock->trx while holding lock_sys.wait_mutex and possibly also trx->mutex.
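            The chain-following idea can be sketched as a cycle check over single wait-for edges (each transaction waits for at most one lock at a time); the graph representation and names are invented for illustration, and the real code would hold lock_sys.wait_mutex while walking the chain:

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Hypothetical wait-for graph: transaction id -> id of the transaction
// holding the conflicting lock (0 = not waiting).
using wait_for_graph = std::unordered_map<uint64_t, uint64_t>;

// Follow the chain of waits starting from trx; revisiting a node
// means the chain loops back on itself, i.e. a deadlock.
bool has_deadlock(const wait_for_graph& g, uint64_t trx) {
  std::unordered_set<uint64_t> seen;
  for (;;) {
    auto it = g.find(trx);
    if (it == g.end() || it->second == 0) return false;  // chain ends
    if (!seen.insert(trx).second) return true;           // cycle found
    trx = it->second;
  }
}
```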

            marko Marko Mäkelä added a comment -

            We replaced lock_sys.mutex with a lock_sys.latch (MDEV-24167) that is 4 or 8 bytes on Linux, Microsoft Windows or OpenBSD. On other systems, a native rw-lock or a mutex and two condition variables will be used.

            The entire world of transactional locks can be stopped by acquiring lock_sys.latch in exclusive mode.

            Scalability is achieved by making most users use a combination of a shared lock_sys.latch and a lock-specific dict_table_t::lock_mutex or lock_sys_t::hash_latch that is embedded in each cache line of the lock_sys.rec_hash, lock_sys.prdt_hash, or lock_sys.prdt_page_hash. The lock_sys_t::hash_latch is always 4 or 8 bytes. On systems other than Linux, OpenBSD, and Microsoft Windows, lock_sys_t::hash_latch::release() will always acquire a mutex and signal a condition variable. This is a known scalability bottleneck on such systems and could be improved further by splitting the mutex and condition variable. (If such systems supported a lightweight mutex of at most sizeof(void*), we could happily use that.)
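            A word-sized latch of this kind could be sketched as a single atomic word; this illustrative version merely spins on acquire, whereas the description above implies that release() on some systems parks and wakes waiters via a mutex and condition variable:

```cpp
#include <atomic>
#include <cstdint>

// Sketch of a latch that fits in 4 bytes, so it can be embedded
// in each cache line of a lock hash array.
class hash_latch_sketch {
  std::atomic<uint32_t> word{0};

 public:
  void acquire() {
    uint32_t expected = 0;
    // Spin until we swap 0 -> 1; a production latch would block instead.
    while (!word.compare_exchange_weak(expected, 1, std::memory_order_acquire))
      expected = 0;
  }
  void release() { word.store(0, std::memory_order_release); }
  bool is_locked() const {
    return word.load(std::memory_order_relaxed) != 0;
  }
};
```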

            Until MDEV-24738 has been fixed, the deadlock detector will remain a significant bottleneck, because each lock_wait() would acquire lock_sys.latch in exclusive mode. This bottleneck can be avoided by setting innodb_deadlock_detect=OFF.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Fix Version/s 10.6.0 [ 24431 ]
            Fix Version/s 10.6 [ 24028 ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]

            People

              Assignee: marko Marko Mäkelä
              Reporter: svoj Sergey Vojtovich
              Votes: 2
              Watchers: 12

              Dates

                Created:
                Updated:
                Resolved: 2021-02-12
