MariaDB Server / MDEV-24142

rw_lock_t has unnecessarily complex wait logic

Details

    Description

      The InnoDB custom implementation of a read-write latch (rw_lock_t) encapsulates two os_event_t. Each os_event_t encapsulates a mutex and a condition variable; one mutex would suffice. There are also some data fields to control the waiting, although the rw_lock_t::lock_word alone is sufficient for that.

      As part of MDEV-21452, we would replace all os_event_t with condition variables. Simplifying the read-write latch implementation is a prerequisite for that.
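The simplification described above can be pictured as follows: a read-write latch built from one mutex, one condition variable and a single lock word, instead of two os_event_t each carrying its own mutex and condition variable. This is a minimal illustrative sketch, not InnoDB's actual implementation; the class name, the lock_word encoding, and the state() accessor are made up.

```cpp
#include <condition_variable>
#include <mutex>

// Minimal sketch of the proposed simplification: one mutex and one condition
// variable, with a single lock word deciding who may proceed.
class simple_rw_lock {
  std::mutex m;
  std::condition_variable cv;
  int lock_word = 0;  // > 0: number of S holders; -1: one X holder
public:
  void s_lock() {
    std::unique_lock<std::mutex> g(m);
    cv.wait(g, [this] { return lock_word >= 0; });
    ++lock_word;
  }
  void s_unlock() {
    std::lock_guard<std::mutex> g(m);
    if (--lock_word == 0) cv.notify_all();
  }
  void x_lock() {
    std::unique_lock<std::mutex> g(m);
    cv.wait(g, [this] { return lock_word == 0; });
    lock_word = -1;
  }
  void x_unlock() {
    std::lock_guard<std::mutex> g(m);
    lock_word = 0;
    cv.notify_all();
  }
  int state() {  // for observation in this example only
    std::lock_guard<std::mutex> g(m);
    return lock_word;
  }
};
```

A single condition variable is enough here because every unlock that could unblock anyone broadcasts, and each waiter re-checks the lock word before proceeding.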


          Activity

            Marko Mäkelä added a comment:

            With the approach based on two srw_lock there is a potential deadlock if s_lock() is acquired after u_lock(). To solve this, I unconditionally implemented srw_lock on top of rw_lock and its std::atomic<uint32_t>, and in the end implemented an update lock in rw_lock and srw_lock. The only additional ‘service’ of sux_lock is that it allows recursive U and X locks, as required by InnoDB dict_index_t::lock and buf_block_t::lock. Compared to the old rw_lock_t, if a U lock is upgraded to an X lock, then all pre-existing U lock entries in mtr_t::m_memo must be changed to X lock entries. This is necessary because a u_unlock() after an x_unlock() is not allowed.

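A hypothetical sketch of the idea in the comment above: a single std::atomic<uint32_t> lock word can support S, U and X modes, and a U→X upgrade must wait until all S holders have drained. The bit layout, class name and try-only interface below are invented for illustration and do not match the real rw_lock/srw_lock encoding.

```cpp
#include <atomic>
#include <cstdint>

// One atomic word: top bits flag the X and U holders, low bits count S
// holders. Illustrative encoding only.
class sux_word {
  static constexpr uint32_t X = 1u << 31;  // exclusive holder present
  static constexpr uint32_t U = 1u << 30;  // update holder present
  std::atomic<uint32_t> word{0};
public:
  bool s_trylock() {  // S is compatible with S and U, but not with X
    uint32_t w = word.load(std::memory_order_relaxed);
    while (!(w & X))
      if (word.compare_exchange_weak(w, w + 1, std::memory_order_acquire))
        return true;
    return false;
  }
  void s_unlock() { word.fetch_sub(1, std::memory_order_release); }
  bool u_trylock() {  // at most one U holder; S holders may remain
    uint32_t w = word.load(std::memory_order_relaxed);
    while (!(w & (U | X)))
      if (word.compare_exchange_weak(w, w | U, std::memory_order_acquire))
        return true;
    return false;
  }
  void u_unlock() { word.fetch_and(~U, std::memory_order_release); }
  bool u_x_upgrade() {  // U -> X succeeds only once all S holders are gone
    uint32_t expected = U;
    return word.compare_exchange_strong(expected, X,
                                        std::memory_order_acquire);
  }
  void x_unlock() { word.store(0, std::memory_order_release); }
};
```

Note how the upgrade is a single compare-and-swap: it can only succeed when the word contains exactly the U flag, which is why pre-existing S holders block it.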

            Marko Mäkelä added a comment:

            I pushed this in multiple commits, to keep some mechanical changes separate so that the changes are easier to follow. The commits in git log order (newest to oldest) are as follows:

            ba2d45dc540 MDEV-24142: Remove INFORMATION_SCHEMA.INNODB_MUTEXES
            9702be2c73c MDEV-24142: Remove __FILE__,__LINE__ related to buf_block_t::lock
            ac028ec5d85 MDEV-24142: Remove the LatchDebug interface to rw-locks
            06efef4be39 MDEV-24308: Windows improvements
            03ca6495df3 MDEV-24142: Replace InnoDB rw_lock_t with sux_lock
            d46b42489a6 MDEV-24142 preparation: Add srw_mutex and srw_lock::u_lock()

            There are some user-visible changes:

            • INFORMATION_SCHEMA.INNODB_MUTEXES (MDEV-7399) was removed. Starting with MariaDB 10.2.2, it reported the number of waits in rw-locks and no information on mutexes.
            • SHOW ENGINE INNODB MUTEX will not report any information about rw-locks.
            • PERFORMANCE_SCHEMA will distinguish lock waits. The try_ events will be logged for wait-free rw-lock acquisitions. Actual high-level ‘try’ operations in the code will not be logged in PERFORMANCE_SCHEMA at all.
            • The SEMAPHORES section of SHOW ENGINE INNODB STATUS will only list information about TTASEventMutex (if building with cmake -DMUTEXTYPE=event), subject to removal in MDEV-21452.

            The last commit removes my reimplementation of the SHOW ENGINE INNODB MUTEX and INFORMATION_SCHEMA.INNODB_MUTEXES output that did not rely on rw_lock_list, which we removed. The reimplemented output was better than the original, because the mutex name was `index_name`(`database_name`.`schema_name`) rather than some obscure constant for all dict_index_t::lock instances. We removed the output and the related sux_lock::waits because the bookkeeping incurred a significant overhead. This was also why I decided to distinguish immediately granted latch requests from blocking ones in PERFORMANCE_SCHEMA, so that the information about waits remains available via that interface.

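The overhead argument above can be illustrated with a sketch: if a wait counter lives inside every latch, the blocking path must update it, while a wait-free acquisition should pay nothing. The counted_lock below is a made-up spinlock standing in for the removed sux_lock::waits bookkeeping, not the real latch.

```cpp
#include <atomic>
#include <cstdint>

// Fast path acquires without any bookkeeping; only the slow (blocking)
// path records a wait. This is the shape of instrumentation that was
// replaced by PERFORMANCE_SCHEMA events.
class counted_lock {
  std::atomic<bool> locked{false};
public:
  std::atomic<uint64_t> waits{0};  // the kind of counter that was removed
  bool try_lock() {                // wait-free fast path: no bookkeeping
    bool expected = false;
    return locked.compare_exchange_strong(expected, true);
  }
  void lock() {
    if (try_lock()) return;        // granted immediately: nothing recorded
    waits.fetch_add(1);            // blocking path: record one wait
    bool expected = false;
    while (!locked.compare_exchange_weak(expected, true))
      expected = false;            // spin (a real latch would sleep here)
  }
  void unlock() { locked.store(false); }
};
```

Even this cheap counter dirties a cache line on every contended acquisition, which hints at why per-latch wait statistics were dropped from the hot path.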

            Marko Mäkelä added a comment:

            I made one more tweak to reduce sizeof(block_lock) from 24 to 16 bytes on 64-bit systems. The sizeof(index_lock) still incurs alignment losses of 4+4 bytes unless cmake -DPLUGIN_PERFSCHEMA=NO is used. With PERFORMANCE_SCHEMA support enabled, I have sizeof(dict_index_t)==400.

            Thanks to MDEV-15053 removing buf_block_t::mutex earlier, we now have sizeof(buf_page_t)==112 and sizeof(buf_block_t)==184 on my AMD64 GNU/Linux system. I remember that I got sizeof(buf_page_t) down to exactly 64 bytes on 32-bit systems, back when I implemented ROW_FORMAT=COMPRESSED for the InnoDB Plugin for MySQL 5.1. Since then, we replaced some bit fields with std::atomic. Maybe we should try using ‘manual’ std::atomic::fetch_or() and friends to shrink it further.
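The ‘manual’ std::atomic::fetch_or() idea mentioned above can be sketched like this: flags that would otherwise occupy separate bit-fields (guarded by a mutex) share one atomic 32-bit word. The flag names below are invented for illustration; they are not buf_page_t's actual members.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical page-state flags packed into one atomic word and updated
// with fetch_or()/fetch_and(), so no mutex and no per-flag storage is
// needed.
enum : uint32_t {
  PAGE_IO_FIX = 1u << 0,
  PAGE_DIRTY  = 1u << 1,
  PAGE_OLD    = 1u << 2,
};

struct packed_flags {
  std::atomic<uint32_t> word{0};
  void set(uint32_t f)   { word.fetch_or(f, std::memory_order_relaxed); }
  void clear(uint32_t f) { word.fetch_and(~f, std::memory_order_relaxed); }
  bool test(uint32_t f) const {
    return word.load(std::memory_order_relaxed) & f;
  }
};

// On typical platforms the whole set of flags costs 4 bytes.
static_assert(sizeof(packed_flags) == 4, "all flags fit in one word");
```

The trade-off is that each flag update becomes a read-modify-write on a shared cache line, so this pays off mainly where the flags are read far more often than written.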

            For reference, the MariaDB Server 10.2 with the same build options has sizeof(buf_page_t)==136, sizeof(buf_block_t)==368, sizeof(rw_lock_t)==112, sizeof(dict_index_t)==496. We have exactly halved sizeof(buf_block_t) thanks to MDEV-15053 and MDEV-24142.


            Marko Mäkelä added a comment:

            As noted in MDEV-25404, we will need separate wait queues for shared and exclusive requests. The sizeof(sux_lock), and thus sizeof(buf_block_t) and sizeof(dict_index_t), will increase.


            Marko Mäkelä added a comment:

            MDEV-25404 on Linux, OpenBSD and Microsoft Windows doubled sizeof(ssux_lock_low) from 4 to 8 bytes, that is, from 1 to 2 futex-backed 32-bit words, so that we can minimize system calls and have a separate queue for waiting exclusive requests, which will be notified when the last shared request is released.

            Unless cmake -DPLUGIN_PERFSCHEMA=NO is used, ssux_lock will extend ssux_lock_low with a pointer for instrumentation; this variant is used by dict_index_t::lock.

            The sux_lock extends ssux_lock_low or ssux_lock with a 4-byte recursion count and a thread identifier of the U or X holder.

            On other systems (or if SUX_LOCK_GENERIC is defined), the layout of all these remains unchanged even with MDEV-25404: instead of 2*4 bytes in ssux_lock_low, we have a 4-byte lock word, 1 mutex and 2 condition variables. I estimate that to be somewhere between 100 and 200 bytes.
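The two-word layout described above can be pictured as follows, assuming 4-byte atomics: one futex-backed word holding the lock state, and a second word serving as the separate wait queue for exclusive requests. The field names and the wake-decision helper are illustrative; the real ssux_lock_low differs in detail.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative layout of the two futex-backed 32-bit words (as on Linux,
// OpenBSD and Windows after MDEV-25404). Not the real ssux_lock_low.
struct ssux_words {
  std::atomic<uint32_t> readers{0};  // lock word: shared count, flags, ...
  std::atomic<uint32_t> writer{0};   // separate wait word for X requests

  // Releasing the last shared holder inspects the separate writer word,
  // so that a wake-up (a futex_wake on 'writer' in a real implementation)
  // is issued only when an exclusive request is actually waiting.
  bool s_unlock_wakes_writer() {
    return readers.fetch_sub(1) == 1 && writer.load() != 0;
  }
};

static_assert(sizeof(ssux_words) == 8, "2 x 4-byte futex-backed words");
```

Keeping exclusive waiters on their own word is what avoids waking every sleeper on each release: shared and exclusive requests can be woken independently, each with its own futex word.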


            People

              Assignee: Marko Mäkelä
              Reporter: Marko Mäkelä
