MariaDB Server / MDEV-26779

reduce lock_sys.wait_mutex contention by using spinloop construct

Details

    Description

      reduce lock_sys.wait_mutex contention by using spinloop construct

      • wait_mutex plays an important role when the workload involves conflicting transactions.
      • On a heavily contended system, as concurrency grows, it is quite possible that the majority of transactions have to wait before acquiring resources.
      • This causes a lot of contention on wait_mutex, but most of it is short-lived, which suggests using a spin loop to avoid giving up the compute core, since yielding involves the OS scheduler and adds latency.
      • The idea has shown promising results, with performance improving by up to 70-100% for write workloads.

      Attachments

      Issue Links

      Activity

            krunalbauskar Krunal Bauskar created issue -
            krunalbauskar Krunal Bauskar added a comment - PR created: https://github.com/MariaDB/server/pull/1923

            Please check the attached graph.

            1. For zipfian (a contention-heavy workload), the patch has shown significant improvement (70-100% on both architectures, x86 and ARM).
            2. For uniform (as expected, given the low contention), no change in performance is observed.

            (test-case: CPU bound, 300 sec (test) + 20 sec (sleep) ... repeated 5 times)

            krunalbauskar Krunal Bauskar added a comment -
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ]

            As part of another issue (MDEV-26769), based on a comment from Mark Callaghan, it was discovered that the pthread library can be made to spin by using PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP. The MySQL mutex framework can enable it by passing MY_MUTEX_INIT_FAST.

            Given this built-in construct, it was decided to try MY_MUTEX_INIT_FAST for wait_mutex.

            Testing seems to suggest that this pthread built-in construct is less effective.

            Please check the attached graphs.

            krunalbauskar Krunal Bauskar added a comment -
            krunalbauskar Krunal Bauskar made changes -
            marko Marko Mäkelä made changes -
            krunalbauskar Krunal Bauskar made changes -

            Adding a graph for x86 that likewise confirms that the custom spin loop performs better than the pthread-based built-in spin loop.

            https://jira.mariadb.org/secure/attachment/59761/update-index%20%28512%20threads%29%20x86%20custom%20spin%20vs%20pthread%20spin%20loop.png

            krunalbauskar Krunal Bauskar added a comment -

            A feature of the custom mutex code provided by upstream MySQL/InnoDB was explicit control over spinning (how long to spin before going to sleep). While Pthread mutex has a variant that spins, the spin duration is much less than the duration for the custom InnoDB mutex.

            This is an interesting problem. The mutex behavior, whether from pthread mutex or custom InnoDB mutex, is global (all instances of the mutex behave mostly the same). But it is likely that some mutex instances benefit from longer spinning durations, while others do not.

            mdcallag Mark Callaghan added a comment -

            mdcallag, indeed, we did not check how long PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP would spin.

            In MariaDB 10.6, the lock_sys was changed quite a bit. When the lock_wait() function was introduced in the fix of MDEV-24671, the relative latching order of lock_sys.wait_mutex and lock_sys.mutex was swapped. Thus, some waits for lock_sys.wait_mutex would occur while we are in ‘stop the world’ mode, holding exclusive lock_sys.latch (which replaced lock_sys.mutex in MDEV-20612). My hypothesis was that spinning in such cases would be harmful. In the final version of krunalbauskar’s pull request, for each mutex acquisition we explicitly specify whether to spin.

            axel has yet to report his numbers. For the one-liner change to replace nullptr with MY_MUTEX_INIT_FAST in the mysql_mutex_init() call, he observed serious regressions. I would guess that the adaptive logic in the GNU libc would be badly misguided due to the two fundamentally different scenarios in which pthread_mutex_lock() is called. If we are holding exclusive lock_sys.latch, other threads cannot create any locks, and therefore the contention on lock_sys.wait_mutex should disappear after the current holder releases it.

            In MariaDB, the spinning logic has evolved over time. A long time ago, in MDEV-8684, danblack removed the pausing for a random duration, because the internal state of the pseudorandom number generator was a contention point. So, we pause for a fixed duration. Then, MDEV-19845 introduced a spin loop multiplier, because in the Intel Skylake microarchitecture the latency of the PAUSE instruction increased from about 10 clock cycles in the previous microarchitecture (Haswell) to about 100. I think that in the successor Cascade Lake it was reduced again.

            While I believe that spinloops are mostly working around a bad design, this lock_sys.wait_mutex is a special case that needs special measures. I did not come up with any good idea how to split or partition that mutex. I think that the only badly contended InnoDB mutex in MariaDB 10.6 is log_sys.mutex, which I hope to fix with a redo log block format change in MDEV-14425.

            It might be interesting to check whether we would be better off without any MY_MUTEX_INIT_FAST in our version of InnoDB. We are using it since 10.5 (MDEV-23399 and MDEV-23855) for buf_pool.mutex and buf_pool.flush_list_mutex. The latter should be much less contended after MDEV-25113, and the former will be better after MDEV-26827. Apart from them, we have a mutex related to MariaDB’s binlog group commit (which I would like to see removed in MDEV-25611).

            marko Marko Mäkelä added a comment -
            axel Axel Schwenke made changes -
            Attachment sysbench.pdf [ 59776 ]
            axel Axel Schwenke added a comment -

            Added sysbench.pdf. marko thinks we're spinning too much.


            @Axel,

            It seems there are a lot of unrelated branches creating some confusion here, including a variant of the patch.
            The original patch can be found here: https://github.com/mysqlonarm/server/commit/6f7444d1498ed0ec90ee4200ed9311e688a2621a

            Can you run a quick test with a write-only workload (please try both uniform and zipfian) and share the result?

            If the original patch helps, then further optimizations, such as the proposed non-inline function or the one-liner, could be explored.

            krunalbauskar Krunal Bauskar added a comment -
            krunalbauskar Krunal Bauskar made changes -
            Attachment x86-wait-mutex-run.png [ 59788 ]

            Please check https://jira.mariadb.org/secure/attachment/59788/x86-wait-mutex-run.png.
            I re-benchmarked the patch, this time against the latest baseline, to rule out any last-minute changes to the trunk that could explain the difference Axel is seeing.

            krunalbauskar Krunal Bauskar added a comment -

            Axel,
            based on the request from Marko, I re-tested the non-inline function against the inline version (original patch). The non-inline function is 2-6% slower than the inline version.
            Check the attached graph: https://jira.mariadb.org/secure/thumbnail/59812/_thumb_59812.png

            krunalbauskar Krunal Bauskar added a comment -
            wlad Vladislav Vaintroub added a comment - - edited

            I think 2% can be sacrificed for the sake of getting accurate profiles. We did not know that ut_delay() accounted for 40% to 70% in 10.4, in some sysbench runs, until I un-inlined it to get an accurate picture.


            In general, spinning tends to hide true scalability bugs. I believe that spinlocks, if used at all, must be a last resort, an exception, not the rule. I think, for example, that it is possible to decrease contention on log_sys.mutex, and that copying to the redo log buffer concurrently without holding a mutex is doable. However, once contention is hidden by spinning, there will be less motivation to fix the bottlenecks.

            wlad Vladislav Vaintroub added a comment -

            I get your point, but there will always be contention, and it helps to have some mutexes that can tolerate it via some spinning before they go to sleep.

            There used to be a MySQL storage engine written by someone who didn't like spin locks. That engine had a custom mutex that included busy-wait spinning; alas, the code had a simple bug and the compiler optimized away the spinning. By fixing that and bringing back the spinning, I contributed a large (2X) speedup to Falcon.
            http://mysqlha.blogspot.com/2009/03/make-falcon-faster.html

            mdcallag Mark Callaghan added a comment -
            marko Marko Mäkelä made changes -

            After folding in the multiple changes related to the buf_pool mutex optimization, I re-evaluated the patch.

            1. For uniform (as we observed before), there is no change in performance.
            2. For zipfian (the contention case), on ARM there is a consistent improvement in performance at higher concurrency. On x86, update-non-index at 1024 threads showed some regression (which could be due to a flushing issue); update-index and lower concurrency levels continued to perform well.

            krunalbauskar Krunal Bauskar added a comment -
            axel Axel Schwenke made changes -
            Attachment MDEV-26779-1.pdf [ 60029 ]
            axel Axel Schwenke added a comment -

            MDEV-26779-1.pdf shows no significant performance change. This was, however, run with the uniform RNG and the datadir on a RAM disk.

            axel Axel Schwenke made changes -
            Attachment MDEV-26779-2.pdf [ 60047 ]
            Attachment MDEV-26779-3.pdf [ 60048 ]
            axel Axel Schwenke added a comment -

            MDEV-26779-2.pdf completes the picture with the inlined version. MDEV-26779-3.pdf shows results for the Zipf RNG. In any case, it looks like the non-inlined variant behaves better on x86.


            I think that for now, we can apply a simple ARMv8-specific change of initializing lock_sys.wait_mutex with MY_MUTEX_INIT_FAST a.k.a. PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP, similar to what we did for the log_sys mutexes in MDEV-26855.

            I would not change other platforms than ARMv8, because our own tests on AMD64 do not show any significant improvement.

            Spinning is basically a hack to work around contention. The lock_sys.wait_mutex must be split in some way to properly fix this, in a future task.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Issue Type Task [ 3 ] Bug [ 1 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Fix Version/s 10.6 [ 24028 ]
            Affects Version/s 10.6 [ 24028 ]
            Environment GNU/Linux on ARMv8 (Aarch64)
            Labels performance
            marko Marko Mäkelä made changes -
            Resolution date 2021-10-27 14:42:25
            marko Marko Mäkelä made changes -
            Fix Version/s 10.6.5 [ 26034 ]
            Fix Version/s 10.6 [ 24028 ]
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Closed [ 6 ]

            I think that before attempting to split lock_sys.wait_mutex, we should implement the following and re-evaluate the situation:

            • MDEV-16232 so that UPDATE and DELETE will avoid setting non-gap locks in the non-contended case
            • MDEV-16406 so that accessing the record locks will hopefully be faster and the critical sections of lock_sys.wait_mutex smaller.
            marko Marko Mäkelä added a comment -
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 125993 ] MariaDB v4 [ 159756 ]

            People

              Assignee: Marko Mäkelä
              Reporter: Krunal Bauskar
              Votes: 0
              Watchers: 7