[MDEV-26779] reduce lock_sys.wait_mutex contention by using spinloop construct Created: 2021-10-07  Updated: 2021-10-27  Resolved: 2021-10-27

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6
Fix Version/s: 10.6.5

Type: Bug Priority: Major
Reporter: Krunal Bauskar Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: performance
Environment:

GNU/Linux on ARMv8 (Aarch64)


Attachments: PDF File MDEV-26779-1.pdf     PDF File MDEV-26779-2.pdf     PDF File MDEV-26779-3.pdf     PNG File custom spin vs pthread spin.png     PNG File patch (custom spin) and patch (pthread spin).png     PDF File sysbench.pdf     PNG File update-index (512 threads) x86 custom spin vs pthread spin loop.png     PNG File update-index (uniform) - baseline vs patched - arm.png     PNG File update-index (uniform) - baseline vs patched - x86.png     PNG File update-index (zipfian) - baseline vs patched - arm.png     PNG File update-index (zipfian) - baseline vs patched - x86.png     PNG File update-index (zipfian) - inline, update-index (zipfian) - noniline, update-non-index (zipfian) - inline and update-non-index (zipfian) - noninline.png     PNG File update-non-index (uniform) - baseline vs patched - arm.png     PNG File update-non-index (uniform) - baseline vs patched - x86.png     PNG File update-non-index (zipfian) - baseline vs patched - arm.png     PNG File update-non-index (zipfian) - baseline vs patched - x86.png     PNG File x86-wait-mutex-run.png    
Issue Links:
Problem/Incident
is caused by MDEV-21452 Use condition variables and normal mu... Closed
Relates
relates to MDEV-16232 Use fewer mini-transactions Stalled
relates to MDEV-16406 Refactor the InnoDB record locks Open

 Description   

reduce lock_sys.wait_mutex contention by using spinloop construct

  • wait_mutex plays an important role when the workload involves conflicting transactions.
  • On a heavily contended system, as concurrency grows, it is quite possible
      that the majority of transactions have to wait
      before acquiring resources.
  • This causes a lot of contention on wait_mutex, but most of it is
      short-lived, which suggests using a spin loop
      to avoid giving up the compute core and incurring the additional latency of involving the OS scheduler.
  • The idea has shown promising results, with performance improving by 70-100% for write workloads.
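The idea above can be illustrated with a minimal spin-then-block sketch (the constant and names are assumptions for illustration, not the actual MariaDB implementation): short-lived contention is absorbed in the spin phase, and the OS scheduler is only involved when the spin budget runs out.

```cpp
#include <mutex>

constexpr unsigned SPIN_ROUNDS = 30;   // assumed spin budget, a tuning knob

// Returns true if the lock was acquired during the spin phase,
// false if we had to fall back to a blocking (kernel-assisted) wait.
inline bool spin_then_lock(std::mutex &m)
{
  for (unsigned i = 0; i < SPIN_ROUNDS; i++)
  {
    if (m.try_lock())
      return true;                     // cheap user-space acquisition
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause");     // reduce power and bus pressure while spinning
#endif
  }
  m.lock();                            // give up the core: blocking wait
  return false;
}
```

On an uncontended mutex this acquires the lock on the first iteration; under heavy but short-lived contention it avoids the scheduler round-trip entirely.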


 Comments   
Comment by Krunal Bauskar [ 2021-10-07 ]

PR created: https://github.com/MariaDB/server/pull/1923

Comment by Krunal Bauskar [ 2021-10-07 ]

Please check the attached graph.

1. For zipfian (a contention-heavy workload), the patch shows a significant improvement (70-100%) on both architectures, x86 and ARM.
2. For uniform (as expected, given its low contention), no change in performance is observed.

(test-case: CPU bound, 300 sec (test) + 20 sec (sleep) ... repeated 5 times)

Comment by Krunal Bauskar [ 2021-10-14 ]

1. As part of another issue (MDEV-26769), based on a comment from Mark Callaghan, it was discovered that the pthread library can be made to spin by using PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP. The MySQL mutex framework can enable this by passing MY_MUTEX_INIT_FAST.

Given this built-in construct, it was decided to try using MY_MUTEX_INIT_FAST for wait_mutex.

Testing suggests that this built-in pthread construct is less effective.

Please check the attached graphs.
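For context, on GNU/Linux passing MY_MUTEX_INIT_FAST boils down to roughly the following (a hedged sketch, not the MariaDB macro expanded verbatim): the mutex is created with the ADAPTIVE kind, so pthread_mutex_lock() spins briefly in user space before falling back to a futex wait.

```cpp
#include <pthread.h>

// Illustrative helper: create a mutex of the adaptive kind where available.
inline int init_adaptive_mutex(pthread_mutex_t *m)
{
  pthread_mutexattr_t attr;
  int err = pthread_mutexattr_init(&attr);
  if (err)
    return err;
#ifdef PTHREAD_MUTEX_ADAPTIVE_NP
  // glibc-specific: lock() spins a short, libc-chosen time before sleeping.
  pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
#endif
  err = pthread_mutex_init(m, &attr);
  pthread_mutexattr_destroy(&attr);
  return err;
}
```

Note that the spin duration here is chosen by the C library, not by the caller, which is the key difference from a custom spin loop.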

Comment by Krunal Bauskar [ 2021-10-14 ]

Adding a graph for x86 that again shows the custom spin loop performing better than the pthread-based inherent spin loop.

https://jira.mariadb.org/secure/attachment/59761/update-index%20%28512%20threads%29%20x86%20custom%20spin%20vs%20pthread%20spin%20loop.png

Comment by Mark Callaghan [ 2021-10-14 ]

A feature of the custom mutex code provided by upstream MySQL/InnoDB was explicit control over spinning (how long to spin before going to sleep). While Pthread mutex has a variant that spins, the spin duration is much less than the duration for the custom InnoDB mutex.

This is an interesting problem. The mutex behavior, whether from pthread mutex or custom InnoDB mutex, is global (all instances of the mutex behave mostly the same). But it is likely that some mutex instances benefit from longer spinning durations, while others do not.

Comment by Marko Mäkelä [ 2021-10-15 ]

mdcallag, indeed, we did not check how long PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP would spin.

In MariaDB 10.6, the lock_sys was changed quite a bit. When the lock_wait() function was introduced in the fix of MDEV-24671, the relative latching order of lock_sys.wait_mutex and lock_sys.mutex was swapped. Thus, some waits for lock_sys.wait_mutex would occur while we are in ‘stop the world’ mode, holding exclusive lock_sys.latch (which replaced lock_sys.mutex in MDEV-20612). My hypothesis was that spinning in such cases would be harmful. In the final version of krunalbauskar’s pull request, for each mutex acquisition we explicitly specify whether to spin.

axel is yet to report his numbers. For the one-liner change of replacing nullptr with MY_MUTEX_INIT_FAST in the mysql_mutex_init() call, he observed serious regressions. I would guess that the adaptive logic in the GNU libc would be badly misguided due to the two fundamentally different scenarios in which pthread_mutex_lock() is called. If we are holding exclusive lock_sys.latch, other threads cannot create any locks, and therefore the contention on lock_sys.wait_mutex should disappear after the current holder releases it.

In MariaDB, the spinning logic has evolved over time. A long time ago in MDEV-8684, danblack removed the pausing for a random duration, because the internal state of the pseudorandom number generator was a contention point. So, we pause for a fixed duration. Then, MDEV-19845 introduced a spin loop multiplier because in the Intel Skylake microarchitecture, the latency of the PAUSE instruction increased from the previous microarchitecture (Haswell) from about 10 to about 100 clock cycles. I think that in the successor Cascade Lake it was reduced again.
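The evolution described above ends at a fixed-duration pause with a multiplier. A rough sketch of that shape (names and the returned iteration count are illustrative, not the actual ut_delay() source):

```cpp
// Assumed tuning variable: raised (e.g. to 10) on microarchitectures where
// the PAUSE instruction got cheaper again relative to Skylake's ~100 cycles.
static unsigned my_cpu_relax_multiplier = 1;

// Pause for a fixed duration (no randomness, per the MDEV-8684 change).
// Returns the number of iterations executed, for illustration only.
inline unsigned hypothetical_delay(unsigned rounds)
{
  const unsigned iters = rounds * my_cpu_relax_multiplier;
  for (unsigned i = 0; i < iters; i++)
  {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause");
#elif defined(__aarch64__)
    __asm__ __volatile__("isb");  // a common spin-relax choice on ARMv8
#endif
  }
  return iters;
}
```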

While I believe that spinloops are mostly working around a bad design, this lock_sys.wait_mutex is a special case that needs special measures. I did not come up with any good idea how to split or partition that mutex. I think that the only badly contended InnoDB mutex in MariaDB 10.6 is log_sys.mutex, which I hope to fix with a redo log block format change in MDEV-14425.

It might be interesting to check whether we would be better off without any MY_MUTEX_INIT_FAST in our version of InnoDB. We are using it since 10.5 (MDEV-23399 and MDEV-23855) for buf_pool.mutex and buf_pool.flush_list_mutex. The latter should be much less contended after MDEV-25113, and the former will be better after MDEV-26827. Apart from them, we have a mutex related to MariaDB’s binlog group commit (which I would like to see removed in MDEV-25611).

Comment by Axel Schwenke [ 2021-10-15 ]

Added sysbench.pdf. marko thinks we're spinning too much.

Comment by Krunal Bauskar [ 2021-10-18 ]

@Axel,

It seems there are a lot of unrelated branches that are creating some confusion here, including a variant of the patch.
The original patch can be found here. https://github.com/mysqlonarm/server/commit/6f7444d1498ed0ec90ee4200ed9311e688a2621a

Can you run a quick test with a write-only workload (please try both uniform and Zipfian) and share the results?

If the original patch helps, then the further optimizations being proposed (non-inline function, one-liner) could be explored.

Comment by Krunal Bauskar [ 2021-10-18 ]

please check https://jira.mariadb.org/secure/attachment/59788/x86-wait-mutex-run.png.
I re-benchmarked the said patch, this time against the latest baseline, to avoid any last-minute changes to the trunk that could explain the difference Axel observed.

Comment by Krunal Bauskar [ 2021-10-19 ]

Axel,
based on Marko's request I re-tested the non-inline function against the inline version (the original patch). The non-inline function is 2-6% slower than the inline version.
Check the attached graph: https://jira.mariadb.org/secure/thumbnail/59812/_thumb_59812.png

Comment by Vladislav Vaintroub [ 2021-10-19 ]

I think 2% can be sacrificed for the sake of getting accurate profiles. We did not know that ut_delay() accounted for 40% to 70% in 10.4 in some sysbench runs until I uninlined it to get the accurate picture.
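The profiling point above can be sketched as follows (a hedged illustration, not the actual MariaDB change): keeping the delay function out of line makes profilers attribute spin time to the delay function itself rather than smearing it across its callers.

```cpp
// Without this attribute, the compiler would likely inline the loop into
// every caller, and a sampling profiler would charge the spin time to them.
#if defined(__GNUC__)
__attribute__((noinline))
#endif
unsigned profiled_delay(unsigned rounds)
{
  unsigned spins = 0;
  for (unsigned i = 0; i < rounds; i++)
    ++spins;            // stands in for the PAUSE loop body
  return spins;
}
```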

Comment by Vladislav Vaintroub [ 2021-10-19 ]

In general, spinning tends to hide true scalability bugs. I believe that spinlocks, if used at all, must be a last resort, the exception, not the rule. I think, for example, that it is possible to decrease contention on log_sys.mutex, and that copying to the redo log buffer concurrently without holding a mutex is doable. However, once contention is hidden with spins, there will be less motivation to fix the bottlenecks.

Comment by Mark Callaghan [ 2021-10-19 ]

I get your point, but there will always be contention, and it helps to have some mutexes that can tolerate it by spinning for a while before they go to sleep.

There used to be a MySQL storage engine written by someone who didn't like spin locks. That engine had a custom mutex that included busy-wait spinning; alas, the code had a simple bug and the compiler optimized the spinning away. By fixing that and bringing back spinning, I contributed a large speedup (2X) to Falcon.
http://mysqlha.blogspot.com/2009/03/make-falcon-faster.html
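The class of bug described above is worth illustrating (the Falcon code itself is not reproduced here): a busy-wait on a plain variable lets the optimizer hoist the load out of the loop, so the "spin" observes nothing. Reading the flag through std::atomic (or, historically, volatile) forces a fresh load each iteration.

```cpp
#include <atomic>

// Buggy pattern the compiler may legally optimize away
// (single load of `flag`, then an empty or infinite loop):
//   bool broken_wait(int &flag) { while (!flag) {} return true; }

// Correct pattern: each iteration re-reads the flag with acquire semantics.
inline bool spin_until_set(const std::atomic<bool> &flag, unsigned max_spins)
{
  for (unsigned i = 0; i < max_spins; i++)
    if (flag.load(std::memory_order_acquire))
      return true;        // observed the update made by another thread
  return false;           // spin budget exhausted; caller should sleep
}
```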

Comment by Krunal Bauskar [ 2021-10-25 ]

After folding in the multiple changes related to the buf_pool mutex optimization, I re-evaluated the said patch.

1. For uniform (as we observed before), there is no change in performance.
2. For zipfian (the contention case): on ARM there is a consistent performance improvement at higher concurrency. On x86, update-non-index at 1024 threads showed some regression (which could be due to a flushing issue), while update-index and lower concurrencies continued to perform well.

Comment by Axel Schwenke [ 2021-10-26 ]

MDEV-26779-1.pdf shows no significant performance change. This was, however, run with a uniform RNG and the datadir on a RAM disk.

Comment by Axel Schwenke [ 2021-10-27 ]

MDEV-26779-2.pdf completes the picture with the inlined version. MDEV-26779-3.pdf shows results for the Zipf RNG. In any case, it looks like the non-inlined variant behaves better on x86.

Comment by Marko Mäkelä [ 2021-10-27 ]

I think that for now, we can apply a simple ARMv8-specific change of initializing lock_sys.wait_mutex with MY_MUTEX_INIT_FAST a.k.a. PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP, similar to what we did for the log_sys mutexes in MDEV-26855.

I would not change other platforms than ARMv8, because our own tests on AMD64 do not show any significant improvement.

Spinning is basically a hack to work around contention. The lock_sys.wait_mutex must be split in some way to properly fix this, in a future task.
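A platform-conditional initializer in the spirit of the fix described above might look roughly like this (a hedged sketch; the actual MariaDB source is not quoted here, and the variable name is illustrative):

```cpp
#include <pthread.h>

// Adaptive (briefly spinning) mutex on ARMv8/glibc; plain mutex elsewhere,
// since the AMD64 tests showed no significant improvement.
static pthread_mutex_t lock_sys_wait_mutex =
#if defined(__aarch64__) && defined(PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP)
    PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP;
#else
    PTHREAD_MUTEX_INITIALIZER;
#endif
```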

Comment by Marko Mäkelä [ 2021-10-27 ]

I think that before attempting to split lock_sys.wait_mutex, we should implement the following and re-evaluate the situation:

  • MDEV-16232 so that UPDATE and DELETE will avoid setting non-gap locks in the non-contended case
  • MDEV-16406 so that accessing the record locks will hopefully be faster and the critical sections of lock_sys.wait_mutex smaller.
Generated at Thu Feb 08 09:47:53 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.