MariaDB Server / MDEV-26779

reduce lock_sys.wait_mutex contention by using spinloop construct

Details

    Description

      reduce lock_sys.wait_mutex contention by using spinloop construct

      • wait_mutex plays an important role when the workload involves conflicting transactions.
      • On a heavily contended system, as concurrency grows, it is quite possible that the majority of transactions have to wait before acquiring resources.
      • This causes a lot of contention on wait_mutex, but most of it is short-lived, which suggests using a spin loop to avoid giving up the compute core, since yielding involves the OS scheduler and adds latency.
      • The idea has shown promising results, with performance improving by up to 70-100% for write workloads.

      Attachments

      Issue Links

      Activity

            krunalbauskar Krunal Bauskar created issue -
            krunalbauskar Krunal Bauskar added a comment - PR created: https://github.com/MariaDB/server/pull/1923

            Please check the attached graph.

            1. For zipfian (a contention-heavy workload), the patch has shown significant improvement (70-100% on both architectures, x86 and ARM).
            2. For uniform (as expected, given the low contention), no change in performance is observed.

            (test-case: CPU bound, 300 sec (test) + 20 sec (sleep) ... repeated 5 times)

            krunalbauskar Krunal Bauskar added a comment -
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ]

            As part of another issue (MDEV-26769), based on a comment from Mark Callaghan, it was discovered that the pthread library can be made to spin by using PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP. The MySQL mutex framework can enable it by passing MY_MUTEX_INIT_FAST.

            Given this built-in construct, it was decided to try MY_MUTEX_INIT_FAST for wait_mutex.

            Testing seems to suggest that this pthread built-in construct is less effective.

            Please check the attached graphs.

            krunalbauskar Krunal Bauskar added a comment -
            krunalbauskar Krunal Bauskar made changes -
            marko Marko Mäkelä made changes -
            krunalbauskar Krunal Bauskar made changes -

            Adding a graph for x86 that likewise confirms that the custom spin loop performs better than the pthread-based built-in spin loop.

            https://jira.mariadb.org/secure/attachment/59761/update-index%20%28512%20threads%29%20x86%20custom%20spin%20vs%20pthread%20spin%20loop.png

            krunalbauskar Krunal Bauskar added a comment -

            A feature of the custom mutex code provided by upstream MySQL/InnoDB was explicit control over spinning (how long to spin before going to sleep). While Pthread mutex has a variant that spins, the spin duration is much less than the duration for the custom InnoDB mutex.

            This is an interesting problem. The mutex behavior, whether from pthread mutex or custom InnoDB mutex, is global (all instances of the mutex behave mostly the same). But it is likely that some mutex instances benefit from longer spinning durations, while others do not.

            mdcallag Mark Callaghan added a comment -

            mdcallag, indeed, we did not check how long PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP would spin.

            In MariaDB 10.6, the lock_sys was changed quite a bit. When the lock_wait() function was introduced in the fix of MDEV-24671, the relative latching order of lock_sys.wait_mutex and lock_sys.mutex was swapped. Thus, some waits for lock_sys.wait_mutex would occur while we are in ‘stop the world’ mode, holding exclusive lock_sys.latch (which replaced lock_sys.mutex in MDEV-20612). My hypothesis was that spinning in such cases would be harmful. In the final version of krunalbauskar’s pull request, for each mutex acquisition we explicitly specify whether to spin.

            axel has yet to report his numbers. For the one-liner change to replace nullptr with MY_MUTEX_INIT_FAST in the mysql_mutex_init() call, he observed serious regressions. I would guess that the adaptive logic in the GNU libc would be badly misguided due to the two fundamentally different scenarios in which pthread_mutex_lock() is called. If we are holding exclusive lock_sys.latch, other threads cannot create any locks, and therefore the contention on lock_sys.wait_mutex should disappear after the current holder releases it.

            In MariaDB, the spinning logic has evolved over time. A long time ago, in MDEV-8684, danblack removed the pausing for a random duration, because the internal state of the pseudorandom number generator was a contention point. So, we pause for a fixed duration. Then, MDEV-19845 introduced a spin loop multiplier, because in the Intel Skylake microarchitecture the latency of the PAUSE instruction increased from about 10 clock cycles in the previous microarchitecture (Haswell) to about 100. I think that in the successor Cascade Lake it was reduced again.

            While I believe that spinloops are mostly working around a bad design, this lock_sys.wait_mutex is a special case that needs special measures. I did not come up with any good idea how to split or partition that mutex. I think that the only badly contended InnoDB mutex in MariaDB 10.6 is log_sys.mutex, which I hope to fix with a redo log block format change in MDEV-14425.

            It might be interesting to check whether we would be better off without any MY_MUTEX_INIT_FAST in our version of InnoDB. We are using it since 10.5 (MDEV-23399 and MDEV-23855) for buf_pool.mutex and buf_pool.flush_list_mutex. The latter should be much less contended after MDEV-25113, and the former will be better after MDEV-26827. Apart from them, we have a mutex related to MariaDB’s binlog group commit (which I would like to see removed in MDEV-25611).

            marko Marko Mäkelä added a comment -
            axel Axel Schwenke made changes -
            Attachment sysbench.pdf [ 59776 ]
            axel Axel Schwenke added a comment -

            Added sysbench.pdf. marko thinks we're spinning too much.


            @Axel,

            It seems there are a lot of unrelated branches creating some confusion here, including a variant of the patch.
            The original patch can be found here: https://github.com/mysqlonarm/server/commit/6f7444d1498ed0ec90ee4200ed9311e688a2621a

            Can you run a quick test with a write-only workload (please try both uniform and zipfian) and share the result?

            If the original patch helps, then further optimizations, such as the proposed non-inline function or the one-liner, could be explored.

            krunalbauskar Krunal Bauskar added a comment -
            krunalbauskar Krunal Bauskar made changes -
            Attachment x86-wait-mutex-run.png [ 59788 ]

            Please check https://jira.mariadb.org/secure/attachment/59788/x86-wait-mutex-run.png.
            I re-benchmarked the patch, this time against the latest baseline, to rule out any last-minute changes to the trunk that could explain the difference Axel is seeing.

            krunalbauskar Krunal Bauskar added a comment -

            Axel,
            based on the request from Marko, I re-tested the non-inline function against the inline version (original patch). The non-inline function is 2-6% slower than the inline version.
            Check the attached graph: https://jira.mariadb.org/secure/thumbnail/59812/_thumb_59812.png

            krunalbauskar Krunal Bauskar added a comment -
            wlad Vladislav Vaintroub added a comment - - edited

            I think 2% can be sacrificed for the sake of getting accurate profiles. We did not know that ut_delay() accounted for 40% to 70% in 10.4, in some sysbench runs, until I un-inlined it to get an accurate picture.


            In general, spinning tends to hide true scalability bugs. I believe that spinlocks, if used at all, must be a last resort, an exception, not the rule. I think, for example, that it is possible to decrease contention on log_sys.mutex, and that copying to the redo log buffer concurrently without holding a mutex is doable. However, once contention is hidden by spinning, there will be less motivation to fix the bottlenecks.

            wlad Vladislav Vaintroub added a comment -

            I get your point, but there will always be contention, and it helps to have some mutexes that can tolerate it via some spinning before they go to sleep.

            There used to be a MySQL storage engine written by someone who didn't like spin locks. That engine had a custom mutex that included busy-wait spinning; alas, the code had a simple bug and the compiler optimized away the spinning. By fixing that and bringing back the spinning, I contributed a large (2X) speedup to Falcon.
            http://mysqlha.blogspot.com/2009/03/make-falcon-faster.html

            mdcallag Mark Callaghan added a comment -
            marko Marko Mäkelä made changes -

            After folding in the multiple changes related to the buf_pool mutex optimization, I re-evaluated the patch.

            1. For uniform (as we observed before), there is no change in performance.
            2. For zipfian (the contention case), on ARM there is a consistent improvement in performance at higher concurrency. On x86, update-non-index at 1024 threads showed some regression (which could be due to a flushing issue); update-index and lower concurrency levels continued to perform well.

            krunalbauskar Krunal Bauskar added a comment -
            axel Axel Schwenke made changes -
            Attachment MDEV-26779-1.pdf [ 60029 ]
            axel Axel Schwenke added a comment -

            MDEV-26779-1.pdf shows no significant performance change. This was, however, run with the uniform RNG and the datadir on a RAM disk.

            axel Axel Schwenke made changes -
            Attachment MDEV-26779-2.pdf [ 60047 ]
            Attachment MDEV-26779-3.pdf [ 60048 ]
            axel Axel Schwenke added a comment -

            MDEV-26779-2.pdf completes the picture with the inlined version. MDEV-26779-3.pdf shows results for the Zipf RNG. In any case, it looks like the non-inlined variant behaves better on x86.


            I think that for now, we can apply a simple ARMv8-specific change of initializing lock_sys.wait_mutex with MY_MUTEX_INIT_FAST a.k.a. PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP, similar to what we did for the log_sys mutexes in MDEV-26855.

            I would not change other platforms than ARMv8, because our own tests on AMD64 do not show any significant improvement.

            Spinning is basically a hack to work around contention. The lock_sys.wait_mutex must be split in some way to properly fix this, in a future task.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Issue Type Task [ 3 ] Bug [ 1 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Fix Version/s 10.6 [ 24028 ]
            Affects Version/s 10.6 [ 24028 ]
            Environment GNU/Linux on ARMv8 (Aarch64)
            Labels performance
            marko Marko Mäkelä made changes -
            Resolution date 2021-10-27 14:42:25
            marko Marko Mäkelä made changes -
            Fix Version/s 10.6.5 [ 26034 ]
            Fix Version/s 10.6 [ 24028 ]
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Closed [ 6 ]

            I think that before attempting to split lock_sys.wait_mutex, we should implement the following and re-evaluate the situation:

            • MDEV-16232 so that UPDATE and DELETE will avoid setting non-gap locks in the non-contended case
            • MDEV-16406 so that accessing the record locks will hopefully be faster and the critical sections of lock_sys.wait_mutex smaller.
            marko Marko Mäkelä added a comment -
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 125993 ] MariaDB v4 [ 159756 ]

            People

              Assignee: Marko Mäkelä
              Reporter: Krunal Bauskar
              Votes: 0
              Watchers: 7