steve.shaw@intel.com reports that write-intensive workloads on a NUMA system end up spending a lot of time in the Linux kernel function native_queued_spin_lock_slowpath.part.0. He has provided a patch that adds a user-space spinlock around the calls to mtr_t::do_write(), which significantly improves throughput at larger numbers of concurrent connections in his test environment.
As far as I can tell, that patch would allow only one mtr_t::do_write() call to proceed at a time, and thus make waits on log_sys.latch extremely unlikely. But that would also seem to undo part of what MDEV-27774 achieved.
If I understood it correctly, the idea would be better implemented at a slightly lower level, to allow maximum concurrency:
  if (UNIV_UNLIKELY(m_user_space && !m_user_space->max_lsn &&
                    !is_predefined_tablespace(m_user_space->id)))
The to-be-written member function rd_lock_spin() would avoid invoking futex_wait(), and instead keep invoking MY_RELAX_CPU() in the spin loop.
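A minimal C++ sketch of what such a spinning shared-lock acquisition could look like (the class and member names here are hypothetical, not the actual log_sys.latch implementation):

```cpp
#include <atomic>
#include <thread>

// Hypothetical sketch of an rd_lock_spin() member function: acquire a
// shared latch by spinning in user space instead of calling futex_wait().
class shared_spin_latch
{
  // >= 0: number of shared (read) holders; -1: exclusively locked
  std::atomic<int> word{0};
public:
  void rd_lock_spin()
  {
    for (;;)
    {
      int r = word.load(std::memory_order_relaxed);
      if (r >= 0 &&
          word.compare_exchange_weak(r, r + 1, std::memory_order_acquire,
                                     std::memory_order_relaxed))
        return;
      // Stand-in for MY_RELAX_CPU(): a pause hint inside the spin loop;
      // crucially, no system call such as futex_wait() is made here.
      std::this_thread::yield();
    }
  }
  void rd_unlock() { word.fetch_sub(1, std::memory_order_release); }
  bool wr_lock_try()
  {
    int expected = 0;
    return word.compare_exchange_strong(expected, -1,
                                        std::memory_order_acquire);
  }
  void wr_unlock() { word.store(0, std::memory_order_release); }
};
```

Exclusive acquisition is shown only as a try-lock here; a real latch would also need a writer wait path.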
An exclusive log_sys.latch will be acquired rarely and held for a rather short time: during DDL operations, undo tablespace truncation, and around log checkpoints.
Some experimentation will be needed to find something that scales well across the board (from embedded systems to high-end servers).
Marko Mäkelä added a comment:

I created two patches. On my Haswell-microarchitecture dual Intel Xeon E5-2630 v4, both result in significantly worse throughput with 256-thread Sysbench oltp_update_index (I actually intended to test oltp_update_non_index) than the 10.11 baseline. With the baseline, the Linux kernel function native_queued_spin_lock_slowpath is the busiest one; with either fix, it ends up in second place, behind the new function lsn_delay().
The first variant more closely resembles what steve.shaw@intel.com did. It uses std::atomic<bool> for the lock word; the lock acquisition would be an xchg instruction.
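A minimal sketch of such a test-and-set spinlock on a std::atomic<bool> lock word (illustrative only, not the actual patch; on x86-64, exchange() on this type compiles to an xchg instruction):

```cpp
#include <atomic>

// Sketch of the first variant: a plain test-and-set spinlock on a
// std::atomic<bool> lock word.
class lsn_spinlock
{
  std::atomic<bool> locked{false};
public:
  void lock()
  {
    // exchange() returns the previous value; we own the lock once it
    // returns false. On x86-64 this is the xchg instruction.
    while (locked.exchange(true, std::memory_order_acquire))
      /* spin; real code would issue a pause hint here */;
  }
  bool try_lock()
  { return !locked.exchange(true, std::memory_order_acquire); }
  void unlock() { locked.store(false, std::memory_order_release); }
};
```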
The second variant merges log_sys.lsn_lock into the most significant bit of log_sys.buf_free. Lock acquisition is a loop around lock cmpxchg. It yields better throughput on my system than the first variant. One further thing that could be tried is a combination of lock bts and a separate mov to load the log_sys.buf_free value. On ARMv8, POWER, RISC-V, or another modern ISA, we could probably use a simple std::atomic::fetch_or().
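A hedged sketch of the second variant's idea: one atomic word whose most significant bit serves as the lock, with the remaining bits holding the buf_free value (class name and layout are illustrative):

```cpp
#include <atomic>
#include <cstdint>

// Sketch of the second variant: the most significant bit of a single word
// doubles as a spinlock; the low bits hold the buf_free value.
class buf_free_with_lock
{
  static constexpr std::uint64_t LOCK_BIT = std::uint64_t{1} << 63;
  std::atomic<std::uint64_t> word{0};
public:
  // Acquire the lock bit and return the buf_free value that was current at
  // acquisition time. The compare_exchange loop corresponds to a loop
  // around lock cmpxchg on x86; on ARMv8, POWER or RISC-V, a retry loop
  // around fetch_or(LOCK_BIT) could be used instead.
  std::uint64_t lock()
  {
    for (;;)
    {
      std::uint64_t v = word.load(std::memory_order_relaxed);
      if (!(v & LOCK_BIT) &&
          word.compare_exchange_weak(v, v | LOCK_BIT,
                                     std::memory_order_acquire,
                                     std::memory_order_relaxed))
        return v;
      // spin (pause hint omitted in this sketch)
    }
  }
  // Release the lock and publish an updated buf_free in the same store.
  void unlock(std::uint64_t buf_free)
  { word.store(buf_free & ~LOCK_BIT, std::memory_order_release); }
};
```

One attraction of this layout is that the unlock also publishes the new buf_free value in a single store.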
I think that some further testing on newer Intel microarchitectures is needed to determine whether I am on the right track with these.
Vladislav Vaintroub added a comment (edited):

I tried the patch on Alder Lake (server running on P-cores, sysbench on E-cores): update_index_256threads_x_10tables_x_1mio_rows.svg

In short, the spinlocks do make performance worse. TPS for a 1-minute update_index run with 256 threads, 10 tables, and 1 million rows (in memory):

baseline: 235146.70
spinlock: 151019.13
spinflag: 147738.52

I also attached the spinflag.svg and baseline.svg flamegraphs, so one can see what is going on. Apparently append_prepare/lsn_delay takes about half of the time (48%) in the "spin" variant on Alder Lake, while in the baseline, append_prepare is barely noticeable at 0.5%. So it is not just Haswell that performs badly, and this roughly matches what we saw two years ago.
Marko Mäkelä added a comment:

https://github.com/MariaDB/server/pull/3148 introduces SET GLOBAL innodb_log_spin_wait_delay, which can be used to enable or disable the spin lock while the server is running. The value 50 should roughly correspond to what the previous spinflag patch did. The default value innodb_log_spin_wait_delay=0 means that log_sys.lsn_lock will be used. I think that we must rely on steve.shaw@intel.com to test this on Emerald Rapids.
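The gating between the two code paths could look roughly like the following sketch (illustrative only; the actual lsn_delay() logic in the pull request may differ in detail):

```cpp
#include <atomic>
#include <mutex>

// Illustrative sketch: how a runtime-tunable spin_wait_delay setting could
// choose between a user-space spin loop and a blocking lock.
struct lsn_lock_t
{
  std::atomic<unsigned> spin_wait_delay{0};  // 0 = use the blocking lock
  std::atomic<bool> locked{false};
  std::mutex fallback;
  bool used_spin = false;  // written only by the current lock holder

  void acquire()
  {
    unsigned delay = spin_wait_delay.load(std::memory_order_relaxed);
    if (delay == 0)
    {
      fallback.lock();  // the log_sys.lsn_lock path
      used_spin = false;
      return;
    }
    while (locked.exchange(true, std::memory_order_acquire))
      for (unsigned i = 0; i < delay; i++)
        ;  // placeholder for MY_RELAX_CPU()-style pause hints
    used_spin = true;
  }
  void release()
  {
    if (used_spin)
      locked.store(false, std::memory_order_release);
    else
      fallback.unlock();
  }
};
```

Remembering which path was taken at acquisition time keeps the lock consistent even if SET GLOBAL changes the setting while the lock is held.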
Steve Shaw added a comment:

I have attached results from a test Emerald Rapids system with 56-core CPUs, covering both 1-socket and 2-socket tests. This improves performance by 9% (1 socket) and 12% (2 sockets) respectively, and gives the highest MariaDB throughput we have measured from any release so far, with very stable performance throughout each of these tests.
Marko Mäkelä added a comment:

I think that it would be interesting to know the limits on the number of concurrent threads on more recent microarchitectures. Common sense suggests that spinning works better when the number of concurrent threads is limited to a fraction of the number of hardware threads. I did not test such low concurrency on my Haswell system.

In any case, this is easy to tune if it does not work, simply by SET GLOBAL innodb_log_spin_wait_delay.