[MDEV-32374] log_sys.lsn_lock is a performance hog Created: 2023-10-09  Updated: 2024-02-02  Resolved: 2023-11-21

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.8, 10.9, 10.10, 10.11, 11.0, 11.1, 11.2, 11.3
Fix Version/s: 10.11.7, 11.0.5, 11.1.4, 11.2.3, 11.3.2

Type: Bug Priority: Major
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 1
Labels: performance

Attachments: PNG File b58e1ce.png    
Issue Links:
Problem/Incident
is caused by MDEV-27774 Reduce scalability bottlenecks in mtr... Closed
Relates
relates to MDEV-27866 Switching log_sys.latch to use spin b... Closed

 Description   

MDEV-27774 made it possible to for multiple mtr_t::commit() to write to log_sys.buf concurrently, by replacing the InnoDB log_sys.mutex with log_sys.latch and log_sys.lsn_lock. The latter appears to be a performance bottleneck.

Because we need to keep multiple log_sys data members consistent with each other, we cannot simply invoke std::atomic::fetch_add and std::atomic::compare_exchange_weak on log_sys.buf_free to achieve a similar result. But, we could try to fit more data members in the same single cache line with log_sys.latch.



 Comments   
Comment by Marko Mäkelä [ 2023-10-09 ]

This would revert a part of MDEV-27866 by making log_sys.lsn_lock a small futex-based mutex instead of using pthread_mutex_t.

log_t::append_prepare() would access two cache lines: one for log_sys.lsn and friends, and another one for anything between log_sys.latch and log_sys.write_lsn.

Comment by Vladislav Vaintroub [ 2023-10-17 ]

I tested. Small futex based does not perform well on Windows, and also it was problematic in the past. WaitOnAddress is not a system call, it has a considerable usermode part. CRITICAL_SECTION is good enough, and is small enough, and, if anything smaller is needed SRWLOCK is good enough, but WaitOnAddress based thing does not perform as well.

Besides, I did not see this mutex being the main bottleneck on any tests, as it was suggested. Maybe one only sees it on the latest Intel NUMAs, for that, I do not know.

https://github.com/MariaDB/server/commit/ae9bcf4d5b0f950b9bee91a1fb334bc626953ef2
retains the mutex on both Windows and ARM, where it shown to perform well in the past, and uses that futex srw_mutex on Intel/Linux. Maybe someone with more NUMA nodes than I have access to, can test it.

Comment by Marko Mäkelä [ 2023-10-26 ]

Thank you, wlad. I revised your version of the patch a little further and will test it today. Ready-made packages of that should appear in https://ci.mariadb.org/39851/ shortly. This build also includes a merge of MDEV-32050.

Comment by Marko Mäkelä [ 2023-10-27 ]

Contrary to my expectation, the current patch is hurting performance. I will have to diagnose this deeper.

Comment by Marko Mäkelä [ 2023-10-30 ]

Using offcputime, I found out that log_t::append_prepare() was unnecessarily invoking set_check_flush_or_checkpoint() when no log checkpoint but only a write of the log buffer was needed. I pushed a revised version for further testing.

Comment by Steve Shaw [ 2023-11-17 ]

Have tested as attached on an Intel Cascade Lake 8280L server and at 24 virtual users the performance gain is over 5%. The gain is seen up to 60 virtual users at 627,393NOPM which is 1,458,929 MariaDB TPM which is likely to exceed the majority of installed systems. The balance is its slightly lower beyond the peak point however not many systems will be tested on bare metal servers past this point.

Comment by Marko Mäkelä [ 2023-11-17 ]

wlad, can you please review this version that the graph b58e1ce.png is for?

For the record, initially we got some very mixed results until steve.shaw@intel.com revised HammerDB so that it would clean up the server between each workload batch. Now that it will wait for the history to be purged and the modified pages to be written back to the file system, each test step will be more deterministic, with no garbage from previous steps lying around.

Comment by Vladislav Vaintroub [ 2023-11-17 ]

marko, I suggest following improvement to this patch - do not call log_write_up_to if lsn is 0.
It is going to be a no-op anyway, but quite a more expensive no-op than just an "if"

diff --git a/storage/innobase/log/log0log.cc b/storage/innobase/log/log0log.cc
index af802e9b2fb..9f39b303964 100644
--- a/storage/innobase/log/log0log.cc
+++ b/storage/innobase/log/log0log.cc
@@ -932,6 +932,7 @@ void log_write_up_to(lsn_t lsn, bool durable,
 {
   ut_ad(!srv_read_only_mode || (log_sys.buf_free < log_sys.max_buf_free));
   ut_ad(lsn != LSN_MAX);
+  ut_ad(lsn != 0);
 
   if (UNIV_UNLIKELY(recv_no_ibuf_operations))
   {
diff --git a/storage/innobase/mtr/mtr0mtr.cc b/storage/innobase/mtr/mtr0mtr.cc
index 40c5cfe1eb8..d3f963beab9 100644
--- a/storage/innobase/mtr/mtr0mtr.cc
+++ b/storage/innobase/mtr/mtr0mtr.cc
@@ -469,7 +469,8 @@ void mtr_t::commit()
     if (UNIV_UNLIKELY(lsns.second != PAGE_FLUSH_NO))
       buf_flush_ahead(m_commit_lsn, lsns.second == PAGE_FLUSH_SYNC);
 
-    log_write_up_to(write_lsn, false);
+    if (write_lsn)
+      log_write_up_to(write_lsn, false);
   }
   else
   {

Comment by Marko Mäkelä [ 2023-11-18 ]

wlad, thank you for the suggestion; I will apply it.

A call log_write_up_to(0, false, nullptr) will (unless PMEM is used) cause a call to write_lock.acquire(0, nullptr), which in turn will cause a relaxed load of write_lock.m_value. Other than that, it’s just procedure calls and evaluating conditions based on something that could be passed in registers. Evaluating conditions can be expensive. I agree, this write_lsn = log_sys.get_write_target() would most of the time be 0, and it should be more efficient to add one more condition to avoid an unlikely procedure call and avoid evaluating several conditions inside that procedure.

Comment by Vladislav Vaintroub [ 2023-11-19 ]

Up to 10.11, but not in 11.x branches anymore, there is also additional code that accesses global variable `recv_no_ibuf_operations` . inside `log_write_up_to`

 if (UNIV_UNLIKELY(recv_no_ibuf_operations))
  {
    /* A non-final batch of recovery is active no writes to the log
    are allowed yet. */
    ut_a(!callback);
    return;
  }

Plus, on Linux, as you mentioned if PMEM is compiled in (it usually is?), log.is_pmem, which reads log.flush_buf

log_write_up_to is not very inefficient when it is used with lsn=0, but I'd better avoid it in mtr_t::commit. I tried it on normal oltp_update_index, when it was called approx 1 million times per second, and lsn was not 0 only about every 2 seconds, giving 99.9999% probability of a dummy invocation.

Generated at Thu Feb 08 10:30:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.