MariaDB Server: MDEV-24948

thd_need_wait_reports() hurts performance

Details

    Description

      In MDEV-20612, MDEV-24671 and MDEV-24738, the concurrency of the InnoDB lock subsystem was improved, except for one aspect. Each time the refactored lock_wait() is invoked to handle a lock wait, a predicate must be evaluated. Not only is the predicate somewhat time-consuming to evaluate, it also holds unnecessarily even when replication is not set up, which causes another bottleneck. The code in question is highlighted in the following:

      diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc
      --- a/storage/innobase/lock/lock0lock.cc
      +++ b/storage/innobase/lock/lock0lock.cc
      @@ -1758,7 +1758,7 @@ dberr_t lock_wait(que_thr_t *thr)
           is not being used. We want to be allow the user to skip
           lock_wait_rpl_report(). */
           const bool rpl= !(type_mode & LOCK_AUTO_INC) && trx->mysql_thd &&
      -      innodb_deadlock_detect && thd_need_wait_reports(trx->mysql_thd);
      +      innodb_deadlock_detect && false && thd_need_wait_reports(trx->mysql_thd);
       #endif
           timespec abstime;
           set_timespec_time_nsec(abstime, suspend_time.val * 1000);
      

      If this condition holds (with the patch it will never hold), then lock_wait_rpl_report() will have to be invoked. That function will not only hold lock_sys.wait_mutex but also hold exclusive lock_sys.latch while traversing the lock graph. This prevents any concurrent registration or removal of locks, even if those locks were not part of any conflict.
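      To make the cost of that code path concrete, here is a minimal, self-contained model of the reporting pattern described above. It is only an illustration, not the actual InnoDB code: the type names, the mutex members and the blockers_of() helper are invented, and thd_rpl_deadlock_check() (a real server function) is stubbed out so that the sketch compiles on its own.

      // Simplified model of the reporting pattern: both global
      // synchronization objects stay held for the whole traversal, and the
      // server layer is called once per (waiter, blocker) pair.
      #include <mutex>
      #include <vector>

      struct THD;                                   // opaque server session
      struct Trx { THD *mysql_thd; };

      // Stub standing in for the real server function of the same name.
      static int thd_rpl_deadlock_check(THD *, THD *) { return 0; }

      struct lock_sys_model
      {
        std::mutex latch;        // stands in for exclusive lock_sys.latch
        std::mutex wait_mutex;   // stands in for lock_sys.wait_mutex

        // In this model, the wait graph is reduced to "who blocks this waiter".
        std::vector<Trx *> blockers_of(const Trx *) const { return {}; }

        void report_waits(Trx *waiter)
        {
          // No other thread can register or remove any lock while this runs.
          std::lock_guard<std::mutex> g1(latch);
          std::lock_guard<std::mutex> g2(wait_mutex);
          for (Trx *blocker : blockers_of(waiter))
            thd_rpl_deadlock_check(waiter->mysql_thd, blocker->mysql_thd);
        }
      };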

      To highlight the impact of this, I created an extreme benchmark where 20 concurrent connections are accessing a table that contains 10 or 20 rows. To alleviate contention on the buffer pool page latches, I padded the PRIMARY KEY with some extra bytes, so that there should be at most a few records per index page. Here are the results:

      server   rows  log_bin=0 (tps)  log_bin=1 (tps)
      plain     10    824.57           819.14
      patched   10    832.04           830.63
      plain     20   3305.20          3324.75
      patched   20   3361.45          3398.25

      There is some fluctuation in the numbers due to the random nature of the contention and the short 60-second run time of the test. Most notably, for 20 rows both the plain and the patched server appeared faster with log_bin=1 than with log_bin=0. There were also faster log_bin=0 runs; I chose the slowest ones.

      At 10 rows and 20 threads, the call to thd_need_wait_reports() seems to incur a clear overhead even when --skip-log-bin is in effect.

      I believe that this API should move from 'push' notification to 'pull' notification. That is, the predicate thd_need_wait_reports() must be removed, and instead, whenever the replication layer actually needs to know the waits-for relations, it could invoke a new function handlerton::report_waits(), which InnoDB would basically implement with lock_wait_rpl_report(). We could still invoke thd_rpl_deadlock_check() to report the graph.
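      As a rough, purely hypothetical sketch of that 'pull' direction (neither handlerton::report_waits() nor any of the other names below exist today; they only illustrate the shape of the proposal):

      // Hypothetical sketch of the proposed 'pull' interface.
      struct THD;

      // Callback supplied by the replication layer, e.g. something like
      // thd_rpl_deadlock_check(waiter, blocker).
      typedef int (*wait_reporter)(THD *waiter, THD *blocker);

      struct handlerton_sketch
      {
        // Proposed new member: the storage engine walks its own wait graph on
        // demand and invokes the callback once per (waiter, blocker) pair.
        // InnoDB would basically implement this with lock_wait_rpl_report().
        void (*report_waits)(handlerton_sketch *hton, wait_reporter cb);
      };

      // The replication layer pulls the waits-for relations only when it
      // actually needs them, instead of InnoDB pushing them on every wait.
      static void rpl_pull_waits(handlerton_sketch *hton, wait_reporter cb)
      {
        if (hton && hton->report_waits)
          hton->report_waits(hton, cb);
      }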


          Activity

            knielsen Kristian Nielsen added a comment - edited

            I implemented a draft patch for this as described in my earlier comment: https://github.com/MariaDB/server/pull/3823

            This should mostly eliminate the overhead of thd_rpl_deadlock_check() for non-parallel-replication threads.

            This patch changes some behaviour of optimistic parallel replication and (non-default) option --binlog-commit-wait-count. These changes in behaviour need to be considered carefully (and may not be appropriate for old GA 10.6 release), but I think they are overall good.

            Another idea (that I did not implement) could be to make thd_rpl_deadlock_check() a batch interface: InnoDB could collect the waited-for THDs in a local array and pass that array to the server layer so that multiple waits are processed in one call. This would be more efficient because it needs fewer calls (only one in the common case where the number of waits fits in the local array); see the sketch after this comment. But maybe this patch is good enough.

            Andrei, I've re-assigned this to me now that I have created a patch, hope that is ok.

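            A rough sketch of what such a batch interface could look like, under the assumption that a new server entry point (here called thd_rpl_deadlock_check_batch(), which does not exist) would accept an array of blocking THDs; all names below are invented for illustration:

            #include <cstddef>

            struct THD;

            // Hypothetical server-layer entry point: report that 'waiter'
            // waits for each of the sessions in 'blockers'.  Stubbed out so
            // the sketch is self-contained.
            static int thd_rpl_deadlock_check_batch(THD *, THD **, size_t)
            { return 0; }

            // Engine side: collect the waited-for THDs into a small local
            // array and flush them to the server layer in one call; in the
            // common case everything fits into a single batch.
            static void report_waits_batched(THD *waiter, THD *const *blockers,
                                             size_t count)
            {
              THD *batch[16];
              size_t n= 0;
              for (size_t i= 0; i < count; i++)
              {
                batch[n++]= blockers[i];
                if (n == sizeof batch / sizeof *batch)
                {
                  thd_rpl_deadlock_check_batch(waiter, batch, n);
                  n= 0;
                }
              }
              if (n)
                thd_rpl_deadlock_check_batch(waiter, batch, n);
            }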

            knielsen Kristian Nielsen added a comment

            I ran Marko's benchmark script before and after my draft patch. Like Marko, I see a small speedup; I got the most stable numbers with log_bin=1:

            server   rows  log_bin=1 (tps)
            plain     20   3305.20
            patched   20   3319.71

            monty Michael Widenius added a comment

            In the log_bin=0 case, the overhead is one call that contains 3 if's and 3 memory accesses to things that do not change.
            The overhead shown can easily be attributed to test-run variance or to memory alignment of the particular builds.
            Taking the slowest run is not meaningful, as an unfortunate interrupt from anything during one run can distort its result.
            The average is the important metric.


            knielsen Kristian Nielsen added a comment

            One reason this can be more impactful than first thought is that the call is made for every pair (thd1, thd2) where thd1 is blocked by thd2, not just once per wait (I think it traverses a wait-for list or graph). Another reason is that, again IIUC, this happens while a critical global mutex of the InnoDB lock subsystem is held.


            marko Marko Mäkelä added a comment

            As the Description points out, the problem is that when the predicate thd_need_wait_reports() holds, sometimes unnecessarily, then InnoDB will have to invoke lock_wait_rpl_report() while holding the contentious lock_sys.wait_mutex. The current patch at https://github.com/MariaDB/server/pull/3823/ at least aims to reduce the overhead of invoking the predicate by caching its result in trx_t. I assume that it would also reduce unnecessary calls to lock_wait_rpl_report().

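            The caching that the pull request appears to do might look roughly like the sketch below. This is only an illustration of the idea, not the actual patch: the member name and the place where the flag is set are assumptions, and the real thd_need_wait_reports() server function is stubbed out so the sketch stands alone.

            struct THD;

            // Stub standing in for the real server predicate of the same name.
            static int thd_need_wait_reports(const THD *) { return 0; }

            struct trx_sketch
            {
              THD *mysql_thd;
              bool need_wait_reports;   // cached result of the predicate
            };

            // Evaluate the predicate once per transaction (assumed here to
            // happen when the transaction is associated with a connection)...
            static void trx_start_sketch(trx_sketch *trx, THD *thd)
            {
              trx->mysql_thd= thd;
              trx->need_wait_reports= thd && thd_need_wait_reports(thd);
            }

            // ...so that lock_wait() only reads a cached flag instead of
            // calling back into the server layer on every lock wait.
            static bool lock_wait_needs_rpl_report(const trx_sketch *trx)
            {
              return trx->need_wait_reports;
            }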

            People

              Assignee: knielsen Kristian Nielsen
              Reporter: marko Marko Mäkelä
              Votes: 0
              Watchers: 9

