MariaDB Server / MDEV-34458

wait_for_read in buf_page_get_low hurts performance

Details

    Description

      I'm benchmarking 10.4 vs 10.11 using sysbench and the oltp_update_index workload.
      It is a rather large benchmark, with about 23GB data , which is run with

      sysbench --db-driver=mysql --range-size=100 --table-size=5000000 --tables=8 --threads=16 --events=0 --warmup-time=5 --time=600 oltp_update_index --rand-type=uniform

      I run it "cold", i,e with slow shutdown after the load, and without bufferpool loading, i.e

      innodb_fast_shutdown=0
      innodb_buffer_pool_load_at_startup=OFF
      innodb_buffer_pool_dump_at_shutdown=OFF
      

      One thing I noticed is a very slow warmup compared to 10.4: 10.4 starts at about 40K tps, while 10.11 starts at about 2K tps and reaches normal speed (~40-50K tps) only after about 140 seconds. During that time mysqld reads from the SSD at 80 MB/s in 10.4, and 8-10 times slower in 10.11.

      Note that this is still a mostly in-memory benchmark; the buffer pool is large enough.

      I noticed a particular stack that shows up often during warmup, but not afterwards, and I would attribute the slowness to it.

      NtDelayExecution
      RtlDelayExecution
      SleepEx
      std::this_thread::sleep_until<std::chrono::steady_clock,std::chrono::duration<__int64,std::ratio<1,1000000000> > >
      std::this_thread::sleep_for
      buf_page_get_low
      btr_latch_prev
      btr_cur_t::search_leaf
      row_ins_sec_index_entry_low
      row_ins_sec_index_entry
      row_upd_sec_index_entry
      row_upd_sec_step
      row_upd
      row_upd_step
      row_update_for_mysql
      ha_innobase::update_row
      handler::ha_update_row
      mysql_update
      mysql_execute_command
      Prepared_statement::execute
      Prepared_statement::execute_loop
      mysql_stmt_execute_common
      mysqld_stmt_execute
      dispatch_command
      do_command
      ....
      

      Note that Windows' Sleep cannot be shorter than 1 ms, and with the default timer resolution the sleeps would still amount to something like 14 ms.
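
      For reference, here is a small standalone C++ snippet (an illustration only, not MariaDB code) that measures what a nominally tiny sleep really costs; on Windows, with the default timer resolution of roughly 15.6 ms, a 100-microsecond request typically blocks for one or more milliseconds:

      // Standalone illustration (not MariaDB code): measure what a short
      // sleep really costs on this platform.
      #include <chrono>
      #include <cstdio>
      #include <thread>

      int main()
      {
        using clock = std::chrono::steady_clock;
        for (int i = 0; i < 5; i++)
        {
          auto start = clock::now();
          std::this_thread::sleep_for(std::chrono::microseconds(100));
          auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(
              clock::now() - start);
          std::printf("requested 100us, slept %lldus\n",
                      static_cast<long long>(elapsed.count()));
        }
        return 0;
      }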

      According to Marko, this sleeping is due to MDEV-33543, which is related to the change buffer and is not needed in 11.x anymore.
      UPD: If I revert 90b95c6149c72f43aa2324353a76f370d018a5ac (MDEV-33543), there is no warmup phase and performance is much better.


          Activity

            marko Marko Mäkelä added a comment

            The change buffer was removed in MDEV-29694 (11.0). I think that in the earliest maintained 11.x branch we should revert the MDEV-33543 change and also remove all references to BTR_MODIFY_PREV, which was only used by the change buffer.

            For 10.6 and 10.11, we will need a more complex solution. My intuition says that we could introduce a new mode value in buf_page_get_gen() and buf_page_get_low() to signal that the call comes from btr_latch_prev(), not for BTR_SEARCH_PREV but for the change-buffer-specific BTR_MODIFY_PREV.
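
            To make the suggestion concrete, here is a self-contained C++ toy sketch (the names Intent, Page and get_page are made up; this is not the real buf_page_get_gen()/buf_page_get_low() interface), under one plausible reading of the idea: the sleep-based wait from MDEV-33543 would apply only when the caller's intent is the change-buffer BTR_MODIFY_PREV case, while other callers would bail out and use ordered latching:

            // Toy model of the suggested caller-intent mode; not MariaDB code.
            #include <atomic>
            #include <chrono>
            #include <cstdio>
            #include <optional>
            #include <thread>

            // Why the caller wants the previous page; a real implementation
            // would be an extra mode in buf_page_get_gen()/buf_page_get_low().
            enum class Intent { Search, SearchPrev, ModifyPrevChangeBuffer };

            struct Page
            {
              std::atomic<bool> being_read{true};  // simulates a pending read
              int data= 0;
            };

            // Returns the page if it is usable; an empty optional tells the
            // caller to fall back to ordered latching instead of waiting here.
            std::optional<Page*> get_page(Page &p, Intent intent)
            {
              if (p.being_read.load())
              {
                if (intent != Intent::ModifyPrevChangeBuffer)
                  return std::nullopt;             // do not wait: bail out
                while (p.being_read.load())        // change-buffer case only
                  std::this_thread::sleep_for(std::chrono::microseconds(100));
              }
              return &p;
            }

            int main()
            {
              Page p;
              std::thread io([&] {                 // simulated read completion
                std::this_thread::sleep_for(std::chrono::milliseconds(5));
                p.data= 42;
                p.being_read.store(false);
              });

              if (!get_page(p, Intent::SearchPrev))
                std::puts("SearchPrev: read in progress, use ordered latching");

              if (auto pg= get_page(p, Intent::ModifyPrevChangeBuffer))
                std::printf("ModifyPrevChangeBuffer: waited, data=%d\n",
                            (*pg)->data);
              io.join();
              return 0;
            }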


            marko Marko Mäkelä added a comment

            I merged my suggested fix https://github.com/MariaDB/server/pull/3362 to 11.1. A solution for 10.6 and 10.11 will have to be found.

            wlad Vladislav Vaintroub added a comment - edited

            I reran oltp_update_index with smaller data (10GB) and still a small buffer pool (2GB).
            The effect of MDEV-33543 seems to be the same as with the large buffer pool: a warmup phase, then normal load. Below are time series (X axis = time, Y axis = transactions per second, red line = current 10.6, blue line = MDEV-33543 reverted).

            Even if the workload is IO-bound, the delay caused by the sleep in MDEV-33543 is only/mostly visible while an empty buffer pool is being loaded, and there is no visible regression once the buffer pool is full. (I suspect this has something to do with how reads are done, i.e. maybe it only happens for async reads, and those are not used with a full buffer pool.)

            I also ran the read-only workload oltp_point_select with exactly the same settings (10GB data, 2GB buffer pool), and there is no negative effect caused by MDEV-33543 for this workload, i.e. the numbers with and without 90b95c6149c72f43aa2324353a76f370d018a5ac are the same.

            debarun Debarun Banerjee added a comment - edited

            I have completed my initial analysis.

            1. The performance regression in load time is indeed caused by the deadlock fix from MDEV-33543; wlad's observation confirms it as well. The area of impact is actually wider, but it is most visible while the BP is being loaded initially via DMLs. Specifically, the response time could be impacted when:

            • DML: a pessimistic operation on an index, i.e. expanding/shrinking the index structure (split/merge)
            • The leaf pages are not found in the buffer pool, which is likely if the BP size is small

            2. MDEV-33543 fills a very serious gap in the current design. "Change buffering" is not required for such a hang: it is a general design issue where we try to acquire a lock without considering proper latch ordering. Assessing the overall impact of reverting the patch, I found that in almost all cases where we gain performance, we actually run the risk of a deadlock.

            Let's look at what the real issue is. It dates back to MDEV-30400, which introduced btr_cur_t::search_leaf(), replacing btr_cur_search_to_nth_level() for leaf page searches. There we try to optimize the pessimistic latch acquisition by trying to latch the previous page out of the normal latch order. The attempt is only a trial, and if the trial fails we fall back to ordered latching; there is no issue so far. However, before trying the latch we need to get the page from the buffer pool, and we use RW_NO_LATCH to get the page fixed in the BP without latching it (buf_page_get_gen). Again, this is fine as long as the page is in the BP. When the page is not in the BP, we would acquire an S latch on the page to wait for it to be read into the BP. This S latch acquisition under RW_NO_LATCH violates the latch order and causes the deadlock.
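
            As an analogy only (plain std::mutex instead of InnoDB page latches), the safe pattern described above looks roughly like this: probing the previous resource out of order is fine as long as it is a non-blocking trial, and on failure everything is released and re-acquired in the canonical order; what is not safe is to block out of order, which is effectively what the S latch wait did:

            // Lock-ordering analogy; not MariaDB code. "prev" and "curr" stand
            // in for the previous and current leaf pages; the canonical
            // acquisition order is prev before curr.
            #include <cstdio>
            #include <mutex>

            std::mutex prev_latch, curr_latch;

            void latch_prev_then_curr_safely()
            {
              // We already hold curr (acquired while descending the tree) and
              // now want prev, which is against the canonical order.
              std::unique_lock<std::mutex> curr(curr_latch);

              // Out-of-order attempt: only a non-blocking trial is safe here.
              std::unique_lock<std::mutex> prev(prev_latch, std::try_to_lock);
              if (!prev.owns_lock())
              {
                // Trial failed: release everything, redo in canonical order.
                curr.unlock();
                prev= std::unique_lock<std::mutex>(prev_latch); // may block
                curr= std::unique_lock<std::mutex>(curr_latch);
              }
              // Blocking on prev while still holding curr (e.g. waiting for a
              // page read under an S latch) is what can deadlock against a
              // thread latching in the canonical prev -> curr order.
              std::puts("holding prev and curr without deadlock risk");
            }

            // In this single-threaded demo the trial always succeeds; the
            // fallback branch matters only under real contention.
            int main() { latch_prev_then_curr_safely(); return 0; }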

            commit de4030e4d49805a7ded5c0bfee01cc3fd7623522
            Author: Marko Mäkelä <marko.makela@mariadb.com>
            Date:   Tue Jan 24 14:09:21 2023 +0200
                MDEV-30400 Assertion height == btr_page_get_level(...) on INSERT
             
                btr_cur_t::search_leaf(): Replaces btr_cur_search_to_nth_level()
                for searches to level=0 (the leaf level). We will never release
                parent page latches before acquiring leaf page latches. If we need to
                temporarily release the level=1 page latch in the BTR_SEARCH_PREV or
                BTR_MODIFY_PREV latch_mode, we will reposition the cursor on the
                child node pointer so that we will land on the correct leaf page.
            

            There are other patches around this one, including MDEV-27058, that exposed the deadlock for change buffering.

            This deadlock was analyzed in MDEV-33543 and fixed by using the wait logic already present in buf_page_get_gen() instead of the S latch for RW_NO_LATCH. That wait logic is inferior to a normal S latch wait: it is simply a repeated 100-microsecond sleep (which can actually be longer, depending on the platform). The bug was seen in the "change buffering" code path, and the assumption was that this path would rarely be exercised. That judgement was not correct: the path is actually quite frequent and does impact performance when pages are not in the BP and are being loaded by DMLs expanding/shrinking large data.
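
            To illustrate why the sleep loop is inferior (a generic C++ toy, not the InnoDB code): a poll-and-sleep waiter pays at least one full sleep quantum per iteration, and on platforms with coarse sleep granularity the nominal 100 microseconds can become milliseconds, whereas a latch- or condition-variable-style wait is woken as soon as the read completes:

            // Generic illustration, not InnoDB code: waiting for a simulated
            // "page read" to finish by polling with a 100 us sleep versus
            // blocking until notified.
            #include <chrono>
            #include <condition_variable>
            #include <cstdio>
            #include <mutex>
            #include <thread>

            int main()
            {
              using namespace std::chrono;
              std::mutex m;
              std::condition_variable cv;
              bool read_done= false;

              auto start= steady_clock::now();
              std::thread io([&] {               // simulated asynchronous read
                std::this_thread::sleep_for(milliseconds(3));
                { std::lock_guard<std::mutex> g(m); read_done= true; }
                cv.notify_all();
              });

              // Waiter A: poll + sleep, like the RW_NO_LATCH wait logic.
              std::thread poller([&] {
                for (;;)
                {
                  { std::lock_guard<std::mutex> g(m); if (read_done) break; }
                  std::this_thread::sleep_for(microseconds(100)); // may round up
                }
                auto us= duration_cast<microseconds>(steady_clock::now() -
                                                     start).count();
                std::printf("poller noticed completion after %lld us\n",
                            (long long) us);
              });

              // Waiter B: block until notified, like waiting on an S latch.
              std::thread blocker([&] {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [&] { return read_done; });
                auto us= duration_cast<microseconds>(steady_clock::now() -
                                                     start).count();
                std::printf("blocker woke up after %lld us\n", (long long) us);
              });

              io.join(); poller.join(); blocker.join();
              return 0;
            }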

            However, it also implies that the general case could be vulnerable to the hang fixed by MDEV-33543 even without "change buffering" being involved. I have an idea for fixing the problem in a better way: while trying to get a page with RW_NO_LATCH for an "out of order" latch attempt, we could return from buf_page_get_gen() immediately (instead of waiting) if the page is currently being read in, and follow the ordered latching path.
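
            A minimal caller-side sketch of that proposal (toy code; ProbeResult, probe_prev_page and latch_pages_in_order are hypothetical names, not actual MariaDB functions): the RW_NO_LATCH probe reports that a read is in progress instead of sleeping, and the caller simply takes the ordered latching path, which may block but cannot deadlock:

            // Caller-side sketch of the proposed behaviour; not MariaDB code.
            #include <cstdio>

            enum class ProbeResult { Latched, NotInPool, ReadInProgress };

            // Hypothetical fast probe of the previous page (no waiting inside);
            // stands in for the RW_NO_LATCH buf_page_get_gen() call made by
            // btr_latch_prev().
            ProbeResult probe_prev_page() { return ProbeResult::ReadInProgress; }

            // Hypothetical slow path that releases and re-acquires the latches
            // in the canonical order; it may block but cannot deadlock.
            void latch_pages_in_order() { std::puts("ordered latching path"); }

            void latch_prev_for_pessimistic_operation()
            {
              switch (probe_prev_page())
              {
              case ProbeResult::Latched:
                std::puts("fast path: previous page latched out of order");
                break;
              case ProbeResult::NotInPool:      // page must be read first
              case ProbeResult::ReadInProgress: // do not sleep here
                latch_pages_in_order();
                break;
              }
            }

            int main() { latch_prev_for_pessimistic_operation(); return 0; }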

            I am working on the patch and will upload it soon. I hope it fixes the issue.


            debarun Debarun Banerjee added a comment

            monty, the page flusher maintains the percentage of dirty pages. The 90% setting is the allowed amount of dirty (modified but not yet flushed) pages in the BP. Flushing does not evict pages from the BP, so the percentage of pages present in the BP can be higher. We try to maintain a fixed set of "free pages" plus some non-data pages used for other purposes such as locks; apart from that, the rest of the pool contains data pages.

            However, there is always a possibility of hitting that code. The data pages also include undo pages, and in practice, under read/write workload stress with a lot of undo pages, we might still be forced to evict and reload pages and hit this part of the code. I think a lot depends on the scenario.
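
            As a rough worked example (assuming the 90% refers to innodb_max_dirty_pages_pct and a 16 KiB page size, which is my reading rather than something stated here): the 2 GB buffer pool used in the earlier test holds about 131072 pages, so the 90% setting only caps the dirty pages at roughly 118000; it does not limit the total number of resident pages (clean plus dirty, data plus undo), which is why evictions and reloads of data pages can still occur under pressure.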


            debarun Debarun Banerjee added a comment

            I have a tentative patch; I am still testing it.
            Branch: 10.6-MDEV-34458
            Review: https://github.com/MariaDB/server/pull/3382

            wlad, can you please help me validate whether the patch resolves the performance problem?
            thiru, can you please review the patch?

            wlad Vladislav Vaintroub added a comment - edited

            Thanks debarun, I checked, and the performance is back with the patch. Thank you!


            debarun Debarun Banerjee added a comment

            Thanks to thiru for the review.
            Thanks to wlad for identifying the issue and validation.


            People

              debarun Debarun Banerjee
              wlad Vladislav Vaintroub
              Votes: 1
              Watchers: 11

