[MDEV-26004] Excessive wait times in buf_LRU_get_free_block() Created: 2021-06-23  Updated: 2023-06-02  Resolved: 2021-06-24

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5.7, 10.6.0
Fix Version/s: 10.5.12, 10.6.3

Type: Bug Priority: Major
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: performance

Issue Links:
Relates
relates to MDEV-26055 Adaptive flushing is still not gettin... Closed
relates to MDEV-26827 Make page flushing even faster Closed
relates to MDEV-23399 10.5 performance regression with IO-b... Closed
relates to MDEV-25113 Reduce effect of parallel background ... Closed

 Description   

For MDEV-25113, one of the suggested fixes was to speed up buf_LRU_get_free_block() by making it wake up every time a buffer pool block is released to the buf_pool.free list.

At low thread counts, this patch worked as expected, improving throughput and greatly reducing the maximum latency (from about 30 seconds to a few milliseconds) with 4GiB of data, a 4GiB log, and a 1GiB buffer pool, running on NVMe storage. This was observed on both 10.5 and 10.6.

Alas, on my Debian Sid system (Linux 5.10.0), it caused an extreme regression at 1000 concurrent clients, both reducing throughput and increasing latency. That problem needs to be reproduced, understood, and resolved before we can apply this performance improvement.



 Comments   
Comment by Vladislav Vaintroub [ 2021-06-23 ]

Your patch works well for me on Windows. At higher thread counts, it outperforms the baseline by a factor of 3.

Comment by Vladislav Vaintroub [ 2021-06-23 ]

Unfortunately, I do not have access to a benchmarking machine where Linux profiling tools can be installed. Also, this is a Linux problem, where I claim no expertise, but everyone else in development does.

Comment by Marko Mäkelä [ 2021-06-24 ]

I conducted some more analysis today.

Unlike yesterday, today I did not observe any regression for Sysbench oltp_update_index with 1000 concurrent connections, a 4GiB data set, and a 2GiB buffer pool. I had rebooted my workstation, so /dev/shm may have been smaller, or Firefox might not have consumed as much memory as it had by yesterday evening.

After I reduced the buffer pool size to 1GiB or 512MiB, I finally reproduced yesterday’s regression at 1000 concurrent connections. For up to 32 concurrent connections there was always an improvement (roughly halved maximum latency), even with such a small buffer pool. The culprit for the regression appears to be increased contention on buf_pool.mutex.

This benchmark setup may not be representative, because I used very fast NVMe storage and innodb_flush_log_at_trx_commit=0. With a proper durability setting and slower storage (SATA SSD or HDD), I/O latency should dominate.

For the record, I used the following commands to collect profiling information while the workload was running:

sudo offcputime-bpfcc --stack-storage-size=1048576 -df -p $(pgrep -nx mysqld) 30 > out64.stacks
flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out64.stacks > out64.svg

See http://www.brendangregg.com/offcpuanalysis.html for more information.

Comment by Marko Mäkelä [ 2023-06-02 ]

MDEV-26055 and MDEV-26827 in MariaDB Server 10.6.13 improved the situation by making the buf_flush_page_cleaner() thread responsible for LRU flushing and eviction. An improved LRU flushing mode was introduced that lets a block remain in the buffer pool after its write has completed. These changes were observed to significantly reduce contention on buf_pool.mutex.

Generated at Thu Feb 08 09:42:02 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.