  MariaDB Server / MDEV-26004

Excessive wait times in buf_LRU_get_free_block()

Details

    Description

      For MDEV-25113, one of the suggested fixes was to speed up buf_LRU_get_free_block() by making it wake up every time a buffer pool block is released to the buf_pool.free list.
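
      To illustrate the idea, here is a minimal sketch of such a wake-up mechanism built on a condition variable. This is not the actual InnoDB code; the names free_list_cond, release_block() and get_free_block() are made up for this example, and the real implementation uses buf_pool.mutex and InnoDB's own block and list types.

      #include <condition_variable>
      #include <list>
      #include <mutex>

      struct block_t { /* page frame, id, state, ... */ };

      struct buf_pool_t {
        std::mutex mutex;                       // stands in for buf_pool.mutex
        std::condition_variable free_list_cond; // hypothetical; signalled on free
        std::list<block_t*> free_list;          // stands in for buf_pool.free

        // Called whenever a block is released: instead of letting waiters
        // poll or sleep for a fixed interval, wake one of them immediately.
        void release_block(block_t* b) {
          {
            std::lock_guard<std::mutex> g(mutex);
            free_list.push_back(b);
          }
          free_list_cond.notify_one();
        }

        // Stand-in for buf_LRU_get_free_block(): block until a free block
        // appears, instead of sleeping and re-checking.
        block_t* get_free_block() {
          std::unique_lock<std::mutex> lk(mutex);
          free_list_cond.wait(lk, [this] { return !free_list.empty(); });
          block_t* b = free_list.front();
          free_list.pop_front();
          return b;
        }
      };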

      At low thread counts, the patch worked as expected, improving throughput and greatly reducing the maximum latency (say, from 30 seconds to a few milliseconds) with 4GiB of data, a 4GiB log and a 1GiB buffer pool, running on NVMe storage. This was observed on both 10.5 and 10.6.

      Alas, on my Debian Sid system (Linux 5.10.0), it caused an extreme regression at 1000 concurrent clients, both reducing the throughput and increasing the latency. That problem needs to be reproduced, understood and resolved before we can apply this performance improvement.


          Activity

            marko Marko Mäkelä created issue -
            wlad Vladislav Vaintroub added a comment - - edited

            Your patch works well for me on Windows. At higher thread counts, the patch outperforms the baseline by a factor of 3.

            wlad Vladislav Vaintroub made changes -
            Assignee: Vladislav Vaintroub [ wlad ] → Marko Mäkelä [ marko ]

            wlad Vladislav Vaintroub added a comment -

            Unfortunately, I do not have access to a benchmarking machine where Linux profiling tools can be installed. Also, this is a Linux problem, where I claim no expertise, but everyone else in development does.

            marko Marko Mäkelä made changes -
            Status: Open [ 1 ] → In Progress [ 3 ]

            marko Marko Mäkelä added a comment -

            I conducted some more analysis today.

            Unlike yesterday, today I did not observe any regression for Sysbench oltp_update_index with 1000 concurrent connections, a 4GiB data set and a 2GiB buffer pool. I had rebooted my workstation, so /dev/shm may have contained less data, or Firefox may not have been consuming as much memory as it was by yesterday evening.

            After I reduced the buffer pool size to 1GiB or 512MiB, I finally reproduced yesterday’s regression at 1000 concurrent connections. For up to 32 concurrent connections there was always an improvement (the maximum latency roughly halved), even with such a small buffer pool. The culprit for the regression appears to be increased contention on buf_pool.mutex.
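
            To illustrate why per-block wake-ups can hurt at this scale, the toy program below (not InnoDB code; the thread and iteration counts are arbitrary) has many waiters sharing a single mutex and being woken one release at a time. Every woken thread must reacquire that mutex before it can do anything, so with on the order of 1000 waiters and a small free list the mutex itself becomes the hot spot, which is consistent with increased contention on buf_pool.mutex.

            #include <condition_variable>
            #include <mutex>
            #include <thread>
            #include <vector>

            int main() {
              std::mutex m;                  // stands in for buf_pool.mutex
              std::condition_variable cv;    // signalled once per released block
              long free_blocks = 0;

              std::vector<std::thread> waiters;
              for (int i = 0; i < 1000; i++)          // ~1000 concurrent clients
                waiters.emplace_back([&] {
                  for (int j = 0; j < 100; j++) {
                    std::unique_lock<std::mutex> lk(m);
                    // Every wake-up serializes here: the waiter cannot proceed
                    // until it owns the mutex again.
                    cv.wait(lk, [&] { return free_blocks > 0; });
                    --free_blocks;
                  }
                });

              std::thread producer([&] {             // stands in for block release
                for (long j = 0; j < 1000L * 100; j++) {
                  { std::lock_guard<std::mutex> g(m); ++free_blocks; }
                  cv.notify_one();                    // one wake-up per block
                }
              });

              producer.join();
              for (auto& t : waiters) t.join();
              return 0;
            }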

            This benchmark setup may not be representative, because I used very fast NVMe storage and innodb_flush_log_at_trx_commit=0. With a proper durability setting and slower storage (SATA SSD or HDD), I/O latency should dominate.

            For the record, I used the following commands to collect profiling information while the workload was running:

            sudo offcputime-bpfcc --stack-storage-size=1048576 -df -p $(pgrep -nx mysqld) 30 > out64.stacks
            flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out64.stacks > out64.svg
            

            See http://www.brendangregg.com/offcpuanalysis.html for more information.

            marko Marko Mäkelä made changes -
            Resolution date: 2021-06-24 08:46:44
            Fix Version/s: 10.5.12 [ 26025 ], 10.6.3 [ 25904 ] (was: 10.5 [ 23123 ], 10.6 [ 24028 ])
            Resolution: Fixed [ 1 ]
            Status: In Progress [ 3 ] → Closed [ 6 ]
            serg Sergei Golubchik made changes -
            Workflow: MariaDB v3 [ 122968 ] → MariaDB v4 [ 159436 ]

            marko Marko Mäkelä added a comment -

            MDEV-26055 and MDEV-26827 in MariaDB Server 10.6.13 improved the situation by making the buf_flush_page_cleaner() thread responsible for LRU flushing and eviction. An improved LRU flushing mode was introduced that lets the block remain in the buffer pool after the write has completed. These changes were observed to significantly reduce contention on buf_pool.mutex.
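
            Conceptually (this is not the actual buf_flush_page_cleaner() code; the type and function names below are made up), the difference is that completing a write no longer implies eviction:

            // Old behaviour: a dirty block flushed from the LRU tail was also
            // evicted, so a later access to the same page needed a re-read.
            // Improved mode: the write completion only marks the block clean
            // and leaves it in the buffer pool; eviction is a separate step,
            // and only the page cleaner thread performs LRU flushing and
            // eviction, keeping user threads off buf_pool.mutex for this work.
            enum class lru_op { flush_and_evict, flush_and_keep };

            struct page_t {
              bool dirty;
              // frame, page id, state, ...
            };

            void evict(page_t&) {
              // hypothetical: return the block to the free list (buf_pool.free)
            }

            void on_write_completed(page_t& page, lru_op mode) {
              page.dirty = false;              // the copy on disk is now current
              if (mode == lru_op::flush_and_evict)
                evict(page);                   // old behaviour: free the block now
              // flush_and_keep: nothing more to do; the block stays cached
            }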


            People

              Assignee: Marko Mäkelä (marko)
              Reporter: Marko Mäkelä (marko)
              Votes: 0
              Watchers: 3
