MariaDB Server
MDEV-29445: reorganise innodb buffer pool (and remove buffer pool chunks)

Description

      copied from MDEV-25341:

      • The buf_pool.free list, as well as the buffer pool blocks that serve as backing store for the AHI or lock_sys, could be doubly linked with each other via bytes allocated within the page frame itself. We would not need a dummy buf_page_t for such blocks.
      • We could allocate a contiguous virtual address range for the maximum supported size of the buffer pool and let the operating system physically allocate a subset of those addresses. The complicated logic of having multiple buffer pool chunks could then be removed. On 32-bit architectures, the maximum size would be about 2 GiB; on 64-bit architectures, the virtual address space is often 48 bits (about 256 TiB). Perhaps we could shift some burden to the user and introduce a startup parameter innodb_buffer_pool_size_max.

Activity

            I ran a simple performance test on RAM disk on a dual Intel® Xeon® Gold 6230R (26×2 threads per socket), with innodb_buffer_pool_size=5G and innodb_log_file_size=5G:

            sysbench oltp_update_index --tables=100 --table_size=10000 --threads=100 --time=120 --report-interval=5 --max-requests=0 run
            

            Compared to the baseline, I observed a 2% regression in average throughput. My first suspect would be the lazy initialization of the buffer pool (MDEV-25340), which is part of this change, but I have not analyzed it in more depth yet.

            I also tested crash recovery by killing the workload about 115 seconds in (5 seconds before it would have ended) and measuring the time to recover a copy of that data directory, using two settings for innodb_buffer_pool_size: 1 GiB (requiring 2 recovery batches) and 5 GiB (682,236,800 bytes of log processed in 1 batch). The recovery times of the baseline and the patch were very similar. I will have to repeat this experiment after diagnosing and addressing the performance regression during the workload.

            marko Marko Mäkelä added a comment

            Some regression right after server startup could be expected due to the lazy initialization of the buffer pool. However, once the entire buffer pool corresponding to the working set of the workload has been fully initialized, we would expect no further buf_pool_t::lazy_allocate() calls to take place. According to perf record -g, only 2.13% of total samples are recorded in buf_page_get_low(), with 0.83% of total samples attributed to waiting for a shared buf_page_t::lock there. All other buffer-pool-related functions account for less than 0.02% of the samples each, probably less than 0.2% of the total combined. Because this is a 10.6-based development branch without MDEV-27774, a write-heavy benchmark such as sysbench oltp_update_index would be dominated by contention on log_sys.mutex. I will move on to a read-only benchmark.

            marko Marko Mäkelä added a comment

            I am observing a small performance improvement with

            sysbench oltp_read_only --tables=100 --table_size=100000 --threads=100 --time=120 --report-interval=5 --max-requests=0 run

            Both for the baseline and the patched version, the first of two consecutive runs is faster. The patch improves throughput (queries per second) by 775537.75/769316.38 − 1 ≈ 0.80% for the first run and by 761852.58/758954.99 − 1 ≈ 0.38% for the second run.

            My benchmark setup is far from reliable. To get more stable numbers, it would help to pin the mariadbd process to a single NUMA node and to disable hyperthreading as well. This can be considered at most as a sanity check before running a broader set of performance tests.

            marko Marko Mäkelä added a comment

            After some discussion with wlad, I decided to check whether this actually depends on lazy buffer pool initialization (MDEV-25340), which I had implemented in my development branch. It seems that the logic in many places would be simpler if that change were reverted. I posted some performance test results to MDEV-25340. With a 96 GiB buffer pool allocated in 1 GiB MMU pages on a dual Haswell/Broadwell Xeon that has 2×64 GiB of RAM, we are talking about possibly halving the start-up time, but the starting point was less than 1 second. I think that a startup time of about 10 ms/GB should be acceptable.

            Today, I removed the lazy allocation. Two loops on startup and resize are now invoking block_descriptors_in_bytes, which makes them extremely slow. I will fix that next week. Then, hopefully, this task is practically done. In the stress testing so far, mleich has reported one mystery crash that does not reproduce under rr. It might end up having been fixed by today’s cleanup.

            marko Marko Mäkelä added a comment

            Today, after fixing the loops, I retested the MDEV-25340 scenario. Start-up was significantly slower than with lazy initialization, but still more or less acceptable.

            I believe that this should ideally target the 10.11 branch, for the following reasons:

            • This fixes a race condition in the adaptive hash index and therefore fixes MDEV-35485.
            • This might make obsolete the MDEV-24670 interface, which has been somewhat problematic and appeared in 10.11.
            marko Marko Mäkelä added a comment

            People

              debarun Debarun Banerjee
              danblack Daniel Black
              Votes: 1
              Watchers: 11

