MDEV-15058: Remove multiple InnoDB buffer pool instances

Details

    Description

      This came up during the MDEV-15016 review.

      I started to wonder whether multiple InnoDB buffer pools actually help with any workloads. Yes, it probably was a good idea to split the buffer pool mutex when Inaam Rana introduced multiple buffer pools in MySQL 5.5.5, but since then, there have been multiple fixes to reduce contention on the buffer pool mutex, such as Inaam's follow-up fix in MySQL 5.6.2 to use rw-locks instead of mutexes for the buf_pool->page_hash.
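      As a standalone illustration of that kind of fix (hypothetical names, not the InnoDB page hash code): a page hash protected by a reader-writer lock lets concurrent lookups proceed in parallel, while only insertions and removals take the exclusive latch.

      // Standalone sketch (hypothetical names, not the InnoDB page hash):
      // lookups take a shared latch and can run concurrently; only
      // insertions and removals need the exclusive latch.
      #include <cstdint>
      #include <cstdio>
      #include <shared_mutex>
      #include <unordered_map>

      struct page_t { uint64_t id; };

      class PageHash {
      public:
        page_t* lookup(uint64_t page_id) const {
          std::shared_lock<std::shared_mutex> lock(latch_);   // many readers at once
          auto it = map_.find(page_id);
          return it == map_.end() ? nullptr : it->second;
        }
        void insert(uint64_t page_id, page_t* page) {
          std::unique_lock<std::shared_mutex> lock(latch_);   // exclusive for writers
          map_.emplace(page_id, page);
        }
      private:
        mutable std::shared_mutex latch_;
        std::unordered_map<uint64_t, page_t*> map_;
      };

      int main() {
        PageHash hash;
        page_t page{42};
        hash.insert(page.id, &page);
        std::printf("found page 42: %s\n", hash.lookup(42) ? "yes" : "no");
      }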

      In MySQL 8.0.0, Shaohua Wang implemented one more thing that MariaDB should copy: MDEV-15053 Split buf_pool_t::mutex.

      I think that we should seriously consider removing all code to support multiple buffer pools or page cleaners.
      Should multiple buffer pools be needed in the future (for example, on NUMA machines), it should be designed better from the ground up. Currently the partitioning is arbitrary; buffer pool membership is basically determined by a hash of the page number.
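      For reference, the mapping is roughly the following. This is a simplified, standalone sketch modelled on buf_pool_get(); the real fold function and types differ slightly. The page number is shifted so that a 64-page read-ahead area stays within one instance, then hashed together with the tablespace id and reduced modulo the number of instances.

      // Simplified, standalone sketch of the page-to-instance mapping
      // (modelled on buf_pool_get(); the real fold function and types
      // differ slightly).
      #include <cstdint>
      #include <cstdio>

      struct buf_pool_t { /* LRU list, free list, flush list, mutex, ... */ };

      static const unsigned long srv_buf_pool_instances = 4;
      static buf_pool_t buf_pool_ptr[srv_buf_pool_instances];

      static buf_pool_t* buf_pool_get_sketch(uint32_t space_id, uint32_t page_no) {
        // Keep each 64-page read-ahead area within a single instance,
        // then hash the (space_id, page_no) pair and pick an instance.
        uint64_t fold = (uint64_t{space_id} << 20) + space_id + (page_no >> 6);
        return &buf_pool_ptr[fold % srv_buf_pool_instances];
      }

      int main() {
        // Pages 0..63 of tablespace 5 land in one instance; page 64 may not.
        std::printf("(5,0)  -> instance %ld\n", (long)(buf_pool_get_sketch(5, 0) - buf_pool_ptr));
        std::printf("(5,63) -> instance %ld\n", (long)(buf_pool_get_sketch(5, 63) - buf_pool_ptr));
        std::printf("(5,64) -> instance %ld\n", (long)(buf_pool_get_sketch(5, 64) - buf_pool_ptr));
      }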

      The description of WL#6642 (InnoDB: multiple page_cleaner threads) seems to imply that it may have been a mistake to partition the buffer pool.

      Note: partitioning or splitting mutexes often seems to be a good idea. But partitioning data structures or threads might not be.

      axel, please test different workloads with innodb_buffer_pool_instances=1 and innodb_page_cleaners=1, and compare the performance to configurations that use multiple buffer pools (and page cleaners). If using a single buffer pool instance never seems to cause any regression, I think that we should simplify the code.

      Attachments

        1. 1bp.txt
          81 kB
        2. 4bp.txt
          86 kB
        3. MDEV-15058.ods
          82 kB
        4. MDEV-15058.pdf
          56 kB
        5. MDEV-15058-10.4.10.ods
          88 kB
        6. MDEV-15058-10.4vs10.5.ods
          140 kB
        7. MDEV-15058-10.5.ods
          88 kB
        8. MDEV-15058-10.5-34dafb7e3a8.ods
          49 kB
        9. MDEV-15058-10.5-dev.ods
          68 kB
        10. MDEV-15058-B.ods
          77 kB
        11. MDEV-15058-B.pdf
          51 kB
        12. MDEV-15058-RAM-ARM.ods
          77 kB
        13. MDEV-15058-RAM-Intel.ods
          82 kB
        14. MDEV-15058-singleBP.ods
          51 kB
        15. MDEV-15058-SSD-ARM.ods
          79 kB
        16. MDEV-15058-SSD-Intel.ods
          80 kB
        17. MDEV-15058-thiru.ods
          73 kB
        18. MDEV-15058-thiru.pdf
          53 kB
        19. MDEV-15058-tpcc.ods
          45 kB
        20. ramdisk-ro1.svg
          428 kB
        21. ramdisk-ro4.svg
          395 kB
        22. ramdisk-rw1.svg
          782 kB
        23. ramdisk-rw4.svg
          581 kB

          Activity

            So, I ran this benchmark, which I think resembles axel's "sweetspot" closely enough.

            my.cnf

            [mysqld]
             
            #####non innodb options
            max_connections = 300
            table_open_cache = 600
            query_cache_type = 0
             
            #####innodb options
            innodb_buffer_pool_size = 1G
            innodb_log_buffer_size = 32M
            innodb_log_file_size = 512M
            innodb_flush_log_at_trx_commit = 2
            innodb_doublewrite = 0
             
            loose-innodb_adaptive_hash_index_partitions = 32
            loose-innodb_adaptive_hash_index_parts = 32
             
            #####SSD
            innodb-flush-method = O_DIRECT
            innodb_io_capacity = 4000
            loose-innodb_flush_neighbors = 0
            innodb_write_io_threads = 8
             
            #####the variables for this test
            innodb_buffer_pool_instances = 1
             
            innodb_max_dirty_pages_pct = 99
            skip-innodb_adaptive_hash_index
            skip-innodb-stats-persistent
            innodb-change-buffering=none
            innodb_file_per_table = 0
            

            script to run with sysbench 1.0

            sysbench --test=/usr/share/sysbench/oltp_update_index.lua   --tables=32 --table-size=1250000  --rand-seed=42 --rand-type=uniform --num-threads=32 --report-interval=2  --mysql-socket=/tmp/mysql.sock --time=300  --max-requests=0 --mysql-user=root --percentile=95 $1
            

            where $1 is either "prepare" or "run" (you need to have a database called sbtest)

            A note on the benchmark itself: it uses a very low buffer pool to data size ratio (I believe the data would be around 8-10 GB if it were in files rather than in ibdata1) and only a 1 GB buffer pool, so it is designed to be I/O intensive. It uses only about 2 of the 56 CPUs on the benchmark machine (and the difference between 1 and 4 buffer pools was not obvious in "top").

            For the benchmarks, I ran the server with innodb_buffer_pool_instances set to either 1 or 4.

            4 buffer pools win against 1 buffer pool, with about 9000 tps versus around 6000 tps. At least we can say that whatever Axel found is reproducible for this use case.

            I attached the pt-pmp output as 1bp.txt (single buffer pool instance) and 4bp.txt (4 instances), made from 20 samples separated by a 10-second delay. (If someone knows a more modern tool for profiling contention, please tell.)

            From that, I grepped for TTAS to find the lines with InnoDB mutexes (but please also take a look at anything else; maybe I missed something).

            I think the contention might be on the buffer pool mutex in buf_page_io_complete() (buf0buf.cc:6019); at least it appears rather often in 1bp.txt in a couple of different call stacks. Here is the code in question:
            https://github.com/MariaDB/server/blob/f3dac591747dfbd88bd8ae2855f9a0e64006ce75/storage/innobase/buf/buf0buf.cc#L6019
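
            The effect itself is easy to reproduce outside InnoDB. Below is a standalone sketch (a synthetic workload, not InnoDB code): worker threads simulate I/O completions that each briefly hold a "buffer pool" mutex chosen by a hash of the page number. With INSTANCES = 1 every completion serializes on the same mutex, which is essentially the single-instance situation; raising INSTANCES spreads the critical sections over several mutexes, which is what multiple buffer pool instances buy.

            // Standalone sketch (synthetic workload, not InnoDB code): threads
            // simulating I/O completions that each briefly hold a "buffer pool"
            // mutex chosen by a hash of the page number. Compare INSTANCES = 1
            // against INSTANCES = 4.
            #include <chrono>
            #include <cstdint>
            #include <cstdio>
            #include <mutex>
            #include <thread>
            #include <vector>

            static const unsigned INSTANCES = 1;
            static const unsigned THREADS = 32;
            static const unsigned OPS_PER_THREAD = 200000;

            static std::mutex pool_mutex[INSTANCES];
            static uint64_t pages_completed[INSTANCES];

            static void io_completer(unsigned thread_id) {
              for (unsigned i = 0; i < OPS_PER_THREAD; i++) {
                uint32_t page_no = thread_id * OPS_PER_THREAD + i;
                unsigned instance = page_no % INSTANCES;      // "buf_pool_get()"
                std::lock_guard<std::mutex> lock(pool_mutex[instance]);
                pages_completed[instance]++;                  // bookkeeping under the mutex
              }
            }

            int main() {
              auto start = std::chrono::steady_clock::now();
              std::vector<std::thread> threads;
              for (unsigned t = 0; t < THREADS; t++) threads.emplace_back(io_completer, t);
              for (auto& th : threads) th.join();
              auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
              std::printf("%u instance(s): %lld ms\n", INSTANCES, (long long)ms);
            }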

            wlad Vladislav Vaintroub added a comment

            I ported the changes to 10.5. Given that our benchmark was extreme and that normally the doublewrite buffer would be a scalability bottleneck for write-intensive workloads, I think that it should be an acceptable change. In most cases, a single buffer pool performed at least as well as multiple ones. Removing the code to handle multiple buffer pool instances could slightly improve the overall performance and open up opportunities to make more use of std::atomic (in MDEV-15053).
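
            As a generic sketch of what "more use of std::atomic" can look like (illustrative names, not an actual patch): a statistics counter that used to be bumped under the buffer pool mutex can become a relaxed atomic, leaving the mutex for the structures that genuinely need it.

            // Generic sketch (illustrative names, not an actual patch).
            #include <atomic>
            #include <cstdint>
            #include <cstdio>
            #include <mutex>

            // Before: every read completion takes the buffer pool mutex just to
            // bump a counter, in addition to the list maintenance it has to do.
            struct buf_pool_locked {
              std::mutex mutex;
              uint64_t n_pages_read = 0;
              void read_completed() {
                std::lock_guard<std::mutex> lock(mutex);
                ++n_pages_read;
              }
            };

            // After: the counter is a relaxed atomic; the mutex remains only for
            // the structures that need it (LRU list, flush list, ...).
            struct buf_pool_atomic {
              std::atomic<uint64_t> n_pages_read{0};
              void read_completed() {
                n_pages_read.fetch_add(1, std::memory_order_relaxed);
              }
            };

            int main() {
              buf_pool_locked a;
              buf_pool_atomic b;
              a.read_completed();
              b.read_completed();
              std::printf("locked=%llu atomic=%llu\n",
                          (unsigned long long) a.n_pages_read,
                          (unsigned long long) b.n_pages_read.load());
            }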

            marko Marko Mäkelä added a comment

            I did some more testing, checked the buildbot results (no failures), and fixed a bug that was caught in the MDEV-12353 branch only. Pushed to 10.5.

            marko Marko Mäkelä added a comment - edited

            Based on feedback from wlad, I reverted the changes to some INFORMATION_SCHEMA.INNODB_ tables. We will return a dummy buffer pool identifier 0 for compatibility.

            marko Marko Mäkelä added a comment

            The slightly increased contention in buf_page_io_complete() in the write-heavy workloads when moving from 4 buffer pool instances to 1 would likely not be helped by MDEV-15053. That function would still acquire buf_pool_t::mutex (which was renamed to buf_pool_t::LRU_list_mutex).

            MDEV-15053 did not show improved performance when wlad was testing it. To keep the latching rules more understandable and to avoid race conditions, it might be best to omit most of those changes.
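
            To illustrate the latching-rules concern with a generic, standalone sketch (hypothetical methods, not the MDEV-15053 patch): once one mutex is split into several, every code path must acquire them in a single documented order, or concurrent paths can deadlock.

            // Generic sketch (hypothetical methods): after splitting one mutex
            // into several, a fixed latching order must be documented and
            // followed everywhere, e.g. "LRU list mutex before flush list
            // mutex, never the other way around".
            #include <mutex>

            struct buf_pool_split {
              std::mutex LRU_list_mutex;    // protects the LRU list
              std::mutex flush_list_mutex;  // protects the flush list

              // Correct: follows the documented latching order.
              void move_page_ok() {
                std::lock_guard<std::mutex> lru(LRU_list_mutex);
                std::lock_guard<std::mutex> flush(flush_list_mutex);
                // ... relink the page in both lists ...
              }

              // Wrong: opposite order; running this concurrently with
              // move_page_ok() can deadlock.
              void move_page_deadlock_prone() {
                std::lock_guard<std::mutex> flush(flush_list_mutex);
                std::lock_guard<std::mutex> lru(LRU_list_mutex);
                // ...
              }
            };

            int main() {
              buf_pool_split pool;
              pool.move_page_ok();  // the deadlock-prone variant is only shown for contrast
            }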

            marko Marko Mäkelä added a comment

            People

              Assignee: Marko Mäkelä (marko)
              Reporter: Marko Mäkelä (marko)
              Votes: 0
              Watchers: 18

