MariaDB Server / MDEV-15058

Remove multiple InnoDB buffer pool instances

Details

    Description

      This came up during the MDEV-15016 review.

      I started to wonder whether multiple InnoDB buffer pools actually help with any workloads. Yes, it probably was a good idea to split the buffer pool mutex when Inaam Rana introduced multiple buffer pools in MySQL 5.5.5, but since then, there have been multiple fixes to reduce contention on the buffer pool mutex, such as Inaam's follow-up fix in MySQL 5.6.2 to use rw-locks instead of mutexes for the buf_pool->page_hash.

      In MySQL 8.0.0, Shaohua Wang implemented one more thing that MariaDB should copy: MDEV-15053 Split buf_pool_t::mutex.

      I think that we should seriously consider removing all code to support multiple buffer pools or page cleaners.
      Should multiple buffer pools be needed in the future (for example, on NUMA machines), it should be designed better from the ground up. Currently the partitioning is arbitrary; buffer pool membership is basically determined by a hash of the page number.

      The description of WL#6642: InnoDB: multiple page_cleaner threads seems to imply that it may have been a mistake to partition the buffer pool.

      Note: partitioning or splitting mutexes often seems to be a good idea. But partitioning data structures or threads might not be.

      axel, please test different workloads with innodb_buffer_pool_instances=1 and innodb_page_cleaners=1, and compare the performance to configurations that use multiple buffer pools (and page cleaners). If using a single buffer pool instance never seems to cause any regression, I think that we should simplify the code.

      Attachments

        1. 1bp.txt
          81 kB
        2. 4bp.txt
          86 kB
        3. MDEV-15058.ods
          82 kB
        4. MDEV-15058.pdf
          56 kB
        5. MDEV-15058-10.4.10.ods
          88 kB
        6. MDEV-15058-10.4vs10.5.ods
          140 kB
        7. MDEV-15058-10.5.ods
          88 kB
        8. MDEV-15058-10.5-34dafb7e3a8.ods
          49 kB
        9. MDEV-15058-10.5-dev.ods
          68 kB
        10. MDEV-15058-B.ods
          77 kB
        11. MDEV-15058-B.pdf
          51 kB
        12. MDEV-15058-RAM-ARM.ods
          77 kB
        13. MDEV-15058-RAM-Intel.ods
          82 kB
        14. MDEV-15058-singleBP.ods
          51 kB
        15. MDEV-15058-SSD-ARM.ods
          79 kB
        16. MDEV-15058-SSD-Intel.ods
          80 kB
        17. MDEV-15058-thiru.ods
          73 kB
        18. MDEV-15058-thiru.pdf
          53 kB
        19. MDEV-15058-tpcc.ods
          45 kB
        20. ramdisk-ro1.svg
          428 kB
        21. ramdisk-ro4.svg
          395 kB
        22. ramdisk-rw1.svg
          782 kB
        23. ramdisk-rw4.svg
          581 kB


          Activity

            marko Marko Mäkelä created issue -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]
            axel Axel Schwenke made changes -
            Status Confirmed [ 10101 ] In Progress [ 3 ]
            axel Axel Schwenke made changes -
            Attachment MDEV-15058.ods [ 45112 ]
            Attachment MDEV-15058.pdf [ 45113 ]
            axel Axel Schwenke added a comment -

            I did a range of tests on two machines.

            1. my main Intel benchmark machine (16 cores, 32 hw threads)
            2. the ARM server (46 cores)

            The MariaDB version used was 10.3.4 (built locally). The benchmark was sysbench in different variations and workloads (see the single sheets).

            The results are a mixed bag. On Intel it looks like multiple BP partitions don't help with performance, certainly not for the INSERT workload. Those are sheets 1 and 2.

            Sheets 3 and 4 are for Intel, sysbench OLTP with a varying percentage of writes. Here it looks like we get small benefits for read-only, but the more writes are done and the more BP partitions we have, the worse things get.

            Sheet 5 is ARM, sysbench OLTP ro/rw and wo. Here we have no clear verdict. It seems that 16 or 32 buffer pools do indeed give a benefit.

            Sheet 6 is not about buffer pool partitions, but AHI partitions. This is quite clear: increasing AHI partitions up to 32 is good for performance. Actually I ran this test first and used 32 for the other tests.

            About the attached files: LibreOffice messes up the conditional formatting of the cells, so I attach the sheet also as PDF. The cells in the "throughput per used core" tables are color-coded. Red means "more than 1% slower than 1 partition", green means "more than 1% faster than 1 partition".

            "Throughput per used core" - this is system throughput (qps) divided by min(benchmark threads, available hw threads). On a perfecly scaling system it would give the same number independent from benchmark thread count.

            axel Axel Schwenke made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            serg Sergei Golubchik made changes -
            Sprint 10.3.5-1 [ 229 ]
            axel Axel Schwenke made changes -
            Attachment MDEV-15058-B.ods [ 45173 ]
            Attachment MDEV-15058-B.pdf [ 45174 ]
            axel Axel Schwenke added a comment -

            Attached new results in MDEV-15058-B.{pdf,ods}.

            Those numbers are for MariaDB 10.3, commit c0d5d7c0.

            Again there are results for Intel (first 4 sheets) and ARM (last sheet). For ARM the situation is rather clear: multiple buffer pools have a slight negative impact on performance. Also clear is the situation for Intel and INSERT-only workload.

            axel Axel Schwenke made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]

            marko Marko Mäkelä added a comment -

            axel, can you please test innodb_buffer_pool_instances=1 with MDEV-15158 and MDEV-15053? Both are present in bb-10.3-thiru, which is based on 10.3 (slightly newer than c0d5d7c0, which you tested).
            To work around MDEV-15246 you will have to revert the problematic commit:

            git fetch origin
            git checkout 52be75a682fb79c0a0d86bf3011aaeeb1f647d64
            git revert bc7a1dc1fbd27e6064d3b40443fe242397668af
            

            It is MDEV-15053 that should reduce buf_pool->mutex contention, thus hopefully ‘entitling’ us to revert back to a single buffer pool.
            MDEV-15158 merely removes the remaining writes to the TRX_SYS page. If you are not using the binlog or Galera, you should not see any effect of that.

            axel Axel Schwenke made changes -
            Attachment MDEV-15058-thiru.ods [ 45238 ]
            Attachment MDEV-15058-thiru.pdf [ 45239 ]
            axel Axel Schwenke added a comment -

            I added two more sheets (again: .ods and .pdf) with numbers for commit 2b97d02 - further referenced as 10.3.5-thiru. They show system and per-core throughput with color coding as before: red cells indicate performance more than 1% worse than a single buffer pool, green cells indicate performance more than 1% better than a single buffer pool.
            I also added two sheets comparing 10.3.4 with 10.3.5-thiru for one buffer pool. For some use cases 10.3.5-thiru is 20% faster. This is more pronounced on the ARM machine (it has more cores) and for workloads with more writes.
            As for the number of buffer pools: for read-only workloads and on Intel I see some benefits in the order of 2% for using multiple buffer pools.

            axel Axel Schwenke made changes -
            Attachment MDEV-15058-tpcc.ods [ 45240 ]
            axel Axel Schwenke added a comment -

            Added numbers for TPC-C - an OLTP-type benchmark. Specifically this is the TPC-C implementation named "HammerDB". The workload includes reads and writes and gives two numbers: system TPM = system throughput in transactions per minute and NOPM = new orders per minute. The NOPM number is comparable between different databases; the TPM number is specific to a DBMS.
            As for the results: with 2 or 4 buffer pools, not much changes. With 16 buffer pools there is a visible decrease in performance.
            This was run on Intel (no HammerDB for ARM) and with the 10.3.5-thiru build.

            axel Axel Schwenke made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]

            marko Marko Mäkelä added a comment -

            Thank you, axel!

            Since the 10.3.4 release there have been quite a few performance improvements, most notably MDEV-15104, MDEV-15158, MDEV-15132, MDEV-15059.
            In MDEV-15058-tpcc.ods, a single buffer pool seems to generally win for lower concurrency, and the biggest improvement with multiple buffer pools is at most 1%. There is only one outlier: with 1 user, 4 buffer pool instances seem to give a 6.7% improvement.

            Similarly, with MDEV-15058-thiru.pdf it is clear that 4 buffer pool instances is the sweet spot for the test hardware, giving up to 4.5% improvement with 1 client connection. With more connections, the difference goes down, but not completely.

            Meanwhile, there is the separate branch bb-10.3-MDEV-15053 (commit 24deb3737f60c8aea1781c4dd244322f0066b197 based on the 10.3 commit 988ec800edb3dd9238b6f3948157d21bdb0c083b). I believe that its performance should be similar to 10.3.5-thiru (commit 2b97d026623d1928ce61752ef13ca1c7fb77b4e7) which was tested.

            I have the feeling that removing the code to deal with multiple buffer pools might not improve performance enough to beat innodb_buffer_pool_instances=4 in write-heavy benchmarks on the test system. Write performance could be improved by changing the page flushing algorithms and data structures.


            marko Marko Mäkelä added a comment -

            Removing multiple buffer pools necessarily means removing multiple page cleaner threads as well.
            Also, InnoDB (and MariaDB) would no longer depend on libnuma.
            I did a quick refactoring (based on the latest 10.3 and MDEV-15053) and pushed to bb-10.3-MDEV-15058 for review (by thiru) and benchmarking.

            marko Marko Mäkelä added a comment - - edited

            I created another version of this patch. It would be interesting to benchmark it against the base version, to see if we really need MDEV-15053 in order to remove multiple buffer pools.
            bb-10.3-MDEV-15058-2 against "plain 10.3"
            bb-10.3-MDEV-15058 also contains MDEV-15053

            serg Sergei Golubchik made changes -
            Sprint 10.3.5-1 [ 229 ]
            axel Axel Schwenke made changes -
            Attachment MDEV-15058-singleBP.ods [ 45306 ]
            axel Axel Schwenke added a comment -

            I added a new spread sheet MDEV-15058-singleBP. This compares MariaDB 10.3.5 with the two buildbot trees: bb-10.3-mdev-15058 and bb-10.3-mdev-15058-2. The benchmark is sysbench OLTP. There are two sheets for two different architectures:

            1. my Intel machine (16 cores, 32 hw threads). This has 10.3.5 numbers only for 4 buffer pools. Here all 3 contenders behave very much the same for read-only or read-mostly workloads. For the write-intensive workload, 10.3.5 is fastest.

            2. the ARM server (46 cores). Here I ran the test for 1..64 buffer pools. Again, for read-only and read-write there are only small differences, and the number of buffer pools doesn't matter much. For the write-only workload two buffer pools give the best results for 10.3.5, but the bb-10.3-mdev-15058 tree performs equally well or better.


            marko Marko Mäkelä added a comment -

            axel, the latest benchmarks seem to be an "approval" to go ahead and remove multiple buffer pools.

            However, I wonder how big the workload size is compared to the buffer pool size.

            I would like to ask for one more set of write-heavy benchmarks, comparing to multiple buffer pools, and the buffer pool size being about 10% of the workload size. The minimum for using multiple innodb_buffer_pool_instances is innodb_buffer_pool_size=1g. The writes could be updates with uniformly distributed keys, so that a large number of pages get dirtied, instead of dirtying only a few pages.

            The reason for this kind of benchmark is that the buffer pool mutex(es) are acquired mostly when pages are flushed, evicted or loaded to the buffer pool. If the workload mostly fits in the buffer pool, then there should not be that much load on the buffer pool.

            There is a concern that with a single buffer pool and a single page cleaner (there cannot be more page cleaner threads than buffer pool instances) the flushing rate would be too small to saturate an SSD, and this could limit the performance of write-heavy workloads.


            laurynas Laurynas Biveinis added a comment -

            Consider separating LRU flushing from flush list flushing, which are not really related one to another anyway: https://www.percona.com/blog/2016/05/05/percona-server-5-7-multi-threaded-lru-flushing/ This way with a single instance you'd have two flusher threads.

            If their CPU priority is high, flushing should never be bounded by a lack of threads.
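
            As a rough illustration of that split (assumed structure only, not Percona's or InnoDB's actual code): one buffer pool instance served by two dedicated flusher threads.

            #include <atomic>
            #include <chrono>
            #include <thread>

            static std::atomic<bool> running{true};

            // Keeps free pages available by flushing/evicting from the tail of the LRU list.
            static void lru_flusher()
            {
              while (running.load())
              {
                // ... flush dirty pages from the LRU tail, move clean ones to the free list ...
                std::this_thread::sleep_for(std::chrono::milliseconds(100));
              }
            }

            // Advances the checkpoint by flushing pages in oldest-modification (LSN) order.
            static void flush_list_flusher()
            {
              while (running.load())
              {
                // ... flush pages from the flush list ...
                std::this_thread::sleep_for(std::chrono::milliseconds(100));
              }
            }

            int main()
            {
              std::thread lru(lru_flusher), fl(flush_list_flusher);
              std::this_thread::sleep_for(std::chrono::seconds(1));
              running.store(false);
              lru.join();
              fl.join();
              return 0;
            }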

            axel Axel Schwenke made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            axel Axel Schwenke made changes -
            Attachment MDEV-15058-RAM-Intel.ods [ 45344 ]
            axel Axel Schwenke added a comment - - edited

            New results added: MDEV-15058-RAM-Intel.ods

            This is with a completely different setup:

            • datadir in /dev/shm
            • 1G buffer pool, 10G data set
            • sysbench OLTP with 50, 80, 100% writes
            • max dirty percent = 99, dblwrite disabled

            Now there is a significant regression when switching to a single buffer pool. The number of page cleaners also has an impact, but a much smaller one. The experimental "single buffer pool" trees behave better than 10.3.5 with a single buffer pool, but are still significantly slower than 10.3.5 with 4 buffer pools.

            axel Axel Schwenke made changes -
            Attachment MDEV-15058-RAM-Intel.ods [ 45344 ]
            axel Axel Schwenke made changes -
            Attachment MDEV-15058-RAM-Intel.ods [ 45345 ]
            axel Axel Schwenke made changes -
            Attachment MDEV-15058-RAM-ARM.ods [ 45346 ]
            axel Axel Schwenke added a comment -

            Attached results for ARM and the same setup (/dev/shm, write-heavy). Results are very similar to those for Intel. For 100% writes the number of page cleaners has some impact on performance, too.

            inaamrana Inaam Rana added a comment -

            Marko, I believe we won't only need multiple instances but also multiple background threads for flushing.

            We can think in terms of a single LRU flusher and a single page_cleaner. The code as it is written right now will serially go through each instance and do a batch (LRU batches are chunkized, though). Imagine if the last instance is the one most in need of flushing. Therefore it might make sense to have a configurable number of background flushing threads.
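
            A rough stand-alone sketch of that serial walk (hypothetical names, not the actual page cleaner code):

            #include <algorithm>
            #include <cstddef>
            #include <vector>

            struct buf_pool_stub { std::size_t n_dirty; };

            // Flush up to 'quota' dirty pages from one instance (placeholder for a real batch).
            static std::size_t flush_batch(buf_pool_stub& pool, std::size_t quota)
            {
              std::size_t n = std::min(pool.n_dirty, quota);
              pool.n_dirty -= n;
              return n;
            }

            // One cleaner pass: every instance is visited in order, so the instance that
            // needs flushing the most (here the last one) is only served after all others.
            static void page_cleaner_pass(std::vector<buf_pool_stub>& pools, std::size_t quota)
            {
              for (buf_pool_stub& pool : pools)
                flush_batch(pool, quota);
            }

            int main()
            {
              std::vector<buf_pool_stub> pools{{1000}, {10}, {10}, {5000}};
              page_cleaner_pass(pools, 100);
              return 0;
            }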

            With multiple threads we might want to tweak some other bits of code as well:

            • Try to avoid two threads working on the same instance, i.e. page_cleaner on the flush_list and lru_thread on the LRU list, as that will probably make buf_pool::mutex hotter.
            • Maybe have some way of knowing which instance is in urgent need of flushing and prioritize accordingly
            • We'll have multiple threads writing to the same doublewrite buffer. Perhaps it makes sense to have a separate dblwr buffer per instance
            • When dblwr is disabled, after posting one batch we call os_aio_wait_until_no_pending_writes() in buf_dblwr_sync_datafiles(). If we have multiple threads working we might get unnecessarily delayed by wait_until_no_pending_writes.
            • When dblwr is disabled we can shove a lot of IO requests into a single batch. The aio_array has a limited number of segments (one per IO thread) and each can take up to 256 pending IO requests. If we are using simulated AIO we call os_aio_simulated_wake_handler_threads() at the end of a batch. A batch can be quite large, and there can be multiple batches happening concurrently. I think it makes more sense to call os_aio_simulated_wake_handler_threads() more frequently (or have a mechanism inside os0file.cc where IO threads are woken up before we run out of slots)
            • See if the idea mentioned here is practicable: https://bugs.mysql.com/bug.php?id=74637

            laurynas Laurynas Biveinis added a comment -

            Inaam, Marko, the last reply prompts me to market our MT flusher / parallel doublewrite some more.

            Our design addresses the first three items: 1) each thread has its own private lru/flush list; 2) each thread has its own independent heuristics on when to flush; 3) parallel doublewrite.

            I believe it takes care of the issues in bug 74637 as well.

            wlad Vladislav Vaintroub added a comment - - edited

            I'm also reminded that I wanted to know more about the story of Linux AIO with segments per thread. The MariaDB design on Windows is (almost) such that any thread can take any IO completion; the segments do not play any role. (More accurately, there are actually 2 IO pools, one for read and one for write requests, and that worked around some deadlock I've seen in the past.)
            But I feel like getting rid of segments might improve scalability on Linux, like it did when MariaDB on Windows switched from prehistoric Win95-style async IO to the NT3.5-style one with a completion port.

            axel Axel Schwenke made changes -
            Attachment MDEV-15058-SSD-ARM.ods [ 45354 ]
            axel Axel Schwenke made changes -
            Attachment MDEV-15058-SSD-Intel.ods [ 45355 ]
            axel Axel Schwenke added a comment -

            Hi. I attached results for the datadir on SSD. The test was run on Intel and ARM again. Observations:

            ARM: the system is clearly IO-bound. While it does have an SSD, it's a rather slow one and it holds everything. iostat reports ~30% CPU time spent in iowait and 100% utilisation for the disk.

            Intel: the numbers are very similar to the ones for the datadir in RAM, just that the differences are a bit smoother. A difference that should be noted is that now the number of page cleaners has more impact. Check the diagrams at the bottom left (4 buffer pools, 1/2/4 page cleaner threads). iostat shows up to 45% iowait and the SSDs reach 98% utilisation. Unlike the ARM system, the Intel system has a dedicated SSD for the datadir, and it's actually two units in RAID-0.

            axel Axel Schwenke made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            marko Marko Mäkelä added a comment - - edited

            log_flush_order_mutex, which is shared between the flush_lists of the multiple buffer pool instances, may still be needed.

            If we removed this mutex, we would have to protect the buf_pool.flush_list with the log_sys.mutex, prolonging the hold time of that mutex. This might be acceptable, given that we only need to touch the flush_list when a page is modified for the first time. Most of the time, we could release the log_sys.mutex early.

            Either way, we would seem to need mtr_t::is_dirty() and mtr_t::is_block_dirtied().
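
            A stand-alone model of that ordering constraint (illustrative names only, not the actual mtr0mtr.cc code; it assumes the usual rule that blocks enter buf_pool.flush_list in increasing oldest-modification LSN order):

            #include <cstdint>
            #include <list>
            #include <mutex>

            struct block { uint64_t oldest_modification = 0; };

            static std::mutex log_sys_mutex;          // protects the log and the current LSN
            static std::mutex log_flush_order_mutex;  // serializes flush_list insertions
            static std::list<block*> flush_list;      // newest modifications at the front
            static uint64_t current_lsn = 0;

            // On mini-transaction commit: append redo log, then add a newly dirtied block
            // to the flush list. Taking log_flush_order_mutex before releasing
            // log_sys_mutex keeps the list sorted without holding log_sys_mutex for the
            // whole insertion.
            static void mtr_commit(block& b, uint64_t redo_len)
            {
              std::unique_lock<std::mutex> log_lock(log_sys_mutex);
              const uint64_t start_lsn = current_lsn;
              current_lsn += redo_len;                       // "write" the redo log
              std::lock_guard<std::mutex> order_lock(log_flush_order_mutex);
              log_lock.unlock();                             // release log_sys_mutex early
              if (b.oldest_modification == 0)                // only on the first modification
              {
                b.oldest_modification = start_lsn;
                flush_list.push_front(&b);                   // inserts happen in LSN order
              }
            }

            int main()
            {
              block b1, b2;
              mtr_commit(b1, 100);
              mtr_commit(b2, 200);
              return 0;
            }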

            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.4 [ 22408 ]
            Assignee Axel Schwenke [ axel ] Marko Mäkelä [ marko ]
            danblack Daniel Black made changes -
            anjumnaveed81 Anjum Naveed added a comment -

            I did test the impact of multiple buffer pool instances sometime back. The purpose was different so I will need to rerun the tests for reporting purposes. I found the bottleneck was the doublewrite buffer. When doublewrite buffer was turned off, multiple buffer pool instances resulted in improvement on Intel as well as ARM system. I did not modify any code so I cannot comment on flushing mechanism.

            In addition to updating the flushing mechanism, I am in strong support of the suggestion from Inaam Rana "We'll have multiple threads writing to same doublwrite buffer. Perhaps makes sense to have separate dblwr buffer per instance".

            ralf.gebhardt Ralf Gebhardt made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            anjumnaveed81 Anjum Naveed added a comment -

            Dear all,

            I have done the following (code base is version 10.5):
            1- Created a separate file for the doublewrite buffer, independent of the trx_sys tablespace. This file takes an independent directory path so that I can move it around.
            2- The number of doublewrite buffer files is tied to the number of buffer pool instances, with one buf_dblwr data structure (and hence mutex) for each instance.
            3- A small modification in flush_buffered_writes. Instead of flushing in a loop over instances, I provide the instance as an argument so that each doublewrite buffer file can be written independently, rather than in sequence. (This is then controlled by the page cleaner threads.)
            4- Moved the doublewrite folder to RAMDISK (in actual systems it should be in a separate disk at least)
            5- I have intentionally left page cleaner code unchanged because I wanted to see the impact of multiple buffer pool instances coupled with multiple files to write into. (More on this in a minute).

            At present I am testing on my development laptop, so I do not have the liberty to move the doublewrite buffers to a separate disk, hence the RAMDISK, which will provide an upper-bound improvement.
            I have used hammerdb tpcc test. I have tested 4 buffer pool instances vs 1 buffer pool instance of modified code and original code. When I use 4 buffer pool instances with 4 page cleaner threads, I get about 19% improvement over the unmodified code. On the other hand, when I use single buffer pool instance, improvement is 33%. (Keep in mind that doublewrite buffer files are in RAMDISK).

            If I use an actual hard drive for the doublewrite buffer files, I will still get an improvement, although not as much. As far as this test is concerned, it supports the use of a single buffer pool instead of multiple buffer pool instances. However, looking at the code, I believe it is the page cleaner and the way pages are being distributed to instances that is holding the performance down, and not the actual use of multiple files and buffer pool instances.

            I still believe multiple buffer pool instances are the way to go, especially for large systems, and we need to improve the things around buffer pools (specifically the page cleaner code). Please suggest whether more time should be spent in this direction OR we have already decided to use a single buffer pool.

            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            danblack Daniel Black made changes -

            marko Marko Mäkelä added a comment -

            anjumnaveed81, I am sorry for missing your updates until now. I would love to see your changes, preferably in the form of a pull request against the current 10.5 branch in https://github.com/MariaDB/server/.

            I think that your work fits under MDEV-16526, whose objectives include removing scalability bottlenecks of a single buffer pool instance.


            marko Marko Mäkelä added a comment -

            axel, can you please test a single buffer pool instance vs. multiple buffer pools in a write-heavy workload on the latest 10.5? Maybe things have improved since the previous test and we do not need to wait for MDEV-16526? Remember to test with the doublewrite buffer disabled, because that is an obvious bottleneck.

            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Axel Schwenke [ axel ]
            axel Axel Schwenke made changes -
            Attachment MDEV-15058-10.4.10.ods [ 50230 ]
            Attachment MDEV-15058-10.4vs10.5.ods [ 50231 ]
            Attachment MDEV-15058-10.5.ods [ 50232 ]
            axel Axel Schwenke added a comment -

            Attached 3 new spread sheets with results for 10.5.0 and 10.4.10. After seeing the numbers for 10.5.0 I decided to run the workload with latest 10.4 for comparison. In write-heavy benchmarks (see the comparative spread sheet) 10.5 is up to 40% slower than 10.4. Also there are some anomalies like very poor performance with a single buffer pool, getting back to normal with multiple pools.
            I'm afraid those numbers must be taken with a big spoonful (not just a grain) of salt. The CPU usage numbers taken during the benchmark runs suggest that a lot of time is spent in mutex waits. I would not conclude anything from those numbers without looking at detailed mutex stats (which were unfortunately not monitored during that benchmark run).


            marko Marko Mäkelä added a comment -

            Thank you, axel! I discussed this briefly with wlad (who benchmarked MDEV-16264 and claimed that with a single buffer pool, the flushing performance was actually better).

            MDEV-15058-10.5.ods Sheet 7 "OLTP, 32 tables, uniform rng, small BP, no-dblwrite, SSD" is for 50%, 80% and 100% write ratio. For smaller write ratio, it suggests that 4 buffer pools are giving the optimal performance. The difference between 1 and 4 buffer pool instances is 10.1% for 50% writes and 5.7% for 100% writes. I would expect the difference to be even bigger with 0% write ratio. The figures seem to suggest that page flushing is not a bottleneck at all.

            It seems that we could have a bottleneck for things that affect page lookups or page replacement (eviction and reloading) in the buffer pool. With multiple buffer pools, we have a partitioning function:

            inline buf_pool_t* buf_pool_get(const page_id_t page_id)
            {
                    /* 2log of BUF_READ_AHEAD_AREA (64) */
                    ulint		ignored_page_no = page_id.page_no() >> 6;
             
                    page_id_t	id(page_id.space(), ignored_page_no);
             
                    ulint		i = id.fold() % srv_buf_pool_instances;
             
                    return(&buf_pool_ptr[i]);
            }
            

            The first hash function is:

            	ulint fold() const { return (m_space << 20) + m_space + m_page_no; }
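
            For illustration, a stand-alone sketch (assumed tablespace id and instance count) of how the two functions above spread consecutive 64-page extents over buffer pool instances:

            #include <cstdio>

            // Same formula as fold() above.
            static unsigned fold(unsigned space, unsigned page_no)
            {
              return (space << 20) + space + page_no;
            }

            int main()
            {
              const unsigned n_instances = 4;   // srv_buf_pool_instances
              const unsigned space = 5;         // arbitrary tablespace id
              for (unsigned page_no = 0; page_no < 8 * 64; page_no += 64)
              {
                // buf_pool_get(): the low 6 bits (the 64-page read-ahead area) are ignored
                const unsigned instance = fold(space, page_no >> 6) % n_instances;
                std::printf("page %4u -> buffer pool instance %u\n", page_no, instance);
              }
              return 0;
            }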
            

            If we are lucky, there might be a trivial bottleneck on buf_pool->page_hash that maps page numbers into block descriptors, specifically on the partitioned rw-latch that protects it:

            hash_lock = hash_get_lock(buf_pool->page_hash, page_id.fold());
            

            The rw_lock_t for the page hash table is partitioned into srv_n_page_hash_locks (default 16).
            I would suggest applying the following patch against the latest 10.5 (at least 3a3605f4b1ad08bbcb823cd41b724f2def9f2ba3):

            diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
            index 17e3d1fa968..234271047a6 100644
            --- a/storage/innobase/handler/ha_innodb.cc
            +++ b/storage/innobase/handler/ha_innodb.cc
            @@ -1,3 +1,4 @@
            +#define UNIV_PERF_DEBUG
             /*****************************************************************************
             
             Copyright (c) 2000, 2019, Oracle and/or its affiliates. All Rights Reserved.
            

            This will expose the parameter innodb_page_hash_locks (default 16, ranging from 1 to 1024). I would suggest rerunning the benchmark with innodb_page_hash_locks=64, with 0%, 50%, 80% and 100% writes.

            I would also suggest compiling with cmake -DPLUGIN_PERFSCHEMA=NO to see a ‘more pure’ performance difference. This benchmark should only need to compare the latest 10.5 development snapshot, with different innodb_buffer_pool_instances, on a buffer pool that is smaller than the table.

            axel Axel Schwenke made changes -
            Attachment MDEV-15058-10.5-dev.ods [ 50267 ]
            axel Axel Schwenke added a comment -

            I attached a new sheet, MDEV-15058-10.5-dev.ods. It compares the workload from former sheet 7 (small BP, SSD) for 10.5.0 (as before), 10.5 head (commit cc3135cf), and 10.5 head + innodb_page_hash_locks=64 (PERF_DEBUG patch).

            It now seems that more than 4 buffer pools are not good performance-wise.

            wlad Vladislav Vaintroub added a comment - - edited

            There are some big dips in Sheet #3, for example at 50% writes:

            64 7214 13638 18248 9020 15435 6636
            128 41 7499 17 16 4153 15603

            and for other N% writes, too. What is it?


            marko Marko Mäkelä added a comment -

            I had a word with axel. He noted that earlier supposedly read-only benchmarks were accidentally read/write. Based on performance_schema output that he shared, there were some unexpected anomalies. I suggested retesting with innodb_stats_persistent=OFF and innodb_change_buffering=none to make the benchmark more deterministic.

            axel Axel Schwenke made changes -
            Attachment MDEV-15058-10.5-34dafb7e3a8.ods [ 50309 ]
            axel Axel Schwenke added a comment - - edited

            New numbers attached in MDEV-15058-10.5-34dafb7e3a8.ods

            This test uses a fresh build from the 10.5 branch and a changed InnoDB configuration. The biggest difference comes from innodb_file_per_table=0. I have also set skip-innodb_adaptive_hash_index, skip-innodb-stats-persistent and innodb-change-buffering=none.

            Now the numbers are quite smooth, with steady throughput during the benchmark runtime. The bottleneck is now the disk: read-only reads ~860MB/s from the disk and CPU usage is 19% user, 9% system, 72% iowait. Read-write reads ~300MB/s and writes ~270MB/s from/to disk at 6.5% user, 3.5% system, 50% iowait and 40% idle.

            I reconfigured everything to put the datadir into a RAM disk (/dev/shm). This dramatically increases the throughput. Read-only now uses CPU at 76% user, 24% system. Read-write is 63% user, 27% system, 10% idle.

            Using multiple buffer pools with the datadir on SSD has little impact on RO performance but a visible impact on RW performance, with the optimum at 4 BP. With the datadir in RAM, multiple buffer pools increase both RO and RW performance. Performance increases from 1 over 2 to 4 BP and then stays stable at higher BP numbers (tested up to 32).

            marko Marko Mäkelä added a comment - - edited

            axel, thank you! With those parameters and changes to the benchmark, we avoid hitting the following bottlenecks, which are independent of the number of InnoDB buffer pool instances:

            • ENGINE=Aria operations on internal temporary tables: caused by range scans, maybe due to occasionally wrong statistics leading to suboptimal query plans?
            • InnoDB adaptive hash index: it sometimes helps, sometimes hurts performance (see MDEV-17492)
            • fil_system.mutex contention due to writing MLOG_FILE_NAME records when an .ibd file is first modified since a log checkpoint: should be improved when MDEV-14425 introduces a separate log file for checkpoints and file operations. Because log checkpoints are triggered ‘randomly’, so will these contentions. During these operations, we are not holding any buf_pool mutexes.
            • InnoDB persistent statistics collection could kick in randomly, triggering large index scans, affecting concurrent workload.
            • InnoDB change buffering could cause a lot of extra I/O, for avoiding one read. Maybe it is not at all useful with SSD nowadays.

            Were the last runs with the default value of innodb_page_hash_locks=16?

            Edit: Because the page flushing seems optimal at innodb_buffer_pool_instances=4, it looks like we may have to run new benchmarks after MDEV-16526 and MDEV-21534 have been completed.

            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            axel Axel Schwenke made changes -
            Attachment ramdisk-ro1.svg [ 50318 ]
            Attachment ramdisk-ro4.svg [ 50319 ]
            Attachment ramdisk-rw1.svg [ 50320 ]
            Attachment ramdisk-rw4.svg [ 50321 ]
            axel Axel Schwenke added a comment - - edited

            I attached CPU Flame Graphs (http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html) for the following 4 scenarios:

            1. single BP, OLTP read-only (ramdisk-ro1.svg)
            2. single BP, OLTP read-write, 80% writes (ramdisk-rw1.svg)
            3. 4 BP, OLTP read-only (ramdisk-ro4.svg)
            4. 4 BP, OLTP read-write, 80% writes (ramdisk-rw4.svg)

            In all cases the datadir was in memory (/dev/shm) and there were 32 benchmark threads running. That should be the sweet spot, as the hardware can run 32 concurrent (hyper)threads.
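Such graphs are typically generated along these lines with perf and the FlameGraph scripts from the linked page (a sketch; the sampling frequency, duration and script paths are assumptions, not the exact commands used for the attached files):

# sample on-CPU stacks of the whole machine for 60 s while the benchmark is running
perf record -F 99 -a -g -- sleep 60
# fold the stacks and render an interactive SVG
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu-flamegraph.svg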


            marko Marko Mäkelä added a comment -

            Between ramdisk-rw1.svg and ramdisk-rw4.svg, the major difference seems to be that in the page fault handling for ha_innobase::records_in_range(), buf_LRU_get_free_block(buf_pool_t*) takes much longer with a single buffer pool than with 4 buffer pool instances.
            Can we get a perf annotate or similar report for this function and its callees, to highlight the difference in more detail?
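Something along these lines should produce that report (a sketch; the pid lookup and symbol-name handling are assumptions, and mysqld debug symbols need to be installed):

# record ~60 s of call-graph samples from the running server
perf record -g -p $(pidof mysqld) -- sleep 60
# browse the profile; pressing 'a' on a symbol opens the annotated source/assembly
perf report
# or annotate the suspected function directly (the name may need to match the demangled form shown by perf report)
perf annotate buf_LRU_get_free_block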


            marko Marko Mäkelä added a comment -

            wlad, can you please try to figure out the bottleneck? I wonder if it could be related to buf_pool->LRU_old in some way.

            marko Marko Mäkelä made changes -
            Assignee Axel Schwenke [ axel ] Vladislav Vaintroub [ wlad ]
            wlad Vladislav Vaintroub made changes -
            Attachment 1bp.txt [ 50466 ]
            Attachment 4bp.txt [ 50467 ]

            wlad Vladislav Vaintroub added a comment -

            So, I ran this benchmark, which I think resembles axel's "sweet spot" closely enough.

            my.cnf

            [mysqld]
             
            #####non innodb options
            max_connections = 300
            table_open_cache = 600
            query_cache_type = 0
             
            #####innodb options
            innodb_buffer_pool_size = 1G
            innodb_log_buffer_size = 32M
            innodb_log_file_size = 512M
            innodb_flush_log_at_trx_commit = 2
            innodb_doublewrite = 0
             
            loose-innodb_adaptive_hash_index_partitions = 32
            loose-innodb_adaptive_hash_index_parts = 32
             
            #####SSD
            innodb-flush-method = O_DIRECT
            innodb_io_capacity = 4000
            loose-innodb_flush_neighbors = 0
            innodb_write_io_threads = 8
             
            #####the variables for this test
            innodb_buffer_pool_instances = 1
             
            innodb_max_dirty_pages_pct = 99
            skip-innodb_adaptive_hash_index
            skip-innodb-stats-persistent
            innodb-change-buffering=none
            innodb_file_per_table = 0
            

            script to run with sysbench 1.0

            sysbench --test=/usr/share/sysbench/oltp_update_index.lua   --tables=32 --table-size=1250000  --rand-seed=42 --rand-type=uniform --num-threads=32 --report-interval=2  --mysql-socket=/tmp/mysql.sock --time=300  --max-requests=0 --mysql-user=root --percentile=95 $1
            

            where $1 is either "prepare" or "run" (you need to have a database called sbtest)
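In other words, the sequence is roughly the following, assuming the sysbench command above is saved in a wrapper script (bench.sh is a hypothetical name):

mysql -uroot -e 'CREATE DATABASE IF NOT EXISTS sbtest'
./bench.sh prepare   # load 32 tables with 1.25M rows each
./bench.sh run       # 300-second measurement run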

            A note on the benchmark itself: it uses a very low buffer pool to data size ratio (I believe the data would be around 8-10 GB if it were in separate files rather than in ibdata1) and only a 1 GB buffer pool, so it is designed to be I/O intensive. It uses only about 2 of the 56 CPUs on the benchmark machine (and the difference between 1 and 4 buffer pools was not obvious in "top").

            For the benchmarks, I ran the server with innodb_buffer_pool_instances set to either 1 or 4.

            4 buffer pools win against 1 buffer pool, with about 9000 tps versus around 6000 tps; at least we can say that whatever Axel found is reproducible for this use case.

            I attached the pt-pmp output 1bp.txt (single buffer pool instance) and 4bp.txt (4 instances), made with 20 samples separated by a 10-second delay (if someone knows a more modern tool for profiling contention, please tell).

            From that, I grepped for TTAS to find the lines with InnoDB mutexes (but please also take a look at anything else; maybe I missed something).
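That grep boils down to something like the following (a sketch; 'TTAS' matches the InnoDB test-and-set mutex frames, e.g. TTASEventMutex, in the aggregated stacks):

# count how many aggregated stack lines mention an InnoDB TTAS mutex
grep -c TTAS 1bp.txt 4bp.txt
# show the first matches with their line numbers for closer inspection
grep -n TTAS 1bp.txt | head -n 20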

            I think the contention might be on the buffer pool mutex in buf_page_io_complete() (buf0buf.cc:6019); at least it appears rather often in 1bp.txt in a couple of different call stacks. Here is the code in question:
            https://github.com/MariaDB/server/blob/f3dac591747dfbd88bd8ae2855f9a0e64006ce75/storage/innobase/buf/buf0buf.cc#L6019


            marko Marko Mäkelä added a comment -

            I ported the changes to 10.5. Given that our benchmark was extreme, and that normally the doublewrite buffer would be a scalability bottleneck for write-intensive workloads, I think that it should be an acceptable change. In most cases, a single buffer pool performed at least as well as multiple ones. Removing the code to handle multiple buffer pool instances could slightly improve the overall performance and open up opportunities to make more use of std::atomic (in MDEV-15053).

            marko Marko Mäkelä made changes -
            Assignee Vladislav Vaintroub [ wlad ] Marko Mäkelä [ marko ]
            marko Marko Mäkelä made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            marko Marko Mäkelä added a comment - - edited

            I did some more testing, checked the buildbot results (no failures), and fixed a bug that was caught in the MDEV-12353 branch only. Pushed to 10.5.

            marko Marko Mäkelä made changes -
            issue.field.resolutiondate 2020-02-12 12:59:27.0 2020-02-12 12:59:27.063
            marko Marko Mäkelä made changes -
            Fix Version/s 10.5.1 [ 24029 ]
            Fix Version/s 10.5 [ 23123 ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]

            marko Marko Mäkelä added a comment -

            Based on the feedback of wlad, I reverted changes to some INFORMATION_SCHEMA.INNODB_ tables. We will return a dummy buffer pool identifier 0, for compatibility.

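So a query such as the following keeps working, but the pool identifier column is now always 0 (a sketch; column list abbreviated):

mysql -e "SELECT pool_id, pool_size, free_buffers, database_pages FROM information_schema.innodb_buffer_pool_stats"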
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            marko Marko Mäkelä added a comment -

            The slightly increased contention in buf_page_io_complete() in the write-heavy workloads when moving from 4 buffer pool instances to 1 would likely not be helped by MDEV-15053. That function would still acquire buf_pool_t::mutex (which was renamed to buf_pool_t::LRU_list_mutex).

            MDEV-15053 did not show improved performance when wlad was testing it. To keep the latching rules more understandable and to avoid race conditions, it might be best to omit most of those changes.

            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 85121 ] MariaDB v4 [ 133448 ]
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            marko Marko Mäkelä made changes -

            People

              Assignee: Marko Mäkelä (marko)
              Reporter: Marko Mäkelä (marko)
              Votes: 0
              Watchers: 18

