[MDEV-15058] Remove multiple InnoDB buffer pool instances Created: 2018-01-24 Updated: 2020-10-15 Resolved: 2020-02-12 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Fix Version/s: | 10.5.1 |
| Type: | Task | Priority: | Critical |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | performance | ||
| Attachments: |
|
| Issue Links: |
|
| Description |
|
I started to wonder whether multiple InnoDB buffer pools actually help with any workloads. Yes, it probably was a good idea to split the buffer pool mutex when Inaam Rana introduced multiple buffer pools in MySQL 5.5.5, but since then, there have been multiple fixes to reduce contention on the buffer pool mutex, such as Inaam's follow-up fix in MySQL 5.6.2 to use rw-locks instead of mutexes for the buf_pool->page_hash. In MySQL 8.0.0, Shaohua Wang implemented one more thing that MariaDB should copy. I think that we should seriously consider removing all code to support multiple buffer pools or page cleaners. The description of WL#6642: InnoDB: multiple page_cleaner threads seems to imply that it may have been a mistake to partition the buffer pool. Note: partitioning or splitting mutexes often seems to be a good idea, but partitioning data structures or threads might not be. axel, please test different workloads with innodb_buffer_pool_instances=1 and innodb_page_cleaners=1, and compare the performance to configurations that use multiple buffer pools (and page cleaners). If using a single buffer pool instance never seems to cause any regression, I think that we should simplify the code. |
| Comments |
| Comment by Axel Schwenke [ 2018-01-31 ] | ||||||||||||||||||||||||||||||||
|
I did a range of tests on two machines.
The MariaDB version used was 10.3.4 (built locally). The benchmark was sysbench in different variations and workloads (see the individual sheets). The results are a mixed bag. On Intel it looks like multiple BP partitions don't help with performance, certainly not for the INSERT workload. Those are sheets 1 and 2. Sheets 3 and 4 are for Intel, sysbench OLTP with a varying percentage of writes. Here it looks like we get small benefits for read-only, but the more writes are done and the more BP partitions we have, the worse things get. Sheet 5 is ARM, sysbench OLTP ro/rw and wo. Here we have no clear verdict; it seems that 16 or 32 buffer pools do give a benefit. Sheet 6 is not about buffer pool partitions but AHI partitions. This one is quite clear: increasing AHI partitions up to 32 is good for performance. Actually I ran this test first and used 32 for the other tests. About the attached files: LibreOffice messes up the conditional formatting of the cells, so I attach the sheet also as PDF. The cells in the "throughput per used core" tables are color-coded: red means "more than 1% slower than 1 partition", green means "more than 1% faster than 1 partition". "Throughput per used core" is system throughput (qps) divided by min(benchmark threads, available hw threads). On a perfectly scaling system it would give the same number independent of the benchmark thread count. | ||||||||||||||||||||||||||||||||
| Comment by Axel Schwenke [ 2018-02-08 ] | ||||||||||||||||||||||||||||||||
|
Attached new results. Those numbers are for MariaDB 10.3, commit c0d5d7c0. Again there are results for Intel (first 4 sheets) and ARM (last sheet). For ARM the situation is rather clear: multiple buffer pools have a slight negative impact on performance. Also clear is the situation for Intel and the INSERT-only workload. | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-02-13 ] | ||||||||||||||||||||||||||||||||
|
axel, can you please test innodb_buffer_pool_instances=1 with
It is | ||||||||||||||||||||||||||||||||
| Comment by Axel Schwenke [ 2018-02-20 ] | ||||||||||||||||||||||||||||||||
|
I added two more sheets (again: .ods and .pdf) with numbers for commit 2b97d02 - further referenced as 10.3.5-thiru. There are system- and per core throughput with color coding as before: red cells indicate performance more than 1% worse than single buffer pool, green cells indicate performance more than 1% better than single buffer pool. | ||||||||||||||||||||||||||||||||
| Comment by Axel Schwenke [ 2018-02-20 ] | ||||||||||||||||||||||||||||||||
|
Added numbers for TPC-C, an OLTP-type benchmark. Specifically this is the TPC-C implementation named "HammerDB". The workload includes reads and writes and gives two numbers: TPM = system throughput in transactions per minute, and NOPM = new orders per minute. The NOPM number is comparable between different databases; the TPM number is specific to a DBMS. | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-02-23 ] | ||||||||||||||||||||||||||||||||
|
Thank you, axel! Since the 10.3.4 release there have been quite a few performance improvements, most notably Similarly, with MDEV-15058-thiru.pdf Meanwhile, there is the separate branch bb-10.3- I have the feeling that removing the code to deal with multiple buffer pools might not improve performance enough to beat innodb_buffer_pool_instances=4 in write-heavy benchmarks on the test system. Write performance could be improved by changing the page flushing algorithms and data structures. | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-02-24 ] | ||||||||||||||||||||||||||||||||
|
Removing multiple buffer pools necessarily means removing multiple page cleaner threads as well. | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-03-01 ] | ||||||||||||||||||||||||||||||||
|
I created another version of this patch. It would be interesting to benchmark it against the base version, to see if we really need | ||||||||||||||||||||||||||||||||
| Comment by Axel Schwenke [ 2018-03-07 ] | ||||||||||||||||||||||||||||||||
|
I added a new spreadsheet with numbers from two machines:

1. My Intel machine (16 cores, 32 hw threads). This has 10.3.5 numbers only for 4 buffer pools. Here all 3 contenders behave very much the same for read-only or read-mostly workloads. For write-intensive workloads, 10.3.5 is fastest.

2. The ARM server (46 cores). Here I ran the test for 1..64 buffer pools. Again for read-only and read-write there are only small differences, and the number of buffer pools doesn't matter much. For the write-only workload two buffer pools give the best results for 10.3.5, but the bb-10.3-mdev-15058 tree performs equally or better. | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-03-09 ] | ||||||||||||||||||||||||||||||||
|
axel, the latest benchmarks seem to be an "approval" to go ahead and remove multiple buffer pools. However, I wonder how big the workload size is compared to the buffer pool size. I would like to ask for one more set of write-heavy benchmarks, comparing to multiple buffer pools, and the buffer pool size being about 10% of the workload size. The minimum for using multiple innodb_buffer_pool_instances is innodb_buffer_pool_size=1g. The writes could be updates with uniformly distributed keys, so that a large number of pages get dirtied, instead of dirtying only a few pages. The reason for this kind of benchmark is that the buffer pool mutex(es) are acquired mostly when pages are flushed, evicted or loaded to the buffer pool. If the workload mostly fits in the buffer pool, then there should not be that much load on the buffer pool. There is a concern that with a single buffer pool and a single page cleaner (there cannot be more page cleaner threads than buffer pool instances) the flushing rate would be too small to saturate an SSD, and this could limit the performance of write-heavy workloads. | ||||||||||||||||||||||||||||||||
| Comment by Laurynas Biveinis [ 2018-03-12 ] | ||||||||||||||||||||||||||||||||
|
Consider separating LRU flushing from flush list flushing, which are not really related to one another anyway: https://www.percona.com/blog/2016/05/05/percona-server-5-7-multi-threaded-lru-flushing/ This way, with a single instance you'd have two flusher threads. If their CPU priority is high, flushing should never be bound by a lack of threads. | ||||||||||||||||||||||||||||||||
| Comment by Axel Schwenke [ 2018-03-13 ] | ||||||||||||||||||||||||||||||||
|
New results added. This is with a completely different setup:
Now there is a significant regression when switching to a single buffer pool. The number of page cleaners also has an impact, but a much smaller one. The experimental "single buffer pool" trees behave better than 10.3.5 with a single buffer pool, but are still significantly slower than 10.3.5 with 4 buffer pools. | ||||||||||||||||||||||||||||||||
| Comment by Axel Schwenke [ 2018-03-13 ] | ||||||||||||||||||||||||||||||||
|
Attached results for ARM and the same setup (/dev/shm, write-heavy). Results are very similar to those for Intel. For 100% writes the number of page cleaners has some impact on performance, too. | ||||||||||||||||||||||||||||||||
| Comment by Inaam Rana [ 2018-03-15 ] | ||||||||||||||||||||||||||||||||
|
Marko, I believe we will not only need multiple instances but also multiple background threads for flushing. We can think in terms of a single LRU flusher and a single page_cleaner. The code as it is written right now will serially go through each instance and do a batch (LRU batches are chunked, though). Imagine if the last instance is the one most in need of flushing. Therefore it might make sense to have a configurable number of background flushing threads. With multiple threads we might want to tweak some other bits of code as well:
| ||||||||||||||||||||||||||||||||
| Comment by Laurynas Biveinis [ 2018-03-16 ] | ||||||||||||||||||||||||||||||||
|
Inaam, Marko, the last reply prompts me to market our MT flusher / parallel doublewrite some more. Our design addresses the first three items: 1) each thread has its own private lru/flush list; 2) each thread has own independent heuristics on when to flush; 3) parallel doublewrite. I believe it takes care of the issues in bug 74637 as well. | ||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2018-03-16 ] | ||||||||||||||||||||||||||||||||
|
I'm also reminded that I wanted to know more about the story of Linux AIO with segments per thread. The MariaDB design on Windows is (almost) such that any thread can take any IO completion; the segments do not play any role. (More accurately, there are actually 2 IO pools, one for read and one for write requests, which worked around some deadlock I've seen in the past.) | ||||||||||||||||||||||||||||||||
| Comment by Axel Schwenke [ 2018-03-16 ] | ||||||||||||||||||||||||||||||||
|
Hi. I attached results for the datadir on SSD. The test was run on Intel and ARM again. Observations: ARM: the system is clearly IO-bound. While it does have an SSD, it's a rather slow one and it holds everything. iostat reports ~30% cpu time spent in iowait and 100% utilisation for the disk. Intel: the numbers are very similar to the ones for the datadir in RAM, just that the differences are a bit smoother. A difference that should be noted is that now the number of page cleaners has more impact. Check the diagrams at the bottom left (4 buffer pools, 1/2/4 page cleaner threads). iostat shows up to 45% iowait and the SSDs reach 98% utilisation. Unlike the ARM system, the Intel system has a dedicated SSD for the datadir, and it's actually two units in RAID-0. | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-06-19 ] | ||||||||||||||||||||||||||||||||
|
log_flush_order_mutex, which was shared between the flush_list of multiple buffer pool instances, may still be needed. If we removed this mutex, we would have to protect the buf_pool.flush_list with the log_sys.mutex, prolonging the hold time of that mutex. This might be acceptable, given that we only need to touch the flush_list when a page is modified for the first time. Most of the time, we could release the log_sys.mutex early. Either way, we would seem to need mtr_t::is_dirty() and mtr_t::is_block_dirtied(). | ||||||||||||||||||||||||||||||||
| Comment by Anjum Naveed [ 2019-05-06 ] | ||||||||||||||||||||||||||||||||
|
I did test the impact of multiple buffer pool instances some time back. The purpose was different, so I will need to rerun the tests for reporting purposes. I found the bottleneck was the doublewrite buffer. When the doublewrite buffer was turned off, multiple buffer pool instances resulted in an improvement on the Intel as well as the ARM system. I did not modify any code, so I cannot comment on the flushing mechanism. In addition to updating the flushing mechanism, I am in strong support of the suggestion from Inaam Rana: "We'll have multiple threads writing to same doublwrite buffer. Perhaps makes sense to have separate dblwr buffer per instance". | ||||||||||||||||||||||||||||||||
| Comment by Anjum Naveed [ 2019-05-22 ] | ||||||||||||||||||||||||||||||||
|
Dear all, I have done the following (code base is version 10.5): At present I am testing on my development laptop, so I do not have the liberty to move doublewrite buffers to a separate disk, hence the RAMDISK, which will provide an upper-bound improvement. If I use an actual hard drive for the doublewrite buffer files, I will still get an improvement, although not as much. As far as this test is concerned, it supports the use of a single buffer pool instead of multiple buffer pool instances. However, looking at the code, I believe it is the page cleaner and the way pages are being distributed to instances that is holding the performance down, not the actual use of multiple files and buffer pool instances. I still believe multiple buffer pool instances are the way to go, especially for large systems, and we need to improve the things around buffer pools (specifically the page cleaner code). Please suggest whether more time should be spent in this direction, or whether we have already decided to use a single buffer pool. | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-12-11 ] | ||||||||||||||||||||||||||||||||
|
anjumnaveed81, I am sorry for missing your updates until now. I would love to see your changes, preferably in the form of a pull request against the current 10.5 branch in https://github.com/MariaDB/server/. I think that your work fits under | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-12-17 ] | ||||||||||||||||||||||||||||||||
|
axel, can you please test a single buffer pool instance vs. multiple buffer pools in a write-heavy workload on the latest 10.5? Maybe things have improved since the previous test and we do not need to wait for | ||||||||||||||||||||||||||||||||
| Comment by Axel Schwenke [ 2020-01-09 ] | ||||||||||||||||||||||||||||||||
|
Attached 3 new spread sheets with results for 10.5.0 and 10.4.10. After seeing the numbers for 10.5.0 I decided to run the workload with latest 10.4 for comparison. In write-heavy benchmarks (see the comparative spread sheet) 10.5 is up to 40% slower than 10.4. Also there are some anomalies like very poor performance with a single buffer pool, getting back to normal with multiple pools. | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-01-09 ] | ||||||||||||||||||||||||||||||||
|
Thank you, axel! I discussed this briefly with wlad (who benchmarked MDEV-15058-10.5.ods). It seems that we could have a bottleneck for things that affect page lookups or page replacement (eviction and reloading) in the buffer pool. With multiple buffer pools, we have a partitioning function:
The first hash function is:
If we are lucky, there might be a trivial bottleneck on buf_pool->page_hash that maps page numbers into block descriptors, specifically on the partitioned rw-latch that protects it:
The rw_lock_t for the page hash table is partitioned into srv_n_page_hash_locks (default 16).
This will expose the parameter innodb_page_hash_locks (default 16, ranging from 1 to 1024). I would suggest rerunning the benchmark with innodb_page_hash_locks=64, with 0%, 50%, 80% and 100% writes. I would also suggest compiling with cmake -DPLUGIN_PERFSCHEMA=NO to see a "more pure" performance difference. This benchmark should only need to compare the latest 10.5 development snapshot, with different innodb_buffer_pool_instances, on a buffer pool that is smaller than the table. | ||||||||||||||||||||||||||||||||
| Comment by Axel Schwenke [ 2020-01-15 ] | ||||||||||||||||||||||||||||||||
|
I attached a new sheet, MDEV-15058-10.5-dev.ods. It now seems that more than 4 buffer pools are not good performance-wise. | ||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2020-01-15 ] | ||||||||||||||||||||||||||||||||
|
there are some big dips in Sheet #3 (50% writes),
and for other N% writes, too | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-01-17 ] | ||||||||||||||||||||||||||||||||
|
I had a word with axel. He noted that earlier supposedly read-only benchmarks were accidentally read/write. Based on performance_schema output that he shared, there were some unexpected anomalies. I suggested retesting with innodb_stats_persistent=OFF and innodb_change_buffering=none to make the benchmark more deterministic. | ||||||||||||||||||||||||||||||||
| Comment by Axel Schwenke [ 2020-01-24 ] | ||||||||||||||||||||||||||||||||
|
New numbers attached in MDEV-15058-10.5-34dafb7e3a8.ods This test uses a fresh build from the 10.5 branch and a changed InnoDB configuration. The biggest difference comes from innodb_file_per_table=0. I also have skip-innodb_adaptive_hash_index, skip-innodb-stats-persistent and innodb-change-buffering=none. Now the numbers are quite smooth, with steady throughput during the benchmark runtime. The bottleneck is now the disk: read-only reads ~860MB/s from the disk, and cpu usage is 19% user, 9% system, 72% iowait. Read-write reads ~300MB/s and writes ~270MB/s from/to disk at 6.5% user, 3.5% system, 50% iowait and 40% idle. I reconfigured everything to put the datadir into a RAM disk (/dev/shm). This dramatically increases the throughput. Read-only now uses cpu at 76% user, 24% system. Read-write is 63% user, 27% system, 10% idle. Using multiple buffer pools with the datadir on SSD has not much impact on RO performance but a visible impact on RW performance, with the optimum at 4 BP. With the datadir in RAM, multiple buffer pools increase both RO and RW performance. Performance increases from 1 over 2 to 4 BP and then stays stable at higher BP numbers (tested up to 32). | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-01-27 ] | ||||||||||||||||||||||||||||||||
|
axel, thank you! With those parameters and changes to the benchmark, we avoid hitting the following bottlenecks, which are independent of the number of InnoDB buffer pool instances:
Were the last runs with the default value of innodb_page_hash_locks=16? Edit: Because the page flushing seems optimal at innodb_buffer_pool_instances=4, it looks like we may have to run new benchmarks after | ||||||||||||||||||||||||||||||||
| Comment by Axel Schwenke [ 2020-01-28 ] | ||||||||||||||||||||||||||||||||
|
I attached CPU Flame Graphs (http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html) for the following 4 scenarios:
In any case the datadir was in memory (/dev/shm) and there were 32 benchmark threads running. That should be the sweet spot, as the hardware can do 32 concurrent (hyper)threads. | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-01-28 ] | ||||||||||||||||||||||||||||||||
|
Between ramdisk-rw1.svg | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-01-28 ] | ||||||||||||||||||||||||||||||||
|
wlad, can you please try to figure out the bottleneck? I wonder if it could be related to buf_pool->LRU_old in some way. | ||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2020-02-11 ] | ||||||||||||||||||||||||||||||||
|
So, I ran this benchmark, which I think resembles axel's "sweet spot" closely enough. my.cnf
script to run with sysbench 1.0
where $1 is either "prepare" or "run" (you need to have a database called sbtest). A note on the benchmark itself: it uses a very low buffer-pool-to-data-size ratio (I believe the data would be around 8-10GB if it were in files rather than ibdata1) and only a 1GB buffer pool, so it is designed to be IO-intensive. It uses just 2 of the 56 CPUs on the benchmark machine (and the difference between 1 and 4 buffer pools was not obvious in "top"). For the benchmarks, I ran the server with innodb_buffer_pool_instances set to either 1 or 4. 4 buffer pools win against 1 buffer pool with about 9000 tps against around 6000 tps; at least we can say that whatever Axel had found is reproducible for this use case. I attached the pt-pmp output 1bp.txt From that, I grepped for TTAS to find the lines with InnoDB mutexes (but please also take a look at anything else; maybe I missed something). I think the contention might be on the buffer pool mutex in buf_page_io_complete (buf0buf.cc:6019); at least it appears rather often in 1bp.txt in a couple of different call stacks. Here is the code in question. | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-02-12 ] | ||||||||||||||||||||||||||||||||
|
I ported the changes to 10.5. Given that our benchmark was extreme and that normally the doublewrite buffer would be a scalability bottleneck for write-intensive workloads, I think that it should be an acceptable change. In most cases, a single buffer pool performed at least as well as multiple ones. Removing the code to handle multiple buffer pool instances could slightly improve the overall performance and open up opportunities to make more use of std::atomic (in | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-02-12 ] | ||||||||||||||||||||||||||||||||
|
I did some more testing, checked the buildbot results (no failures), and fixed a bug that was caught in the | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-02-12 ] | ||||||||||||||||||||||||||||||||
|
Based on the feedback of wlad, I reverted changes to some INFORMATION_SCHEMA.INNODB_ tables. We will return a dummy buffer pool identifier 0, for compatibility. | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-03-24 ] | ||||||||||||||||||||||||||||||||
|
The slightly increased contention in buf_page_io_complete() in the write-heavy workloads when moving from 4 buffer pool instances to 1 would likely not be helped by
|