MariaDB Server / MDEV-15058

Remove multiple InnoDB buffer pool instances

Details

    Description

      This came up during the MDEV-15016 review.

      I started to wonder whether multiple InnoDB buffer pools actually help with any workloads. Yes, it probably was a good idea to split the buffer pool mutex when Inaam Rana introduced multiple buffer pools in MySQL 5.5.5, but since then, there have been multiple fixes to reduce contention on the buffer pool mutex, such as Inaam's follow-up fix in MySQL 5.6.2 to use rw-locks instead of mutexes for the buf_pool->page_hash.

      In MySQL 8.0.0, Shaohua Wang implemented one more thing that MariaDB should copy: MDEV-15053 Split buf_pool_t::mutex.

      I think that we should seriously consider removing all code to support multiple buffer pools or page cleaners.
      Should multiple buffer pools be needed in the future (for example, on NUMA machines), it should be designed better from the ground up. Currently the partitioning is arbitrary; buffer pool membership is basically determined by a hash of the page number.

      The description of WL#6642: InnoDB: multiple page_cleaner threads seems to imply that it may have been a mistake to partition the buffer pool.

      Note: partitioning or splitting mutexes often seems to be a good idea. But partitioning data structures or threads might not be.

      axel, please test different workloads with innodb_buffer_pool_instances=1 and innodb_page_cleaners=1, and compare the performance to configurations that use multiple buffer pools (and page cleaners). If using a single buffer pool instance never seems to cause any regression, I think that we should simplify the code.

      Attachments

        1. 1bp.txt
          81 kB
        2. 4bp.txt
          86 kB
        3. MDEV-15058.ods
          82 kB
        4. MDEV-15058.pdf
          56 kB
        5. MDEV-15058-10.4.10.ods
          88 kB
        6. MDEV-15058-10.4vs10.5.ods
          140 kB
        7. MDEV-15058-10.5.ods
          88 kB
        8. MDEV-15058-10.5-34dafb7e3a8.ods
          49 kB
        9. MDEV-15058-10.5-dev.ods
          68 kB
        10. MDEV-15058-B.ods
          77 kB
        11. MDEV-15058-B.pdf
          51 kB
        12. MDEV-15058-RAM-ARM.ods
          77 kB
        13. MDEV-15058-RAM-Intel.ods
          82 kB
        14. MDEV-15058-singleBP.ods
          51 kB
        15. MDEV-15058-SSD-ARM.ods
          79 kB
        16. MDEV-15058-SSD-Intel.ods
          80 kB
        17. MDEV-15058-thiru.ods
          73 kB
        18. MDEV-15058-thiru.pdf
          53 kB
        19. MDEV-15058-tpcc.ods
          45 kB
        20. ramdisk-ro1.svg
          428 kB
        21. ramdisk-ro4.svg
          395 kB
        22. ramdisk-rw1.svg
          782 kB
        23. ramdisk-rw4.svg
          581 kB


          Activity

            marko Marko Mäkelä created issue -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]
            axel Axel Schwenke made changes -
            Status Confirmed [ 10101 ] In Progress [ 3 ]
            axel Axel Schwenke made changes -
            Attachment MDEV-15058.ods [ 45112 ]
            Attachment MDEV-15058.pdf [ 45113 ]
            axel Axel Schwenke added a comment -

            I did a range of tests on two machines.

            1. my main Intel benchmark machine (16 cores, 32 hw threads)
            2. the ARM server (46 cores)

            The MariaDB version used was 10.3.4 (built locally). The benchmark was sysbench in different variations and workloads (see the single sheets).

            The results are a mixed bag. On Intel it looks like multiple BP partitions don't help with performance, certainly not for the INSERT workload. Those are sheets 1 and 2.

            Sheets 3 and 4 are for Intel, sysbench OLTP with a varying percentage of writes. Here it looks like we get small benefits for read-only, but the more writes are done and the more BP partitions we have, the worse things get.

            Sheet 5 is ARM, sysbench OLTP ro/rw and wo. Here we have no clear verdict. It seems that 16 or 32 buffer pools do indeed give a benefit.

            Sheet 6 is not about buffer pool partitions, but AHI partitions. This is quite clear: increasing AHI partitions up to 32 is good for performance. Actually I ran this test first and used 32 for the other tests.

            About the attached files: LibreOffice messes up the conditional formatting of the cells, so I attach the sheet also as PDF. The cells in the "throughput per used core" tables are color-coded. Red means "more than 1% slower than 1 partition", green means "more than 1% faster than 1 partition".

            "Throughput per used core" - this is system throughput (qps) divided by min(benchmark threads, available hw threads). On a perfecly scaling system it would give the same number independent from benchmark thread count.

            axel Axel Schwenke made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            serg Sergei Golubchik made changes -
            Sprint 10.3.5-1 [ 229 ]
            axel Axel Schwenke made changes -
            Attachment MDEV-15058-B.ods [ 45173 ]
            Attachment MDEV-15058-B.pdf [ 45174 ]
            axel Axel Schwenke added a comment -

            Attached new results in MDEV-15058-B.{pdf,ods}.

            Those numbers are for MariaDB 10.3, commit c0d5d7c0.

            Again there are results for Intel (first 4 sheets) and ARM (last sheet). For ARM the situation is rather clear: multiple buffer pools have a slight negative impact on performance. Also clear is the situation for Intel and INSERT-only workload.

            axel Axel Schwenke made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]

            marko Marko Mäkelä added a comment -

            axel, can you please test innodb_buffer_pool_instances=1 with MDEV-15158 and MDEV-15053? Both are present in bb-10.3-thiru, which is based on 10.3 (slightly newer than c0d5d7c0, which you tested).
            To work around MDEV-15246 you will have to revert the problematic commit:

            git fetch origin
            git checkout 52be75a682fb79c0a0d86bf3011aaeeb1f647d64
            git revert bc7a1dc1fbd27e6064d3b40443fe242397668af
            

            It is MDEV-15053 that should reduce buf_pool->mutex contention, thus hopefully ‘entitling’ us to revert back to a single buffer pool.
            MDEV-15158 merely removes the remaining writes to the TRX_SYS page. If you are not using the binlog or Galera, you should not see any effect of that.

            axel Axel Schwenke made changes -
            Attachment MDEV-15058-thiru.ods [ 45238 ]
            Attachment MDEV-15058-thiru.pdf [ 45239 ]
            axel Axel Schwenke added a comment -

            I added two more sheets (again: .ods and .pdf) with numbers for commit 2b97d02 - further referenced as 10.3.5-thiru. They show system and per-core throughput with color coding as before: red cells indicate performance more than 1% worse than a single buffer pool, green cells indicate performance more than 1% better than a single buffer pool.
            I also added two sheets comparing 10.3.4 with 10.3.5-thiru for one buffer pool. For some use cases 10.3.5-thiru is 20% faster. This is more pronounced on the ARM machine (it has more cores) and for workloads with more writes.
            As for the number of buffer pools: for read-only workloads and on Intel I see some benefits in the order of 2% for using multiple buffer pools.

            axel Axel Schwenke made changes -
            Attachment MDEV-15058-tpcc.ods [ 45240 ]
            axel Axel Schwenke added a comment -

            Added numbers for TPC-C - an OLTP-type benchmark. Specifically this is the TPC-C implementation named "HammerDB". The workload includes reads and writes and gives two numbers: system TPM = system throughput in transactions per minute and NOPM = new orders per minute. The NOPM number is comparable between different databases; the TPM number is specific to a DBMS.
            As for the results: with 2 or 4 buffer pools, not much changes. With 16 buffer pools there is a visible decrease in performance.
            This was run on Intel (no HammerDB for ARM) and with the 10.3.5-thiru build.

            axel Axel Schwenke made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]

            marko Marko Mäkelä added a comment -

            Thank you, axel!

            Since the 10.3.4 release there have been quite a few performance improvements, most notably MDEV-15104, MDEV-15158, MDEV-15132, MDEV-15059.
            In MDEV-15058-tpcc.ods, a single buffer pool seems to generally win for lower concurrency, and the biggest improvement with multiple buffer pools is at most 1%. There is only one outlier: with 1 user, 4 buffer pool instances seem to give a 6.7% improvement.

            Similarly, with MDEV-15058-thiru.pdf it is clear that 4 buffer pool instances is the sweet spot for the test hardware, giving up to 4.5% improvement with 1 client connection. With more connections, the difference goes down, but not completely.

            Meanwhile, there is the separate branch bb-10.3-MDEV-15053 (commit 24deb3737f60c8aea1781c4dd244322f0066b197 based on the 10.3 commit 988ec800edb3dd9238b6f3948157d21bdb0c083b). I believe that its performance should be similar to 10.3.5-thiru (commit 2b97d026623d1928ce61752ef13ca1c7fb77b4e7) which was tested.

            I have the feeling that removing the code to deal with multiple buffer pools might not improve performance enough to beat innodb_buffer_pool_instances=4 in write-heavy benchmarks on the test system. Write performance could be improved by changing the page flushing algorithms and data structures.


            marko Marko Mäkelä added a comment -

            Removing multiple buffer pools necessarily means removing multiple page cleaner threads as well.
            Also, InnoDB (and MariaDB) would no longer depend on libnuma.
            I did a quick refactoring (based on the latest 10.3 and MDEV-15053) and pushed to bb-10.3-MDEV-15058 for review (by thiru) and benchmarking.

            marko Marko Mäkelä added a comment - - edited

            I created another version of this patch. It would be interesting to benchmark it against the base version, to see if we really need MDEV-15053 in order to remove multiple buffer pools.
            bb-10.3-MDEV-15058-2 against "plain 10.3"
            bb-10.3-MDEV-15058 also contains MDEV-15053

            serg Sergei Golubchik made changes -
            Sprint 10.3.5-1 [ 229 ]
            axel Axel Schwenke made changes -
            Attachment MDEV-15058-singleBP.ods [ 45306 ]
            axel Axel Schwenke added a comment -

            I added a new spread sheet MDEV-15058-singleBP. This compares MariaDB 10.3.5 with the two buildbot trees: bb-10.3-mdev-15058 and bb-10.3-mdev-15058-2. The benchmark is sysbench OLTP. There are two sheets for two different architectures:

            1. my Intel machine (16 cores, 32 hw threads). This has 10.3.5 numbers only for 4 buffer pools. Here all 3 contenders behave very much the same for read-only or read-mostly workloads. For the write-intensive workload, 10.3.5 is fastest.

            2. the ARM server (46 cores). Here I ran the test for 1..64 buffer pools. Again, for read-only and read-write there are only small differences, and the number of buffer pools doesn't matter much. For the write-only workload two buffer pools give the best results for 10.3.5, but the bb-10.3-mdev-15058 tree performs equally well or better.


            marko Marko Mäkelä added a comment -

            axel, the latest benchmarks seem to be an "approval" to go ahead and remove multiple buffer pools.

            However, I wonder how big the workload size is compared to the buffer pool size.

            I would like to ask for one more set of write-heavy benchmarks, comparing to multiple buffer pools, and the buffer pool size being about 10% of the workload size. The minimum for using multiple innodb_buffer_pool_instances is innodb_buffer_pool_size=1g. The writes could be updates with uniformly distributed keys, so that a large number of pages get dirtied, instead of dirtying only a few pages.

            The reason for this kind of benchmark is that the buffer pool mutex(es) are acquired mostly when pages are flushed, evicted or loaded to the buffer pool. If the workload mostly fits in the buffer pool, then there should not be that much load on the buffer pool.

            There is a concern that with a single buffer pool and a single page cleaner (there cannot be more page cleaner threads than buffer pool instances) the flushing rate would be too small to saturate an SSD, and this could limit the performance of write-heavy workloads.


            laurynas Laurynas Biveinis added a comment -

            Consider separating LRU flushing from flush list flushing, which are not really related one to another anyway: https://www.percona.com/blog/2016/05/05/percona-server-5-7-multi-threaded-lru-flushing/ This way with a single instance you'd have two flusher threads.

            If their CPU priority is high, flushing should never be bounded by a lack of threads.
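
            As a rough illustration of that split (assumed structure only, not Percona's or InnoDB's actual code): one buffer pool instance served by two dedicated flusher threads.

            #include <atomic>
            #include <chrono>
            #include <thread>

            static std::atomic<bool> running{true};

            // Keeps free pages available by flushing/evicting from the tail of the LRU list.
            static void lru_flusher()
            {
              while (running.load())
              {
                // ... flush dirty pages from the LRU tail, move clean ones to the free list ...
                std::this_thread::sleep_for(std::chrono::milliseconds(100));
              }
            }

            // Advances the checkpoint by flushing pages in oldest-modification (LSN) order.
            static void flush_list_flusher()
            {
              while (running.load())
              {
                // ... flush pages from the flush list ...
                std::this_thread::sleep_for(std::chrono::milliseconds(100));
              }
            }

            int main()
            {
              std::thread lru(lru_flusher), fl(flush_list_flusher);
              std::this_thread::sleep_for(std::chrono::seconds(1));
              running.store(false);
              lru.join();
              fl.join();
              return 0;
            }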

            axel Axel Schwenke made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            axel Axel Schwenke made changes -
            Attachment MDEV-15058-RAM-Intel.ods [ 45344 ]
            axel Axel Schwenke added a comment - - edited

            New results added: MDEV-15058-RAM-Intel.ods

            This is with a completely different setup:

            • datadir in /dev/shm
            • 1G buffer pool, 10G data set
            • sysbench OLTP with 50, 80, 100% writes
            • max dirty percent = 99, dblwrite disabled

            Now there is a significant regression when switching to a single buffer pool. The number of page cleaners also has an impact, but a much smaller one. The experimental "single buffer pool" trees behave better than 10.3.5 with a single buffer pool, but are still significantly slower than 10.3.5 with 4 buffer pools.

            axel Axel Schwenke made changes -
            Attachment MDEV-15058-RAM-Intel.ods [ 45344 ]
            axel Axel Schwenke made changes -
            Attachment MDEV-15058-RAM-Intel.ods [ 45345 ]
            axel Axel Schwenke made changes -
            Attachment MDEV-15058-RAM-ARM.ods [ 45346 ]
            axel Axel Schwenke added a comment -

            Attached results for ARM and the same setup (/dev/shm, write-heavy). Results are very similar to those for Intel. For 100% writes the number of page cleaners has some impact on performance, too.

            inaamrana Inaam Rana added a comment -

            Marko, I believe we won't only need multiple instances but also multiple background threads for flushing.

            We can think in terms of a single LRU flusher and a single page_cleaner. The code as it is written right now will serially go through each instance and do a batch (LRU batches are chunkized, though). Imagine if the last instance is the one most in need of flushing. Therefore it might make sense to have a configurable number of background flushing threads.
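
            A rough stand-alone sketch of that serial walk (hypothetical names, not the actual page cleaner code):

            #include <algorithm>
            #include <cstddef>
            #include <vector>

            struct buf_pool_stub { std::size_t n_dirty; };

            // Flush up to 'quota' dirty pages from one instance (placeholder for a real batch).
            static std::size_t flush_batch(buf_pool_stub& pool, std::size_t quota)
            {
              std::size_t n = std::min(pool.n_dirty, quota);
              pool.n_dirty -= n;
              return n;
            }

            // One cleaner pass: every instance is visited in order, so the instance that
            // needs flushing the most (here the last one) is only served after all others.
            static void page_cleaner_pass(std::vector<buf_pool_stub>& pools, std::size_t quota)
            {
              for (buf_pool_stub& pool : pools)
                flush_batch(pool, quota);
            }

            int main()
            {
              std::vector<buf_pool_stub> pools{{1000}, {10}, {10}, {5000}};
              page_cleaner_pass(pools, 100);
              return 0;
            }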

            With multiple threads we might want to tweak some other bits of code as well:

            • Try to avoid two threads working on the same instance, i.e. page_cleaner on the flush_list and lru_thread on the LRU list, as that will probably make buf_pool::mutex hotter.
            • Maybe have some way of knowing which instance is in urgent need of flushing and prioritize accordingly
            • We'll have multiple threads writing to the same doublewrite buffer. Perhaps it makes sense to have a separate dblwr buffer per instance
            • When dblwr is disabled, after posting one batch we call os_aio_wait_until_no_pending_writes() in buf_dblwr_sync_datafiles(). If we have multiple threads working we might get unnecessarily delayed by wait_until_no_pending_writes.
            • When dblwr is disabled we can shove a lot of IO requests into a single batch. The aio_array has a limited number of segments (one per IO thread) and each can take up to 256 pending IO requests. If we are using simulated AIO we call os_aio_simulated_wake_handler_threads() at the end of a batch. A batch can be quite large, and there can be multiple batches happening concurrently. I think it makes more sense to call os_aio_simulated_wake_handler_threads() more frequently (or have a mechanism inside os0file.cc where IO threads are woken up before we run out of slots)
            • See if the idea mentioned here is practicable: https://bugs.mysql.com/bug.php?id=74637

            laurynas Laurynas Biveinis added a comment -

            Inaam, Marko, the last reply prompts me to market our MT flusher / parallel doublewrite some more.

            Our design addresses the first three items: 1) each thread has its own private lru/flush list; 2) each thread has its own independent heuristics on when to flush; 3) parallel doublewrite.

            I believe it takes care of the issues in bug 74637 as well.

            wlad Vladislav Vaintroub added a comment - - edited

            I'm also reminded that I wanted to know more about the story of Linux AIO with segments per thread. The MariaDB design on Windows is (almost) such that any thread can take any IO completion; the segments do not play any role. (More accurately, there are actually 2 IO pools, one for read and one for write requests, and that worked around some deadlock I've seen in the past.)
            But I feel like getting rid of segments might improve scalability on Linux, like it did when MariaDB on Windows switched from prehistoric Win95-style async IO to the NT3.5-style one with a completion port.

            axel Axel Schwenke made changes -
            Attachment MDEV-15058-SSD-ARM.ods [ 45354 ]
            axel Axel Schwenke made changes -
            Attachment MDEV-15058-SSD-Intel.ods [ 45355 ]
            axel Axel Schwenke added a comment -

            Hi. I attached results for the datadir on SSD. The test was run on Intel and ARM again. Observations:

            ARM: the system is clearly IO-bound. While it does have an SSD, it's a rather slow one and it holds everything. iostat reports ~30% CPU time spent in iowait and 100% utilisation for the disk.

            Intel: the numbers are very similar to the ones for the datadir in RAM, just that the differences are a bit smoother. A difference that should be noted is that now the number of page cleaners has more impact. Check the diagrams at the bottom left (4 buffer pools, 1/2/4 page cleaner threads). iostat shows up to 45% iowait and the SSDs reach 98% utilisation. Unlike the ARM system, the Intel system has a dedicated SSD for the datadir, and it's actually two units in RAID-0.

            axel Axel Schwenke made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            marko Marko Mäkelä added a comment - - edited

            log_flush_order_mutex, which is shared between the flush_lists of the multiple buffer pool instances, may still be needed.

            If we removed this mutex, we would have to protect the buf_pool.flush_list with the log_sys.mutex, prolonging the hold time of that mutex. This might be acceptable, given that we only need to touch the flush_list when a page is modified for the first time. Most of the time, we could release the log_sys.mutex early.

            Either way, we would seem to need mtr_t::is_dirty() and mtr_t::is_block_dirtied().
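
            A stand-alone model of that ordering constraint (illustrative names only, not the actual mtr0mtr.cc code; it assumes the usual rule that blocks enter buf_pool.flush_list in increasing oldest-modification LSN order):

            #include <cstdint>
            #include <list>
            #include <mutex>

            struct block { uint64_t oldest_modification = 0; };

            static std::mutex log_sys_mutex;          // protects the log and the current LSN
            static std::mutex log_flush_order_mutex;  // serializes flush_list insertions
            static std::list<block*> flush_list;      // newest modifications at the front
            static uint64_t current_lsn = 0;

            // On mini-transaction commit: append redo log, then add a newly dirtied block
            // to the flush list. Taking log_flush_order_mutex before releasing
            // log_sys_mutex keeps the list sorted without holding log_sys_mutex for the
            // whole insertion.
            static void mtr_commit(block& b, uint64_t redo_len)
            {
              std::unique_lock<std::mutex> log_lock(log_sys_mutex);
              const uint64_t start_lsn = current_lsn;
              current_lsn += redo_len;                       // "write" the redo log
              std::lock_guard<std::mutex> order_lock(log_flush_order_mutex);
              log_lock.unlock();                             // release log_sys_mutex early
              if (b.oldest_modification == 0)                // only on the first modification
              {
                b.oldest_modification = start_lsn;
                flush_list.push_front(&b);                   // inserts happen in LSN order
              }
            }

            int main()
            {
              block b1, b2;
              mtr_commit(b1, 100);
              mtr_commit(b2, 200);
              return 0;
            }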

            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.4 [ 22408 ]
            Assignee Axel Schwenke [ axel ] Marko Mäkelä [ marko ]
            danblack Daniel Black made changes -
            anjumnaveed81 Anjum Naveed added a comment -

            I did test the impact of multiple buffer pool instances sometime back. The purpose was different so I will need to rerun the tests for reporting purposes. I found the bottleneck was the doublewrite buffer. When doublewrite buffer was turned off, multiple buffer pool instances resulted in improvement on Intel as well as ARM system. I did not modify any code so I cannot comment on flushing mechanism.

            In addition to updating the flushing mechanism, I am in strong support of the suggestion from Inaam Rana "We'll have multiple threads writing to same doublwrite buffer. Perhaps makes sense to have separate dblwr buffer per instance".

            ralf.gebhardt Ralf Gebhardt made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            anjumnaveed81 Anjum Naveed added a comment -

            Dear all,

            I have done the following (code base is version 10.5):
            1- Created a separate file for the doublewrite buffer, independent of the trx_sys tablespace. This file takes an independent directory path so that I can move it around.
            2- The number of doublewrite buffer files is tied to the number of buffer pool instances, with one buf_dblwr data structure (and hence mutex) for each instance.
            3- A small modification in flush_buffered_writes. Instead of flushing in a loop over instances, I provide the instance as an argument so that each doublewrite buffer file can be written independently, rather than in sequence. (This is then controlled by the page cleaner threads.)
            4- Moved the doublewrite folder to RAMDISK (in actual systems it should be in a separate disk at least)
            5- I have intentionally left page cleaner code unchanged because I wanted to see the impact of multiple buffer pool instances coupled with multiple files to write into. (More on this in a minute).

            At present I am testing on my development laptop, so I do not have the liberty to move the doublewrite buffers to a separate disk, hence the RAMDISK, which will provide an upper-bound improvement.
            I have used hammerdb tpcc test. I have tested 4 buffer pool instances vs 1 buffer pool instance of modified code and original code. When I use 4 buffer pool instances with 4 page cleaner threads, I get about 19% improvement over the unmodified code. On the other hand, when I use single buffer pool instance, improvement is 33%. (Keep in mind that doublewrite buffer files are in RAMDISK).

            If I use an actual hard drive for the doublewrite buffer files, I will still get an improvement, although not as much. As far as this test is concerned, it supports the use of a single buffer pool instead of multiple buffer pool instances. However, looking at the code, I believe it is the page cleaner and the way pages are being distributed to instances that is holding the performance down, and not the actual use of multiple files and buffer pool instances.

            I still believe multiple buffer pool instances are the way to go, especially for large systems, and we need to improve the things around buffer pools (specifically the page cleaner code). Please suggest whether more time should be spent in this direction OR we have already decided to use a single buffer pool.

            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            danblack Daniel Black made changes -

            marko Marko Mäkelä added a comment -

            anjumnaveed81, I am sorry for missing your updates until now. I would love to see your changes, preferably in the form of a pull request against the current 10.5 branch in https://github.com/MariaDB/server/.

            I think that your work fits under MDEV-16526, whose objectives include removing scalability bottlenecks of a single buffer pool instance.


            marko Marko Mäkelä added a comment -

            axel, can you please test a single buffer pool instance vs. multiple buffer pools in a write-heavy workload on the latest 10.5? Maybe things have improved since the previous test and we do not need to wait for MDEV-16526? Remember to test with the doublewrite buffer disabled, because that is an obvious bottleneck.

            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Axel Schwenke [ axel ]
            axel Axel Schwenke made changes -
            Attachment MDEV-15058-10.4.10.ods [ 50230 ]
            Attachment MDEV-15058-10.4vs10.5.ods [ 50231 ]
            Attachment MDEV-15058-10.5.ods [ 50232 ]
            axel Axel Schwenke added a comment -

            Attached 3 new spread sheets with results for 10.5.0 and 10.4.10. After seeing the numbers for 10.5.0 I decided to run the workload with latest 10.4 for comparison. In write-heavy benchmarks (see the comparative spread sheet) 10.5 is up to 40% slower than 10.4. Also there are some anomalies like very poor performance with a single buffer pool, getting back to normal with multiple pools.
            I'm afraid those numbers must be taken with a big spoonful (not just a grain) of salt. The CPU usage numbers taken during the benchmark runs suggest that a lot of time is spent in mutex waits. I would not conclude anything from those numbers without looking at detailed mutex stats (which were unfortunately not monitored during that benchmark run).


            marko Marko Mäkelä added a comment -

            Thank you, axel! I discussed this briefly with wlad (who benchmarked MDEV-16264 and claimed that with a single buffer pool, the flushing performance was actually better).

            MDEV-15058-10.5.ods Sheet 7 "OLTP, 32 tables, uniform rng, small BP, no-dblwrite, SSD" is for 50%, 80% and 100% write ratio. For smaller write ratio, it suggests that 4 buffer pools are giving the optimal performance. The difference between 1 and 4 buffer pool instances is 10.1% for 50% writes and 5.7% for 100% writes. I would expect the difference to be even bigger with 0% write ratio. The figures seem to suggest that page flushing is not a bottleneck at all.

            It seems that we could have a bottleneck for things that affect page lookups or page replacement (eviction and reloading) in the buffer pool. With multiple buffer pools, we have a partitioning function:

            inline buf_pool_t* buf_pool_get(const page_id_t page_id)
            {
                    /* 2log of BUF_READ_AHEAD_AREA (64) */
                    ulint		ignored_page_no = page_id.page_no() >> 6;
             
                    page_id_t	id(page_id.space(), ignored_page_no);
             
                    ulint		i = id.fold() % srv_buf_pool_instances;
             
                    return(&buf_pool_ptr[i]);
            }
            

            The first hash function is:

            	ulint fold() const { return (m_space << 20) + m_space + m_page_no; }
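
            For illustration, a stand-alone sketch (assumed tablespace id and instance count) of how the two functions above spread consecutive 64-page extents over buffer pool instances:

            #include <cstdio>

            // Same formula as fold() above.
            static unsigned fold(unsigned space, unsigned page_no)
            {
              return (space << 20) + space + page_no;
            }

            int main()
            {
              const unsigned n_instances = 4;   // srv_buf_pool_instances
              const unsigned space = 5;         // arbitrary tablespace id
              for (unsigned page_no = 0; page_no < 8 * 64; page_no += 64)
              {
                // buf_pool_get(): the low 6 bits (the 64-page read-ahead area) are ignored
                const unsigned instance = fold(space, page_no >> 6) % n_instances;
                std::printf("page %4u -> buffer pool instance %u\n", page_no, instance);
              }
              return 0;
            }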
            

            If we are lucky, there might be a trivial bottleneck on buf_pool->page_hash that maps page numbers into block descriptors, specifically on the partitioned rw-latch that protects it:

            hash_lock = hash_get_lock(buf_pool->page_hash, page_id.fold());
            

            The rw_lock_t for the page hash table is partitioned into srv_n_page_hash_locks (default 16).
            I would suggest applying the following patch against the latest 10.5 (at least 3a3605f4b1ad08bbcb823cd41b724f2def9f2ba3):

            diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
            index 17e3d1fa968..234271047a6 100644
            --- a/storage/innobase/handler/ha_innodb.cc
            +++ b/storage/innobase/handler/ha_innodb.cc
            @@ -1,3 +1,4 @@
            +#define UNIV_PERF_DEBUG
             /*****************************************************************************
             
             Copyright (c) 2000, 2019, Oracle and/or its affiliates. All Rights Reserved.
            

            This will expose the parameter innodb_page_hash_locks (default 16, ranging from 1 to 1024). I would suggest rerunning the benchmark with innodb_page_hash_locks=64, with 0%, 50%, 80% and 100% writes.

            I would also suggest compiling with cmake -DPLUGIN_PERFSCHEMA=NO to see a ‘more pure’ performance difference. This benchmark should only need to compare the latest 10.5 development snapshot, with different innodb_buffer_pool_instances, on a buffer pool that is smaller than the table.

            axel Axel Schwenke made changes -
            Attachment MDEV-15058-10.5-dev.ods [ 50267 ]
            axel Axel Schwenke added a comment -

            I attached a new sheet, MDEV-15058-10.5-dev.ods. It compares the workload from former sheet 7 (small BP, SSD) for 10.5.0 (as before), 10.5 head (commit cc3135cf), and 10.5 head + innodb_page_hash_locks=64 (PERF_DEBUG patch).

            It now seems that more than 4 buffer pools are not good performance-wise.

            wlad Vladislav Vaintroub added a comment - - edited

            There are some big dips in Sheet #3, for example at 50% writes:

            64 7214 13638 18248 9020 15435 6636
            128 41 7499 17 16 4153 15603

            and for other N% writes, too. What is it?


            marko Marko Mäkelä added a comment -

            I had a word with axel. He noted that earlier supposedly read-only benchmarks were accidentally read/write. Based on performance_schema output that he shared, there were some unexpected anomalies. I suggested retesting with innodb_stats_persistent=OFF and innodb_change_buffering=none to make the benchmark more deterministic.

            axel Axel Schwenke made changes -
            Attachment MDEV-15058-10.5-34dafb7e3a8.ods [ 50309 ]
            axel Axel Schwenke added a comment - - edited

            New numbers attached in MDEV-15058-10.5-34dafb7e3a8.ods

            This test uses a fresh build from the 10.5 branch and a changed InnoDB configuration. The biggest difference comes from innodb_file_per_table=0. I have also set skip-innodb_adaptive_hash_index, skip-innodb-stats-persistent and innodb-change-buffering=none.

            Now the numbers are quite smooth, with steady throughput during the benchmark runtime. The bottleneck is now the disk: read-only reads ~860MB/s from the disk and CPU usage is 19% user, 9% system, 72% iowait. Read-write reads ~300MB/s and writes ~270MB/s from/to disk at 6.5% user, 3.5% system, 50% iowait and 40% idle.

            I reconfigured everything to put the datadir into a RAM disk (/dev/shm). This dramatically increases the throughput. Read-only now uses CPU at 76% user, 24% system. Read-write is 63% user, 27% system, 10% idle.

            Using multiple buffer pools with the datadir on SSD has little impact on RO performance but a visible impact on RW performance, with the optimum at 4 BP. With the datadir in RAM, multiple buffer pools increase both RO and RW performance. Performance increases from 1 over 2 to 4 BP and then stays stable at higher BP numbers (tested up to 32).

            marko Marko Mäkelä added a comment - - edited

            axel, thank you! With those parameters and changes to the benchmark, we avoid hitting the following bottlenecks, which are independent of the number of InnoDB buffer pool instances:

            • ENGINE=Aria operations on internal temporary tables: caused by range scans, maybe due to occasionally wrong statistics leading to suboptimal query plans?
            • InnoDB adaptive hash index: it sometimes helps, sometimes hurts performance (see MDEV-17492)
            • fil_system.mutex contention due to writing MLOG_FILE_NAME records when an .ibd file is first modified since a log checkpoint: should be improved when MDEV-14425 introduces a separate log file for checkpoints and file operations. Because log checkpoints are triggered ‘randomly’, so will these contentions. During these operations, we are not holding any buf_pool mutexes.
            • InnoDB persistent statistics collection could kick in randomly, triggering large index scans, affecting concurrent workload.
            • InnoDB change buffering could cause a lot of extra I/O, for avoiding one read. Maybe it is not at all useful with SSD nowadays.

            Were the last runs with the default value of innodb_page_hash_locks=16?

            Edit: Because the page flushing seems optimal at innodb_buffer_pool_instances=4, it looks like we may have to run new benchmarks after MDEV-16526 and MDEV-21534 have been completed.

            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            axel Axel Schwenke made changes -
            Attachment ramdisk-ro1.svg [ 50318 ]
            Attachment ramdisk-ro4.svg [ 50319 ]
            Attachment ramdisk-rw1.svg [ 50320 ]
            Attachment ramdisk-rw4.svg [ 50321 ]
            axel Axel Schwenke added a comment - - edited

            I attached CPU Flame Graphs (http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html) for the following 4 scenarios:

            1. single BP, OLTP read-only (ramdisk-ro1.svg)
            2. single BP, OLTP read-write, 80% writes (ramdisk-rw1.svg)
            3. 4 BP, OLTP read-only (ramdisk-ro4.svg)
            4. 4 BP, OLTP read-write, 80% writes (ramdisk-rw4.svg)

            In all cases the datadir was in memory (/dev/shm) and there were 32 benchmark threads running. That should be the sweet spot, as the hardware can run 32 concurrent (hyper)threads.
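Such graphs are typically generated along these lines with perf and the FlameGraph scripts from the linked page (a sketch; the sampling frequency, duration and script paths are assumptions, not the exact commands used for the attached files):

# sample on-CPU stacks of the whole machine for 60 s while the benchmark is running
perf record -F 99 -a -g -- sleep 60
# fold the stacks and render an interactive SVG
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu-flamegraph.svg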


            marko Marko Mäkelä added a comment -

            Between ramdisk-rw1.svg and ramdisk-rw4.svg, the major difference seems to be that in the page fault handling for ha_innobase::records_in_range(), buf_LRU_get_free_block(buf_pool_t*) takes much longer with a single buffer pool than with 4 buffer pool instances.
            Can we get a perf annotate or similar report for this function and its callees, to highlight the difference in more detail?
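Something along these lines should produce that report (a sketch; the pid lookup and symbol-name handling are assumptions, and mysqld debug symbols need to be installed):

# record ~60 s of call-graph samples from the running server
perf record -g -p $(pidof mysqld) -- sleep 60
# browse the profile; pressing 'a' on a symbol opens the annotated source/assembly
perf report
# or annotate the suspected function directly (the name may need to match the demangled form shown by perf report)
perf annotate buf_LRU_get_free_block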


            marko Marko Mäkelä added a comment -

            wlad, can you please try to figure out the bottleneck? I wonder if it could be related to buf_pool->LRU_old in some way.

            marko Marko Mäkelä made changes -
            Assignee Axel Schwenke [ axel ] Vladislav Vaintroub [ wlad ]
            wlad Vladislav Vaintroub made changes -
            Attachment 1bp.txt [ 50466 ]
            Attachment 4bp.txt [ 50467 ]

            wlad Vladislav Vaintroub added a comment -

            So, I ran this benchmark, which I think resembles axel's "sweet spot" closely enough.

            my.cnf

            [mysqld]
             
            #####non innodb options
            max_connections = 300
            table_open_cache = 600
            query_cache_type = 0
             
            #####innodb options
            innodb_buffer_pool_size = 1G
            innodb_log_buffer_size = 32M
            innodb_log_file_size = 512M
            innodb_flush_log_at_trx_commit = 2
            innodb_doublewrite = 0
             
            loose-innodb_adaptive_hash_index_partitions = 32
            loose-innodb_adaptive_hash_index_parts = 32
             
            #####SSD
            innodb-flush-method = O_DIRECT
            innodb_io_capacity = 4000
            loose-innodb_flush_neighbors = 0
            innodb_write_io_threads = 8
             
            #####the variables for this test
            innodb_buffer_pool_instances = 1
             
            innodb_max_dirty_pages_pct = 99
            skip-innodb_adaptive_hash_index
            skip-innodb-stats-persistent
            innodb-change-buffering=none
            innodb_file_per_table = 0
            

            script to run with sysbench 1.0

            sysbench --test=/usr/share/sysbench/oltp_update_index.lua   --tables=32 --table-size=1250000  --rand-seed=42 --rand-type=uniform --num-threads=32 --report-interval=2  --mysql-socket=/tmp/mysql.sock --time=300  --max-requests=0 --mysql-user=root --percentile=95 $1
            

            where $1 is either "prepare" or "run" (you need to have a database called sbtest)
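In other words, the sequence is roughly the following, assuming the sysbench command above is saved in a wrapper script (bench.sh is a hypothetical name):

mysql -uroot -e 'CREATE DATABASE IF NOT EXISTS sbtest'
./bench.sh prepare   # load 32 tables with 1.25M rows each
./bench.sh run       # 300-second measurement run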

            A note on the benchmark itself: it uses a very low buffer pool to data size ratio (I believe the data would be around 8-10 GB if it were in separate files rather than in ibdata1) and only a 1 GB buffer pool, so it is designed to be I/O intensive. It uses only about 2 of the 56 CPUs on the benchmark machine (and the difference between 1 and 4 buffer pools was not obvious in "top").

            For the benchmarks, I ran the server with innodb_buffer_pool_instances set to either 1 or 4.

            4 buffer pools win against 1 buffer pool, with about 9000 tps versus around 6000 tps; at least we can say that whatever Axel found is reproducible for this use case.

            I attached the pt-pmp output 1bp.txt (single buffer pool instance) and 4bp.txt (4 instances), made with 20 samples separated by a 10-second delay (if someone knows a more modern tool for profiling contention, please tell).

            From that, I grepped for TTAS to find the lines with InnoDB mutexes (but please also take a look at anything else; maybe I missed something).
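That grep boils down to something like the following (a sketch; 'TTAS' matches the InnoDB test-and-set mutex frames, e.g. TTASEventMutex, in the aggregated stacks):

# count how many aggregated stack lines mention an InnoDB TTAS mutex
grep -c TTAS 1bp.txt 4bp.txt
# show the first matches with their line numbers for closer inspection
grep -n TTAS 1bp.txt | head -n 20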

            I think the contention might be on the buffer pool mutex in buf_page_io_complete() (buf0buf.cc:6019); at least it appears rather often in 1bp.txt in a couple of different call stacks. Here is the code in question:
            https://github.com/MariaDB/server/blob/f3dac591747dfbd88bd8ae2855f9a0e64006ce75/storage/innobase/buf/buf0buf.cc#L6019


            marko Marko Mäkelä added a comment -

            I ported the changes to 10.5. Given that our benchmark was extreme, and that normally the doublewrite buffer would be a scalability bottleneck for write-intensive workloads, I think that it should be an acceptable change. In most cases, a single buffer pool performed at least as well as multiple ones. Removing the code to handle multiple buffer pool instances could slightly improve the overall performance and open up opportunities to make more use of std::atomic (in MDEV-15053).

            marko Marko Mäkelä made changes -
            Assignee Vladislav Vaintroub [ wlad ] Marko Mäkelä [ marko ]
            marko Marko Mäkelä made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            marko Marko Mäkelä added a comment - - edited

            I did some more testing, checked the buildbot results (no failures), and fixed a bug that was caught in the MDEV-12353 branch only. Pushed to 10.5.

            marko Marko Mäkelä made changes -
            issue.field.resolutiondate 2020-02-12 12:59:27.0 2020-02-12 12:59:27.063
            marko Marko Mäkelä made changes -
            Fix Version/s 10.5.1 [ 24029 ]
            Fix Version/s 10.5 [ 23123 ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]

            marko Marko Mäkelä added a comment -

            Based on the feedback of wlad, I reverted changes to some INFORMATION_SCHEMA.INNODB_ tables. We will return a dummy buffer pool identifier 0, for compatibility.

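So a query such as the following keeps working, but the pool identifier column is now always 0 (a sketch; column list abbreviated):

mysql -e "SELECT pool_id, pool_size, free_buffers, database_pages FROM information_schema.innodb_buffer_pool_stats"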
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            marko Marko Mäkelä added a comment -

            The slightly increased contention in buf_page_io_complete() in the write-heavy workloads when moving from 4 buffer pool instances to 1 would likely not be helped by MDEV-15053. That function would still acquire buf_pool_t::mutex (which was renamed to buf_pool_t::LRU_list_mutex).

            MDEV-15053 did not show improved performance when wlad was testing it. To keep the latching rules more understandable and to avoid race conditions, it might be best to omit most of those changes.

            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 85121 ] MariaDB v4 [ 133448 ]
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            marko Marko Mäkelä made changes -

            People

              Assignee: Marko Mäkelä (marko)
              Reporter: Marko Mäkelä (marko)
              Votes: 0
              Watchers: 18

