[MDEV-16168] Performance regression on sysbench write benchmarks from 10.2 to 10.3 Created: 2018-05-14 Updated: 2020-11-17 Resolved: 2020-11-17 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.3 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Vladislav Vaintroub | Assignee: | Marko Mäkelä |
| Resolution: | Won't Fix | Votes: | 1 |
| Labels: | None | ||
| Attachments: | |
| Issue Links: | |
| Description |
|
I noticed reduced performance when running a somewhat large update_no_index sysbench benchmark, comparing 10.2 to 10.3.
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
The update_no_index test ran on a single large table (50 million rows), with the number of users ranging from 1 to 4096 in powers of 2. Loading the table, sysbench prepare:
sysbench run
where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096. The test takes a nap of 80 seconds between runs, and also between the load and the first test; as far as I can see, that is enough for all InnoDB background I/O activity to finish (MB/sec goes down to 0 on the disk where the test is run). With innodb-flush-log-at-trx-commit at its default (1), the test does not appear to be very CPU-bound (sysbench and mysql use 70% of all available 16 CPUs in the Windows measurement, i.e. over 10 CPUs are busy). Below are the results from the runs, for 10.2.14 and a recent 10.3.7
(Also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0.) I tried some initial profiling (based on CPU sampling), and this is what showed up. Below, in all places, *baseline* refers to 10.2.14 and *comparison* is 10.3.7. Exclusive samples comparison:
In short, more busy wait in ut_delay, less lazy wait in SleepConditionVariableCS. There is a new, quite visible function (now the 4th most expensive individual function), l_find, which seems to come from replacing the STL with a custom hash table in InnoDB.
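The benchmark loop described above can be sketched as a small driver script. This is a hedged reconstruction: the exact sysbench command lines were not preserved in this export, so the script name (`oltp_update_non_index`), the option names, and the run time below are assumptions; the script prints the planned commands rather than executing them, so it can be inspected without a server.

```shell
#!/bin/sh
# Hypothetical reconstruction of the benchmark driver described above.
# One run per concurrency level (powers of two, 1..4096), with an
# 80-second nap after each run to let InnoDB background I/O drain.
TABLE_SIZE=50000000   # "50 mio rows"
for N in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096; do
  echo "sysbench oltp_update_non_index run --threads=$N --table-size=$TABLE_SIZE --time=300"
  echo "sleep 80  # let InnoDB background I/O drain between runs"
done
```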
|
| Comments |
| Comment by Vladislav Vaintroub [ 2018-05-14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Assigning to marko, the default InnoDB guru. svoj, you can jump in if you want to repeat it, or have any idea how to fix it. axel, same for you. |
| Comment by Marko Mäkelä [ 2018-05-15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Given that log_sys.mutex appears to be one of the contention points, I think that we should keep track of the average volume of redo log written per write transaction; more redo log writes should lead to more contention. There are two features in MariaDB 10.3 that could increase the writes. Also, some mini-transactions for writing undo log were changed in the hope of simplifying crash recovery; one of the changes was essentially reverted. I would like to compare how these fields in SHOW ENGINE INNODB STATUS proceed:
We should record both counts at the start and at the end of the test, then calculate the difference and the ratio. For example, if these counts were 34 and 1642672 at the start and 50 and 2000000 at the end of the test, the ratio would be (2000000-1642672)/(50-34) = 22333 bytes per transaction. If MariaDB 10.3 is generating significantly more redo log than 10.2, that could explain the increased log_sys.mutex contention. If purge threads are using more CPU than earlier, that could be due to svoj’s work of removing trx_sys.mutex and employing some lock-free data structures. It could be useful to test a 10.3 version from right before the switch to lock-free data structures. |
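The ratio described above can be computed mechanically from two snapshots of the counters. A minimal sketch, using the example numbers from the comment (in a real run, the four values would be read from SHOW ENGINE INNODB STATUS before and after the benchmark):

```shell
# Redo bytes written per transaction between two snapshots of the
# "Trx id counter" and "Log sequence number" fields.
trx0=34;  lsn0=1642672    # counters at the start of the test
trx1=50;  lsn1=2000000    # counters at the end of the test
awk -v t0="$trx0" -v t1="$trx1" -v l0="$lsn0" -v l1="$lsn1" \
    'BEGIN { printf "%d bytes per transaction\n", (l1 - l0) / (t1 - t0) }'
# prints: 22333 bytes per transaction
```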
| Comment by Marko Mäkelä [ 2018-05-15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
There was also this change to ut_delay() and some related discussion. On the Intel Skylake microarchitecture, the PAUSE instruction (REPZ NOP) takes more wall-clock time than on earlier ones. Maybe we are now burning more CPU on older microarchitectures? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Black [ 2018-05-15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
ref ut_delay: I think I saw svoj test PTHREAD_MUTEX_ADAPTIVE_NP and it wasn't good. Maybe some rethinking of the adaptive delay in MariaDB, towards a mean/standard-deviation approach per mutex rather than a straight mean, would help. We're wasting heaps of cycles in ut_delay, so a little extra maths isn't going to hurt too badly. |
| Comment by Vladislav Vaintroub [ 2018-05-15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
marko, I did some calculations of bytes per trx, guided by your hint, and this is what I get (surprisingly, to me maybe, the number of bytes is not an integer):
10.3.7: trxid2 = 215344539, bytes per trx = 270.43
10.2.14: trxid1 = 11388, bytes per trx = 263.32
The multiuser benchmark was run between point 1 and point 2. |
| Comment by Marko Mäkelä [ 2018-05-15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
To reduce the redo log volume, it looks like we have to resume generating the redo log record types MLOG_UNDO_INIT and MLOG_UNDO_HDR_REUSE. The record MLOG_UNDO_INSERT was already restored in | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-05-15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
MLOG_UNDO_HDR_REUSE was removed by | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2018-05-15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Regarding the "relax CPU" number of loops for "rep nop" being 200 (https://github.com/MariaDB/server/commit/b3346c2f41d0f7bd0d919826592b37ae09fa41c8#diff-e877f572e9fd77c43d5e7af075f92e74R91): this is something from me. I measured it as optimal back in 2008, on (probably) a Core2 Duo. No idea if this still holds. |
| Comment by Marko Mäkelä [ 2018-05-15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The MLOG_UNDO_INIT record should occupy 3..12 bytes while the lower-level records (MLOG_2BYTES, MLOG_4BYTES) could easily consume 7 more bytes. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-05-16 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I have now restored the MLOG_UNDO_INIT record, which I had removed in the hope that recovery would be simpler and faster. (But I failed to notice that a small increase in the number of redo log bytes written can hurt performance.) I developed a simple test for measuring the volume of redo log written by an update. Note that transaction IDs are incremented both at start and at commit. Due to
In the above test, you can disable purge to observe the effect on the number of redo log bytes written. The test should probably run longer for the purge to kick in. The effect of |
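A measurement of this kind needs to pull the two counters out of the SHOW ENGINE INNODB STATUS text. A sketch of that extraction step (hedged: a canned two-line sample stands in for the real status output here; in a real run the text would come from something like `mysql -e 'SHOW ENGINE INNODB STATUS\G'`, and the snapshot would be taken before and after the workload):

```shell
# Extract "Trx id counter" and "Log sequence number" from (sample)
# SHOW ENGINE INNODB STATUS text. The sample values are illustrative.
status_sample='Trx id counter 50
Log sequence number 2000000'
printf '%s\n' "$status_sample" | awk '
  /^Trx id counter/      { trx = $4 }
  /^Log sequence number/ { lsn = $4 }
  END { print "trx=" trx " lsn=" lsn }'
# prints: trx=50 lsn=2000000
```

Taking this snapshot before and after an update, then dividing the LSN delta by the trx-id delta, gives the redo bytes per transaction figure discussed above.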
| Comment by Marko Mäkelä [ 2018-05-17 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I can confirm this in a memory-based benchmark:
and
The transactions/second number is:
Something has drastically changed between 10.3.4 and 10.3.5. Commit 5db9c6e490ea2c6b7997dff4cbee18274496dd2c is the biggest culprit. I still think that the idea of I chose commit af566d8a63f908c76d43957e790e861d944486b0 because it is in the middle of some code changes by svoj. There is some visible regression between that and mariadb-10.3.5, and not much improvement since then. The patch against the most recent 10.3 essentially disabled the resetting of the DB_TRX_ID (
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-05-17 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Surprisingly, removing the following supposed optimization
fixes the performance regression of 5db9c6e490ea2c6b7997dff4cbee18274496dd2c. With this optimization removed, we are always acquiring and releasing the TRX_SYS page latch. Replication and Galera are disabled, after all. This patch is not applicable to later versions, because | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-05-17 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I applied a similar unnecessary acquisition of the TRX_SYS page lock to the latest 10.3, but it did reduce performance as expected:
I tried to remove a write that was added by
This write is performed while holding rseg->mutex, but there are 128 rollback segments, so I do not think it should matter with 40 concurrent users, if each transaction is assigned a different rollback segment. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2018-05-17 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
My 10.3 (b2f56085d4249e2cf3ed9f80d284b3ce837e9e29) results are not in line with what marko reported:
However, the 10.2 results are perfectly in line:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2018-05-17 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
maybe you need to run just a little bit longer, like marko and I did | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2018-05-17 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
OTOH, marko ran for 2 minutes as well. I reported results for 300-second runs. |
| Comment by Marko Mäkelä [ 2018-05-17 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I used the following build configuration for both 10.2 and 10.3:
svoj used the following:
I retested commit 5db9c6e490ea2c6b7997dff4cbee18274496dd2c with the latter config:
I applied the patch to remove the optimization from trx_write_serialisation_history() (which counter-intuitively improved performance from about 100ktps to 178ktps using my config):
So, removing the optimization does help in both cases, but much more with my build configuration. This change is not relevant for solving the regression, because the code has been seriously rewritten since then, but it shows that performance is hard. Perhaps the explanation could be that the TRX_SYS page lock serves as some kind of useful throttle that reduces contention on the rseg->mutex? It would seem best to acquire and release rseg->mutex only once during commit, and to try to ensure that the transactions running in different threads do not get assigned the same rseg (of course, this is only possible for up to 128 concurrent threads). | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-05-17 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I tried modifying svoj’s config by removing -DWITH_JEMALLOC=system. Performance remained roughly the same:
So, jemalloc does help multi-client performance a little compared to GNU libc 2.27, but not that much, and the relative performance difference remains. A major difference between our build configurations could be that cmake/build_configurations/mysql_release.cmake uses -O3 while -DCMAKE_BUILD_TYPE=RelWithDebInfo only uses -O2. After replacing -O2 with -O3 in my configuration, I got the following:
This suggests that there could be a clear regression from 10.2 to 10.3 when compiling with -O2. I hope that most distributed packages use -O3. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
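The -O2 vs -O3 comparison above can be made explicit at configure time. A sketch (the exact original build configurations were not preserved in this export, so these invocations are assumptions; with GCC, CMake's stock RelWithDebInfo flags are `-O2 -g -DNDEBUG`):

```shell
# Default RelWithDebInfo build (implies -O2 -g -DNDEBUG with GCC):
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..

# To test the same build type at -O3, override the per-config flags:
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DCMAKE_C_FLAGS_RELWITHDEBINFO="-O3 -g -DNDEBUG" \
      -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-O3 -g -DNDEBUG" ..

# Verify which -O level actually lands on the compile command lines:
make VERBOSE=1 2>&1 | grep -o -- '-O[0-9s]' | sort | uniq -c
```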
| Comment by Vladislav Vaintroub [ 2018-05-17 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Distributed packages are compiled with -DBUILD_CONFIG=mysql_release. Still, I suggest looking at more than just threads=1 and threads=40; 40 threads might be a sweet spot on svoj's machine. I tried 1,...,4096, which is more comprehensive. I suggest varying durability settings as well (innodb_flush_log_at_trx_commit = 1 is the default). innodb_flush_method=O_DIRECT_NO_FSYNC is also a trick that pretends your box has a battery-backed disk, which it probably does not. Also dataset size, AHI vs no AHI, threadpool vs no threadpool, etc. Seeing the full picture after running 2 tests of 2 minutes each is not easy. I'd really start with a full comparison across different levels of concurrency, and then vary other parameters if that does not show the regression. |
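The fuller sweep suggested above could be driven by a small matrix script. A sketch (hedged: the thread counts and settings chosen here are illustrative, and the commands are printed rather than executed):

```shell
# Enumerate the benchmark matrix: concurrency x durability settings.
# A real driver would substitute each line into a sysbench invocation
# and a matching server restart with the given InnoDB setting.
for flush in 0 1 2; do                 # innodb_flush_log_at_trx_commit
  for N in 1 4 16 64 256 1024 4096; do
    echo "run: threads=$N innodb_flush_log_at_trx_commit=$flush"
  done
done
```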
| Comment by Marko Mäkelä [ 2018-05-18 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I had a systematic error, caused by not rebuilding from scratch. It looks like -DBUILD_CONFIG=mysql_release does use -O2 after all. My observations about 5db9c6e490ea2c6b7997dff4cbee18274496dd2c and its predecessor are repeatable, and I still see a significant performance regression with 10.3.5 and the latest 10.3. Here are some rough results from 3-minute runs:
There is some fluctuation in these numbers, so 10% variance could be ignored. Surprisingly, 53cc9aa5be63b4dd03aba804aebaa3bd470dab9c ( Another interesting pair is bc7a1dc1fbd27e6064d3b40443fe242397668af7 and b782971c58b5656820429b8ef3fae5fd82f5a0f7. The latter commit fixed a correctness bug that was introduced by the former. There are quite a few changes in between, and there appears to be some regression already in the parent of the fix (ba125eca551e3f5faee9f6d7ebed290f04cfc8fd). Next I will try to find the regression between bc7a1dc1fbd27e6064d3b40443fe242397668af7 and ba125eca551e3f5faee9f6d7ebed290f04cfc8fd. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
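Narrowing a regression down to a single commit, as done in this thread by retesting individual revisions, can be automated with `git bisect run`. A sketch, demonstrated on a throwaway repository so it is self-contained (in the real hunt, the repository would be the MariaDB tree and the test command a benchmark script that exits nonzero when tps falls below a threshold):

```shell
#!/bin/sh
# Demonstrate an automated bisection: commits 1..8, with a simulated
# performance regression starting at commit 5.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
for i in 1 2 3 4 5 6 7 8; do
  if [ "$i" -ge 5 ]; then
    echo "slow $i" > tps.txt    # regression present from commit 5 on
  else
    echo "fast $i" > tps.txt
  fi
  git add tps.txt
  git commit -qm "commit $i"
done
git bisect start HEAD HEAD~7 >/dev/null   # HEAD is bad, HEAD~7 is good
# The test command exits 0 for a good revision, 1 for a bad one; a real
# script would run the benchmark and compare tps against a threshold.
git bisect run sh -c '! grep -q slow tps.txt' >/dev/null
git log -1 --format=%s refs/bisect/bad    # prints: commit 5
```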
| Comment by Marko Mäkelä [ 2018-05-18 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Commit 217fc122c8e893aa48fa5dcfb3dbfcc99a2aa299 is a merge from bb-10.2-ext, most notably adding st_mem_root::name. That is clearly costing some performance compared to its parent d6ed077fc82c2f4d20895db9a6b16ca295f23c33. I tried reverting the patch:
It seems that we must define st_mem_root::name in debug builds only. Still, the additional regression between 217fc122c8e893aa48fa5dcfb3dbfcc99a2aa299 and eda142590f385b3d79e7b6a6e0cc13ba8fd25961 has to be narrowed down. It is outside InnoDB; the InnoDB differences in that range are trivial integer type mismatch warning fixes and fixes of source code comments. The only other commits in this range are CTE fixes ( | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-05-18 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
It is surprising that the performance improved in the middle, even though st_mem_root::name is still present. The CTE fixes seem innocent with respect to performance. It is also surprising that the I would suggest introducing a compile-time option for enabling st_mem_root::name (disabled by default), and revisiting |
| Comment by Sergey Vojtovich [ 2018-05-18 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Here's what I get on 6c279ad6a71c63cb595fde7c951aadb31c3dbebc:
Sometimes I experience similar turbulence in benchmarks, which I have not been able to explain so far. It usually goes away by itself, and I can't reliably analyse it. |
| Comment by Vladislav Vaintroub [ 2018-05-18 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I get the feeling that the original report is being completely ignored, and that everyone is now running ramdisk benchmarks, testing CPU caches and compiler optimizations, on 40 threads, on 40-core machines, for 2 minutes. |
| Comment by Sergey Vojtovich [ 2018-05-18 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
You're right, we need to concentrate on the original report, which I couldn't reproduce locally, most probably due to a known issue with Samsung SSDs being slow on heavy fsync. I will try to look into ramdisk performance more closely, with longer runs and more concurrent connections (like 1000). |
| Comment by Marko Mäkelä [ 2018-05-21 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I agree that we must address the I/O issue, but at the same time it could be worth looking for CPU regressions, that is, approach repeatable regressions from multiple angles. For what it is worth, I repeated some 180-second ramdisk benchmarks today:
The baseline was 123927.89tps before the weekend, and now it is 130213.08tps, so there is indeed some turbulence in the figures, even after I switched to always doing a full rebuild. Patch 1 reverts some struct or class member variables from size_t to a narrower type, such as uint or uint32. A minimal version of the patch (1m) is below:
In my test results above, this minimal patch appeared to cause a slight performance regression, but I think we can explain that by saying it falls within the 5% or 10% fluctuation band. Patch 2 removes st_mem_root::name and the name parameter of init_alloc_root():
I would suggest including this patch in 10.3 in some form. With GCC 7.3.0 and the 3-minute CPU-bound ramdisk benchmark, we still appear to have a significant regression somewhere between 10.3.4 and 10.3.5. I will keep trying to track it down, while svoj can concentrate on the originally reported regression. |
| Comment by Marko Mäkelä [ 2018-05-21 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I can still repeat the huge drop caused by commit 6c279ad6a71c63cb595fde7c951aadb31c3dbebc: from 207119.69tps down to 130029.74tps. I narrowed it down to the sql_class.h and sql_class.cc changes in that commit, but when I reverted that part of the patch in the latest 10.3, I did not see any improvement. In my benchmark, I would see contention on lock_sys.mutex and log_sys.mutex. The situation should be improved by MDEV-16232 and Also, at low concurrency, it seems that purge is wasting resources and creating unnecessary contention. Perhaps the number of purge threads should be automatically scaled based on the number of purgeable records? |
| Comment by Marko Mäkelä [ 2018-08-24 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
A bad merge of Given that the performance regression reported in this ticket was reported on Windows, where the systemd interface is not present, this probably does not explain much of the regression. Also, I probably did not have libsystemd-dev installed when compiling code for running benchmarks. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2020-11-17 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
By now, we are all educated that |