[MDEV-25451] TPC-C in-memory performance degradation (dblwr + s.t. more) Created: 2021-04-19  Updated: 2021-09-16  Resolved: 2021-07-26

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: 10.5.12, 10.6.3

Type: Task Priority: Major
Reporter: Alexander Krizhanovsky (Inactive) Assignee: Marko Mäkelä
Resolution: Fixed Votes: 1
Labels: performance

Attachments: File 10.5.9.metrics     File 10.6.metrics     PNG File checkpoint.png     File mariadb_10.5.9_116g_600.dstat     File mariadb_10.6_116g_600.dstat     PNG File mariadb_6_vs_59.png     PNG File mariadb_6_vs_59_boxplot.png     PNG File mariadb_6_vs_59_io.png     PNG File mariadb_6_vs_59_io_boxplot.png     PNG File mariadb_all.png     PDF File tpcc2.pdf     PNG File tpcc_10.6_1h_multi.png     File tpcc_mariadb-10.5.9_116bp_600.stat     File tpcc_mariadb-10.5.9_116bp_600_nodblwr.stat     File tpcc_mariadb-10.6_110bp_3h.stat     File tpcc_mariadb-10.6_116bp_600.stat     File tpcc_mariadb-10.6_116bp_600_nodblwr.stat    
Issue Links:
Relates
relates to MDEV-25404 read-only performance regression in 10.6 Closed
Epic Link: adaptive flushing

 Description   

I'm evaluating MariaDB 10.5.9 performance against 10.5.5, as described in https://www.percona.com/blog/2020/08/14/evaluating-performance-improvements-in-mariadb-10-5-5/ . I also tried the current 10.6 (as of 2ad61c678243dec2a808b1f67b5760a725163052) and noticed a significant performance degradation against 10.5.9 on an in-memory workload (see mariadb_all.png).

The data size produced with sysbench-tpcc is 104G and the buffer pool is 116G, so this is a pure in-memory workload (all buffer_LRU_batch* metrics are zero). The testbed machine has a slow NVMe drive (26 KIOPS for random writes and 320 KIOPS for random reads), 128GB RAM, and 40 hyperthreads.

While the original tests run for 3 hours, 10.6 shows much slower performance right at the beginning of the test, so I reproduced the problem with 10-minute workloads. As the tpcc_mariadb-10.6_116bp_600.stat and tpcc_mariadb-10.5.9_116bp_600.stat sysbench statistics report, 10.6 achieves 3845 TPS versus 4965 TPS for 10.5.9.

10.6 and 10.5.9 have quite different I/O and CPU profiles, shown in the .dstat files. Since this is an in-memory workload, I suspected either the doublewrite buffer or checkpointing. I reran the tests with the doublewrite buffer switched off (_nodblwr.stat files): 10.6 is still slower than 10.5.9, but 10.6 is much more affected by switching off the doublewrite buffer than 10.5.9. So there is some problem with doublewrite, but there is definitely something more.
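For reference, a configuration sketch of how the doublewrite buffer can be switched off for such a comparison (the exact configuration used in these runs is not attached; innodb_doublewrite is read-only at runtime and has to be set at server startup):

```ini
# my.cnf fragment (sketch)
[mysqld]
innodb_doublewrite = 0
```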

I collected some InnoDB LRU, flushing, checkpointing, and log metrics at 10-second intervals (*.metrics files), but I didn't find any significant differences except log_lsn_checkpoint_age (see checkpoint.png). There were more or less the same numbers of checkpoints, page flushes, and log writes.

10.6 was built with io_uring, but I checked with gdb that there were no aio_uring::submit_io() calls.



 Comments   
Comment by Marko Mäkelä [ 2021-04-19 ]

Can you please check the newest 10.6 where the fix of MDEV-25404 is present?

Comment by Alexander Krizhanovsky (Inactive) [ 2021-04-19 ]

The recent commit 8751aa7397b2e698fa0b46ec3e60abb9e2fd7e1b significantly improves 10.6 performance, but a small performance degradation is still observed in both the short 10-minute and the long 3-hour TPC-C tests. I'm attaching plots for the 3-hour test.

Comment by Alexander Krizhanovsky (Inactive) [ 2021-04-20 ]

I also collected results for the IO-bound TPC-C workloads. The test is exactly the same, except that the buffer pool is configured to 25GB and there are only 40 threads. The average TPS for 10.5.9 is 1337, while for 10.6 with the fix it is 1284. The plots are attached (I'll try to tune out the waves later).

Comment by Alexander Krizhanovsky (Inactive) [ 2021-04-20 ]

Comment by Marko Mäkelä [ 2021-06-28 ]

I would suspect that the initial dip in performance could be related to MDEV-12288. MariaDB writes a single persistent undo log for transactions, so even INSERT causes work for the purge of history: the DB_TRX_ID is reset to 0 in records that are visible to every read view. This also prevents bugs like https://bugs.mysql.com/bug.php?id=55283 by design.

It should be possible to avoid that overhead by making use of the (admittedly awkward) interface that was introduced in MDEV-24818 to disable row-level undo logging when inserting into empty tables. I made an attempt at that by applying the following patch to https://github.com/Percona-Lab/sysbench-tpcc, but it did not seem to have the desired effect: during the ‘prepare’ phase, SHOW ENGINE INNODB STATUS would show transactions holding several undo log records. Had the patch worked, there would have been only one undo log record per table.

diff --git a/tpcc_common.lua b/tpcc_common.lua
index 0db7ab0..d8712c1 100644
--- a/tpcc_common.lua
+++ b/tpcc_common.lua
@@ -330,6 +330,11 @@ function create_tables(drv, con, table_num)
 
    con:query(query)
 
+   if drv:name() == "mysql"
+   then
+      con:query("SET unique_checks=0")
+   end
+
    con:bulk_insert_init("INSERT INTO item" .. i .." (i_id, i_im_id, i_name, i_price, i_data) values")
    for j = 1 , MAXITEMS do
       local i_im_id = sysbench.rand.uniform(1,10000)
@@ -345,6 +350,11 @@ function create_tables(drv, con, table_num)
    end
    con:bulk_insert_done()
 
+   if drv:name() == "mysql"
+   then
+      con:query("SET unique_checks=1")
+   end
+
     print(string.format("Adding indexes %d ... \n", i))
     con:query("CREATE INDEX idx_customer"..i.." ON customer"..i.." (c_w_id,c_d_id,c_last,c_first)")
     con:query("CREATE INDEX idx_orders"..i.." ON orders"..i.." (o_w_id,o_d_id,o_c_id,o_id)")
@@ -422,6 +432,11 @@ function load_tables(drv, con, warehouse_num)
 
    print(string.format("loading tables: %d for warehouse: %d\n", table_num, warehouse_num))
 
+   if drv:name() == "mysql"
+   then
+      con:query("SET unique_checks=0")
+   end
+
     con:bulk_insert_init("INSERT INTO warehouse" .. table_num .. 
 	" (w_id, w_name, w_street_1, w_street_2, w_city, w_state, w_zip, w_tax, w_ytd) values")
 
@@ -592,6 +607,7 @@ function load_tables(drv, con, warehouse_num)
 
    if drv:name() == "mysql"
    then
+	con:query("SET unique_checks=1")
 --   	con:query("SET @trx = (SELECT @@global.innodb_flush_log_at_trx_commit=0)")
 --   	con:query("SET GLOBAL innodb_flush_log_at_trx_commit=@trx")
    end

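For comparison, a sketch of the load pattern that the MDEV-24818 fast path is meant for, as I understand its description (this is my reading, not something verified in this thread): both flags off, and all rows for a table inserted by a single transaction into an empty table.

```sql
-- Sketch (assumption based on the MDEV-24818 description): the optimized
-- insert into an empty table needs both checks disabled, and one
-- transaction covering the whole load of that table.
SET unique_checks = 0, foreign_key_checks = 0;
START TRANSACTION;
INSERT INTO item1 (i_id, i_im_id, i_name, i_price, i_data)
VALUES (1, 1, 'name', 1.0, 'data') /* , ... all remaining rows */;
COMMIT;
SET unique_checks = 1, foreign_key_checks = 1;
```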
Comment by Alexander Krizhanovsky (Inactive) [ 2021-06-28 ]

I tried 10.6 as of commit 891a927e804c5a3a582f6137c2f316ef7abb25ca, and unfortunately there is still the deep performance dip at the beginning of the TPC-C workload. I'm also attaching the sysbench output for the graph. The good news is that after the dip, MariaDB shows pretty stable performance; there are no more minor waves (the graph uses 60-second averages, and the sysbench output still contains very short performance dips, but previously we saw the dips even in the averaged graphs).

tpcc_mariadb-10.6_110bp_3h.stat

Comment by Marko Mäkelä [ 2021-06-29 ]

I think that the easiest way to confirm my hypothesis that MDEV-12288 is related to the initial performance loss would be to shut down and restart the server between preparing the data and starting the benchmark. With innodb_fast_shutdown=0, shutdown will purge all history. If that makes the initial dip go away, the solution would be to tune the way the data is loaded: at most one transaction per table, so that each transaction inserts into an empty table. In a future version where MDEV-24621 has been implemented, data loads conducted in that way would become even faster.
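A sketch of that procedure between the load and run stages (MariaDB supports the SHUTDOWN statement, and innodb_fast_shutdown is a dynamic global variable):

```sql
-- After the 'prepare' phase, before the benchmark run:
SET GLOBAL innodb_fast_shutdown = 0;  -- full purge of history on shutdown
SHUTDOWN;
-- ...then restart the server and run the benchmark.
```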

Comment by Axel Schwenke [ 2021-07-26 ]

The TPC-C runs in tpcc2.pdf were done with a server shutdown between the load and run stages of the benchmark. innodb_fast_shutdown=0 was set in my.cnf. While there is certainly some warmup phase to be seen, no performance dip occurs.

Comment by Marko Mäkelä [ 2021-07-26 ]

Thank you, axel.

I believe that the reported regression has been fixed. To avoid the initial degradation of performance, the data load phase could be improved to take advantage of MDEV-515 in MariaDB 10.6: Use at most one transaction per table, so that each transaction will be inserting into an empty table. For now, MDEV-24818 can be used if the data load needs to be split into multiple statements; MDEV-25036 was not implemented in 10.6.

There still appears to be a scalability regression when the server is running on multiple NUMA nodes. Both 10.6.3 and the upcoming 10.5.12 show similar performance. On a 2-socket Intel server, when the number of concurrent connections is between 1 and 2 times the number of threads per CPU socket, MariaDB 10.5.6 (before MDEV-23399 and MDEV-23855) was observed to perform better. For any other number of concurrent connections, 10.5.6 performed worse than or equal to the newer versions in a TPC-C-like workload. I hope that this can be addressed by experimenting with NUMA-friendly mutexes or rw-locks.

Comment by Alexander Krizhanovsky (Inactive) [ 2021-07-26 ]

Hi axel,

I noticed that you used a 40GB buffer pool, while I observed the performance issue with relatively large buffer pool sizes, 100GB and more. Also, how slow is the SSD? The NVMe SSD in the original servers was able to handle 64K random writes of 16KB issued by 4 threads (I verified this with the fio tool), so I wouldn't say that the drive was slow. However, Dimitri had issues reproducing performance waves with fast enough drives: http://dimitrik.free.fr/blog/posts/mysql-80-innodb-checkpointing.html

I guess the depth of the dip is determined by the ratio between buffer pool size and disk speed. When the workload begins, InnoDB doesn't flush dirty pages at all (almost zero IO), but when we hit the maximum checkpoint age it goes into furious flushing, and we fall into the dip. I described the behavior in https://jira.mariadb.org/browse/MDEV-25113
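If it helps to correlate the dip with checkpointing, the checkpoint age can be sampled during the run; a sketch (assuming the Innodb_checkpoint_age global status variable available in recent MariaDB versions):

```sql
-- Sample once per interval during the benchmark:
SELECT variable_value AS checkpoint_age
  FROM information_schema.global_status
 WHERE variable_name = 'Innodb_checkpoint_age';
```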

I'm wondering whether InnoDB starts flushing right at the start of the workload. Is it possible to reproduce the original Percona tests with a large buffer pool?

Comment by Marko Mäkelä [ 2021-07-26 ]

The primary purpose of axel’s benchmark was to test something else, on a different system. So the results are not entirely comparable, but I was thinking that it should be close enough to say that we have an explanation for the initial dip in performance.

MDEV-12288 introduced special purging of undo log records for INSERT, which involves resetting the DB_TRX_ID in the clustered index records. These modifications will generate redo log and also cause ‘page write pressure’. We would basically have some ‘pseudo’ DML running in parallel with the normal DML workload until the backlog has been cleared.

Generated at Thu Feb 08 09:37:48 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.