[MDEV-19738] Doublewrite buffer is unnecessarily used for newly (re)initialized pages Created: 2019-04-18  Updated: 2023-11-27  Resolved: 2019-06-12

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.2, 10.3, 10.4, 10.5
Fix Version/s: 10.5.0

Type: Bug Priority: Critical
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: performance, recovery

Issue Links:
Blocks
is blocked by MDEV-12699 Improve crash recovery of corrupted d... Closed
PartOf
Relates
relates to MDEV-19739 Avoid unnecessary page reads on recov... Closed
relates to MDEV-19747 Deprecate and ignore innodb_log_optim... Closed
relates to MDEV-31816 buf_LRU_free_page() does not preserve... Closed
relates to MDEV-11125 Introduce a reduced doublewrite mode,... Closed
relates to MDEV-22369 Assertion `err == DB_SUCCESS' failed.... Closed

 Description   

Thanks to MDEV-12699, the doublewrite buffer will only be needed in those cases when a page is being updated in the data file. If the page had never been written to the data file since it was initialized, then recovery will be able to reconstruct the page based solely on the contents of the redo log files.

The doublewrite buffer is only really needed when recovery needs to read the page in order to apply redo log.

Avoiding unnecessary use of the doublewrite buffer should improve the write performance of InnoDB.



 Comments   
Comment by Marko Mäkelä [ 2019-06-11 ]

It seems that this revealed a problem in MDEV-12699 (we are unnecessarily reading a data page that is being completely initialized by redo log records). I will analyze a dataset of an innodb.innodb-64k-crash failure in more depth tomorrow. I repeated the failure with this patch applied on top of both 10.4 and 10.5.

Comment by Marko Mäkelä [ 2019-06-12 ]

It currently looks like the MDEV-12699 logic is innocent after all, and that my code change for this task is faulty. We must not disable doublewrite buffering when redo logging is or was disabled. In the failure that I analyzed, space->enable_lsn was set due to MLOG_INDEX_LOAD records. I have not yet determined whether this optimization must depend on innodb_log_optimize_ddl=OFF (it is ON by default).

Comment by Marko Mäkelä [ 2019-06-13 ]

To assess the performance impact of this, I developed a simple benchmark that attempts to initialize and flush many pages.

--source include/have_innodb.inc
--source include/have_sequence.inc
 
CREATE TABLE t1 (
 a BIGINT PRIMARY KEY,
 b CHAR(255) NOT NULL DEFAULT '',
 c CHAR(255) NOT NULL DEFAULT '',
 d CHAR(255) NOT NULL DEFAULT '',
 e CHAR(255) NOT NULL DEFAULT '',
 f CHAR(255) NOT NULL DEFAULT '',
 g CHAR(255) NOT NULL DEFAULT '',
 h CHAR(255) NOT NULL DEFAULT ''
) ENGINE=InnoDB;
INSERT INTO t1 (a) SELECT seq FROM seq_1_to_1000000;
FLUSH TABLES t1 FOR EXPORT;
UNLOCK TABLES;
DROP TABLE t1;

Notes:

  • Without the FLUSH TABLES t1 FOR EXPORT, the test would complete very fast, because no pages would have been flushed before the DROP TABLE executes.
  • The first-time run of the test is slower, because undo log pages will be allocated for the huge transaction. On subsequent runs, the undo pages will be reused.
  • Due to MDEV-12288, the pages backed by t1.ibd will be modified twice: first by the initial INSERT, and then by purge, which will reset the DB_TRX_ID.

I made mysql-test/var a symlink to a directory on a Linux ext4 file system residing on an Intel Optane 960 NVMe SSD and ran the test:

for opt in '' --mysqld=--skip-innodb-{log-optimize-ddl,doublewrite}
do ./mtr --repeat=10 $opt --mysqld=--innodb-page-size=4k innodb.writebench
done

Here are the results:

10.5 d46db415cef910c65b0e550a9d00fe188cfd8910

configuration                                          first round  minimum  maximum  average
baseline (doublewrite buffer enabled)                  36175ms      26611ms  28623ms  27673ms
--skip-innodb-log-optimize-ddl (optimization enabled)  35855ms      26415ms  28976ms  27508ms
--skip-innodb-doublewrite                              33331ms      24392ms  25865ms  25166ms

Because the difference is so small, I repeated the whole test. In that repeat, the average times of the 9 runs after the first were 27975ms, 26353ms and 25786ms for the three configurations, so the patch does seem to give some benefit. I got similar numbers on /dev/shm (only a little faster in the --skip-innodb-doublewrite case).

Based on some additional instrumentation, it turned out that this test case is causing pages to be flushed twice, both before and after the purge:

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
index aa44a1c5e4c..3534f846211 100644
--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -1043,6 +1043,8 @@ buf_flush_write_block_low(
 	const bool use_doublewrite = !bpage->init_on_flush
 		&& space->use_doublewrite();
 
+	ib::info() << "flush " << use_doublewrite << bpage->id;
+
 	if (!use_doublewrite) {
 		ulint	type = IORequest::WRITE | IORequest::DO_NOT_WAKE;
 
diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
index e6bddd32438..5a5682236ed 100644
--- a/storage/innobase/handler/ha_innodb.cc
+++ b/storage/innobase/handler/ha_innodb.cc
@@ -15462,6 +15462,7 @@ ha_innobase::external_lock(
 				DBUG_RETURN(HA_ERR_NO_SUCH_TABLE);
 			}
 
+			ib::info() << "flush " << m_prebuilt->table->name;
 			row_quiesce_table_start(m_prebuilt->table, trx);
 
 			/* Use the transaction instance to track UNLOCK

I revised the test case accordingly (it now explicitly waits for purge before initiating the page flush) and reverted the instrumentation patch before re-running the tests.

--source include/have_innodb.inc
--source include/have_sequence.inc
 
SET @saved_frequency = @@GLOBAL.innodb_purge_rseg_truncate_frequency;
SET GLOBAL innodb_purge_rseg_truncate_frequency=1;
 
CREATE TABLE t1 (
 a BIGINT PRIMARY KEY,
 b CHAR(255) NOT NULL DEFAULT '',
 c CHAR(255) NOT NULL DEFAULT '',
 d CHAR(255) NOT NULL DEFAULT '',
 e CHAR(255) NOT NULL DEFAULT '',
 f CHAR(255) NOT NULL DEFAULT '',
 g CHAR(255) NOT NULL DEFAULT '',
 h CHAR(255) NOT NULL DEFAULT ''
) ENGINE=InnoDB;
INSERT INTO t1 (a) SELECT seq FROM seq_1_to_1000000;
--source include/wait_all_purged.inc
 
FLUSH TABLES t1 FOR EXPORT;
UNLOCK TABLES;
DROP TABLE t1;
SET GLOBAL innodb_purge_rseg_truncate_frequency=@saved_frequency;

I also ensured that the buffer pool is large enough:

for opt in '' --mysqld=--skip-innodb-{log-optimize-ddl,doublewrite}
do ./mtr --repeat=5 $opt --mysqld=--innodb-buffer-pool-size=32m \
--mysqld=--skip-innodb-buffer-pool-{load-at-startup,dump-at-shutdown} \
--mysqld=--innodb-page-size=4k innodb.writebench
done

With this test, we are seeing a similar difference:

10.5 d46db415cef910c65b0e550a9d00fe188cfd8910

configuration                                          first round  minimum  maximum  average
baseline (doublewrite buffer enabled)                  68613ms      38739ms  39035ms  38871ms
--skip-innodb-log-optimize-ddl (optimization enabled)  67784ms      35784ms  38682ms  37828ms
--skip-innodb-doublewrite                              64150ms      33085ms  34142ms  33524ms

Apparently, the test time is dominated by something other than file writes. Finally, I gathered some counters to confirm that the time savings are due to I/O. I changed the interesting part of the test to

INSERT INTO t1 (a) SELECT seq FROM seq_1_to_1000;
--source include/wait_all_purged.inc
 
show status like 'innodb_d%_writes';
FLUSH TABLES t1 FOR EXPORT;
show status like 'innodb_d%_writes';

(yes, I also reduced the number of rows to make the test run faster), and got the following numbers when running each configuration with --repeat=2:

configuration                                          innodb_data_writes+innodb_dblwr_writes
baseline (doublewrite buffer enabled)                  7+0, 1024+508, 1032+508, 2049+1016
--skip-innodb-log-optimize-ddl (optimization enabled)  7+0, 516+0, 524+0, 1033+0
--skip-innodb-doublewrite                              7+0, 516+0, 524+0, 1033+0

The numbers are cumulative from the start of the server instance. So, in this artificial example, the doublewrite buffer is being skipped altogether thanks to our optimization. (Apparently no pages in the system tablespace get flushed, because the test runs for such a short time that no log checkpoint is executed.)

The reason why disabling the doublewrite buffer altogether is improving performance further could be that we can combine writes in that case. This can be improved further in MDEV-16526.

Generated at Thu Feb 08 08:53:59 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.