MariaDB Server / MDEV-19738

Doublewrite buffer is unnecessarily used for newly (re)initialized pages

Details

    Description

      Thanks to MDEV-12699, the doublewrite buffer is only needed when a page that was previously written to the data file is being updated there. If the page has never been written to the data file since it was (re)initialized, recovery will be able to reconstruct the page based solely on the contents of the redo log files.

      In other words, the doublewrite buffer is only really needed when recovery has to read the old page image in order to apply redo log records to it.

      Avoiding unnecessary use of the doublewrite buffer should improve the write performance of InnoDB.
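
      As an illustration, here is a minimal standalone C++ model of that flush-time decision. This is not server code: the names init_on_flush and use_doublewrite() are borrowed from the instrumentation patch quoted in a comment below, and the surrounding types are simplified stand-ins.

      #include <cstdio>

      // Simplified stand-in for fil_space_t.
      struct space_model {
        bool doublewrite_enabled;
        bool use_doublewrite() const { return doublewrite_enabled; }
      };

      // Simplified stand-in for buf_page_t. init_on_flush models a flag
      // that is set when every change to the page since its latest
      // (re)initialization is covered by redo log records, so recovery
      // never needs the old on-disk contents.
      struct page_model {
        bool init_on_flush;
      };

      // A torn write of a fully log-reconstructible page is harmless;
      // the doublewrite buffer is only needed for pages that recovery
      // might have to read back.
      bool needs_doublewrite(const page_model& bpage, const space_model& space)
      {
        return !bpage.init_on_flush && space.use_doublewrite();
      }

      int main()
      {
        const space_model space{true};
        std::printf("freshly (re)initialized page needs doublewrite: %d\n",
                    needs_doublewrite(page_model{true}, space));
        std::printf("previously written page needs doublewrite:     %d\n",
                    needs_doublewrite(page_model{false}, space));
      }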

          Activity

            marko Marko Mäkelä added a comment:

            It seems that this revealed a problem in MDEV-12699 (we are unnecessarily reading a data page that is being completely initialized by redo log records). I will analyze a dataset of the innodb.innodb-64k-crash failure in more depth tomorrow. I repeated the failure with this patch on top of both 10.4 and 10.5.
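
            For context, a hedged standalone model of the MDEV-12699 recovery idea being discussed (my simplification, not actual server code): if the oldest redo log record that applies to a page completely (re)initializes it, recovery can rebuild the page from the log alone and need not read the stale page from the data file.

            #include <cstdio>
            #include <vector>

            // Toy redo record kinds: INIT rewrites the whole page,
            // UPDATE modifies existing contents.
            enum class rec_type { INIT, UPDATE };

            // Recovery must read the old page image only when the first
            // record to apply is an UPDATE; if it is an INIT, the page
            // is reconstructible from the log alone.
            bool must_read_page(const std::vector<rec_type>& recs)
            {
              return !recs.empty() && recs.front() != rec_type::INIT;
            }

            int main()
            {
              std::printf("INIT then UPDATE -> read needed: %d\n",
                          must_read_page({rec_type::INIT, rec_type::UPDATE}));
              std::printf("UPDATE only      -> read needed: %d\n",
                          must_read_page({rec_type::UPDATE}));
            }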

            marko Marko Mäkelä added a comment:

            It currently looks like the MDEV-12699 logic is innocent after all, and my code change for this task is faulty. We must not disable doublewrite buffering when redo logging is or was disabled. In the failure that I analyzed, space->enable_lsn was set due to MLOG_INDEX_LOAD records. I have not yet determined whether this optimization must depend on innodb_log_optimize_ddl=OFF (it is ON by default).
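
            Continuing the toy model from the description, the caveat above could look roughly like an extra guard (again hypothetical, not the actual fix): when redo logging is or was disabled for a tablespace, its pages are not reconstructible from the log, so they must keep going through the doublewrite buffer.

            // enable_lsn stands in for fil_space_t::enable_lsn; in this
            // sketch it is assumed nonzero when redo logging was at some
            // point disabled for the tablespace (e.g. MLOG_INDEX_LOAD).
            struct space_model {
              bool doublewrite_enabled;
              unsigned long long enable_lsn; // 0 = logging never disabled
              bool use_doublewrite() const { return doublewrite_enabled; }
            };

            struct page_model { bool init_on_flush; };

            bool needs_doublewrite(const page_model& bpage,
                                   const space_model& space)
            {
              // Never skip doublewrite for a tablespace whose changes
              // may be missing from the redo log.
              if (space.enable_lsn != 0)
                return space.use_doublewrite();
              return !bpage.init_on_flush && space.use_doublewrite();
            }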

            marko Marko Mäkelä added a comment:

            To assess the performance impact of this change, I developed a simple benchmark that initializes and flushes a large number of pages.

            --source include/have_innodb.inc
            --source include/have_sequence.inc
             
            CREATE TABLE t1 (
             a BIGINT PRIMARY KEY,
             b CHAR(255) NOT NULL DEFAULT '',
             c CHAR(255) NOT NULL DEFAULT '',
             d CHAR(255) NOT NULL DEFAULT '',
             e CHAR(255) NOT NULL DEFAULT '',
             f CHAR(255) NOT NULL DEFAULT '',
             g CHAR(255) NOT NULL DEFAULT '',
             h CHAR(255) NOT NULL DEFAULT ''
            ) ENGINE=InnoDB;
            INSERT INTO t1 (a) SELECT seq FROM seq_1_to_1000000;
            FLUSH TABLES t1 FOR EXPORT;
            UNLOCK TABLES;
            DROP TABLE t1;
            

            Notes:

            • Without the FLUSH TABLES t1 FOR EXPORT, the test would complete very quickly, because no pages would have been flushed before the DROP TABLE executes.
            • The first run of the test is slower, because undo log pages will be allocated for the huge transaction. On subsequent runs, the undo pages will be reused.
            • Due to MDEV-12288, the pages backed by t1.ibd will be modified twice: first by the initial INSERT, and then by purge, which will reset the DB_TRX_ID.

            I made mysql-test/var a symlink to a directory on a Linux ext4 file system residing on an Intel Optane 960 NVMe SSD and ran the test:

            for opt in '' --mysqld=--skip-innodb-{log-optimize-ddl,doublewrite}
            do ./mtr --repeat=10 $opt --mysqld=--innodb-page-size=4k innodb.writebench
            done
            

            Here are the results:

            10.5 d46db415cef910c65b0e550a9d00fe188cfd8910

            configuration                                   first round  minimum  maximum  average
            baseline (doublewrite buffer enabled)           36175ms      26611ms  28623ms  27673ms
            --skip-log-optimize-ddl (optimization enabled)  35855ms      26415ms  28976ms  27508ms
            --skip-innodb-doublewrite                       33331ms      24392ms  25865ms  25166ms

            Because the difference is so small, I repeated the test. On the second run of the experiment, the average times of the 9 subsequent iterations were 27975ms, 26353ms, and 25786ms for the three configurations, respectively. So, the patch does seem to give some benefit. I got similar numbers on /dev/shm (only slightly faster in the skip-innodb-doublewrite case).

            Based on some additional instrumentation, it turned out that this test case is causing pages to be flushed twice, both before and after the purge:

            diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
            index aa44a1c5e4c..3534f846211 100644
            --- a/storage/innobase/buf/buf0flu.cc
            +++ b/storage/innobase/buf/buf0flu.cc
            @@ -1043,6 +1043,8 @@ buf_flush_write_block_low(
             	const bool use_doublewrite = !bpage->init_on_flush
             		&& space->use_doublewrite();
             
            +	ib::info() << "flush " << use_doublewrite << bpage->id;
            +
             	if (!use_doublewrite) {
             		ulint	type = IORequest::WRITE | IORequest::DO_NOT_WAKE;
             
            diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
            index e6bddd32438..5a5682236ed 100644
            --- a/storage/innobase/handler/ha_innodb.cc
            +++ b/storage/innobase/handler/ha_innodb.cc
            @@ -15462,6 +15462,7 @@ ha_innobase::external_lock(
             				DBUG_RETURN(HA_ERR_NO_SUCH_TABLE);
             			}
             
            +			ib::info() << "flush " << m_prebuilt->table->name;
             			row_quiesce_table_start(m_prebuilt->table, trx);
             
             			/* Use the transaction instance to track UNLOCK
            

            I revised the test case accordingly (explicitly waiting for purge to complete before initiating the page flush) and reverted the instrumentation patch before running the tests.

            --source include/have_innodb.inc
            --source include/have_sequence.inc
             
            SET @saved_frequency = @@GLOBAL.innodb_purge_rseg_truncate_frequency;
            SET GLOBAL innodb_purge_rseg_truncate_frequency=1;
             
            CREATE TABLE t1 (
             a BIGINT PRIMARY KEY,
             b CHAR(255) NOT NULL DEFAULT '',
             c CHAR(255) NOT NULL DEFAULT '',
             d CHAR(255) NOT NULL DEFAULT '',
             e CHAR(255) NOT NULL DEFAULT '',
             f CHAR(255) NOT NULL DEFAULT '',
             g CHAR(255) NOT NULL DEFAULT '',
             h CHAR(255) NOT NULL DEFAULT ''
            ) ENGINE=InnoDB;
            INSERT INTO t1 (a) SELECT seq FROM seq_1_to_1000000;
            --source include/wait_all_purged.inc
             
            FLUSH TABLES t1 FOR EXPORT;
            UNLOCK TABLES;
            DROP TABLE t1;
            SET GLOBAL innodb_purge_rseg_truncate_frequency=@saved_frequency;
            

            I also ensured that the buffer pool is large enough:

            for opt in '' --mysqld=--skip-innodb-{log-optimize-ddl,doublewrite}
            do ./mtr --repeat=5 $opt --mysqld=--innodb-buffer-pool-size=32m \
            --mysqld=--skip-innodb-buffer-pool-{load-at-startup,dump-at-shutdown} \
            --mysqld=--innodb-page-size=4k innodb.writebench
            done
            

            With this test, we are seeing a similar difference:

            10.5 d46db415cef910c65b0e550a9d00fe188cfd8910

            configuration                                   first round  minimum  maximum  average
            baseline (doublewrite buffer enabled)           68613ms      38739ms  39035ms  38871ms
            --skip-log-optimize-ddl (optimization enabled)  67784ms      35784ms  38682ms  37828ms
            --skip-innodb-doublewrite                       64150ms      33085ms  34142ms  33524ms

            Apparently, the test time is dominated by something other than file writes. Finally, I gathered some counters to confirm that the time savings are due to I/O. I changed the interesting part of the test to:

            INSERT INTO t1 (a) SELECT seq FROM seq_1_to_1000;
            --source include/wait_all_purged.inc
             
            SHOW STATUS LIKE 'innodb_d%_writes';
            FLUSH TABLES t1 FOR EXPORT;
            SHOW STATUS LIKE 'innodb_d%_writes';
            

            (Yes, I also reduced the number of rows to make the test run faster.) Running each configuration with --repeat=2 produced the following numbers:

            configuration                                   innodb_data_writes+innodb_dblwr_writes
            baseline (doublewrite buffer enabled)           7+0, 1024+508, 1032+508, 2049+1016
            --skip-log-optimize-ddl (optimization enabled)  7+0, 516+0, 524+0, 1033+0
            --skip-innodb-doublewrite                       7+0, 516+0, 524+0, 1033+0

            The numbers are cumulative from the start of the server instance. So, in this artificial example, the doublewrite buffer is being skipped altogether thanks to our optimization. The arithmetic is consistent: after the first FLUSH, the baseline performed 1024-7=1017 data writes against 516-7=509 with the optimization, which matches each page being written both to the doublewrite buffer (508 writes) and to its final location. (Apparently we do not get to flush any pages in the system tablespace, because the test runs for such a short time that no log checkpoint is executed.)

            The reason why disabling the doublewrite buffer altogether improves performance even further could be that writes can then be combined. This can be improved further in MDEV-16526.


            People

              Assignee: marko Marko Mäkelä
              Reporter: marko Marko Mäkelä
              Votes: 0
              Watchers: 4
