MariaDB Server / MDEV-19738

Doublewrite buffer is unnecessarily used for newly (re)initialized pages

Details

    Description

      Thanks to MDEV-12699, the doublewrite buffer is only needed when a page that was previously written to the data file is being updated there. If the page has never been written to the data file since it was (re)initialized, recovery will be able to reconstruct the page based solely on the contents of the redo log files.

      In other words, the doublewrite buffer is only really needed when recovery has to read the old page image in order to apply redo log records to it.

      Avoiding unnecessary use of the doublewrite buffer should improve the write performance of InnoDB.
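
      As an illustration, here is a minimal standalone C++ model of that flush-time decision. This is not server code: the names init_on_flush and use_doublewrite() are borrowed from the instrumentation patch quoted in a comment below, and the surrounding types are simplified stand-ins.

      #include <cstdio>

      // Simplified stand-in for fil_space_t.
      struct space_model {
        bool doublewrite_enabled;
        bool use_doublewrite() const { return doublewrite_enabled; }
      };

      // Simplified stand-in for buf_page_t. init_on_flush models a flag
      // that is set when every change to the page since its latest
      // (re)initialization is covered by redo log records, so recovery
      // never needs the old on-disk contents.
      struct page_model {
        bool init_on_flush;
      };

      // A torn write of a fully log-reconstructible page is harmless;
      // the doublewrite buffer is only needed for pages that recovery
      // might have to read back.
      bool needs_doublewrite(const page_model& bpage, const space_model& space)
      {
        return !bpage.init_on_flush && space.use_doublewrite();
      }

      int main()
      {
        const space_model space{true};
        std::printf("freshly (re)initialized page needs doublewrite: %d\n",
                    needs_doublewrite(page_model{true}, space));
        std::printf("previously written page needs doublewrite:     %d\n",
                    needs_doublewrite(page_model{false}, space));
      }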

          Activity

            marko Marko Mäkelä added a comment:

            It seems that this revealed a problem in MDEV-12699 (we are unnecessarily reading a data page that is being completely initialized by redo log records). I will analyze a dataset of the innodb.innodb-64k-crash failure in more depth tomorrow. I repeated the failure with this patch on top of both 10.4 and 10.5.
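
            For context, a hedged standalone model of the MDEV-12699 recovery idea being discussed (my simplification, not actual server code): if the oldest redo log record that applies to a page completely (re)initializes it, recovery can rebuild the page from the log alone and need not read the stale page from the data file.

            #include <cstdio>
            #include <vector>

            // Toy redo record kinds: INIT rewrites the whole page,
            // UPDATE modifies existing contents.
            enum class rec_type { INIT, UPDATE };

            // Recovery must read the old page image only when the first
            // record to apply is an UPDATE; if it is an INIT, the page
            // is reconstructible from the log alone.
            bool must_read_page(const std::vector<rec_type>& recs)
            {
              return !recs.empty() && recs.front() != rec_type::INIT;
            }

            int main()
            {
              std::printf("INIT then UPDATE -> read needed: %d\n",
                          must_read_page({rec_type::INIT, rec_type::UPDATE}));
              std::printf("UPDATE only      -> read needed: %d\n",
                          must_read_page({rec_type::UPDATE}));
            }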

            marko Marko Mäkelä added a comment:

            It currently looks like the MDEV-12699 logic is innocent after all, and my code change for this task is faulty. We must not disable doublewrite buffering when redo logging is or was disabled. In the failure that I analyzed, space->enable_lsn was set due to MLOG_INDEX_LOAD records. I have not yet determined whether this optimization must depend on innodb_log_optimize_ddl=OFF (it is ON by default).
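
            Continuing the toy model from the description, the caveat above could look roughly like an extra guard (again hypothetical, not the actual fix): when redo logging is or was disabled for a tablespace, its pages are not reconstructible from the log, so they must keep going through the doublewrite buffer.

            // enable_lsn stands in for fil_space_t::enable_lsn; in this
            // sketch it is assumed nonzero when redo logging was at some
            // point disabled for the tablespace (e.g. MLOG_INDEX_LOAD).
            struct space_model {
              bool doublewrite_enabled;
              unsigned long long enable_lsn; // 0 = logging never disabled
              bool use_doublewrite() const { return doublewrite_enabled; }
            };

            struct page_model { bool init_on_flush; };

            bool needs_doublewrite(const page_model& bpage,
                                   const space_model& space)
            {
              // Never skip doublewrite for a tablespace whose changes
              // may be missing from the redo log.
              if (space.enable_lsn != 0)
                return space.use_doublewrite();
              return !bpage.init_on_flush && space.use_doublewrite();
            }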

            marko Marko Mäkelä added a comment:

            To assess the performance impact of this change, I developed a simple benchmark that initializes and flushes a large number of pages.

            --source include/have_innodb.inc
            --source include/have_sequence.inc
             
            CREATE TABLE t1 (
             a BIGINT PRIMARY KEY,
             b CHAR(255) NOT NULL DEFAULT '',
             c CHAR(255) NOT NULL DEFAULT '',
             d CHAR(255) NOT NULL DEFAULT '',
             e CHAR(255) NOT NULL DEFAULT '',
             f CHAR(255) NOT NULL DEFAULT '',
             g CHAR(255) NOT NULL DEFAULT '',
             h CHAR(255) NOT NULL DEFAULT ''
            ) ENGINE=InnoDB;
            INSERT INTO t1 (a) SELECT seq FROM seq_1_to_1000000;
            FLUSH TABLES t1 FOR EXPORT;
            UNLOCK TABLES;
            DROP TABLE t1;
            

            Notes:

            • Without the FLUSH TABLES t1 FOR EXPORT, the test would complete very quickly, because no pages would have been flushed before the DROP TABLE executes.
            • The first run of the test is slower, because undo log pages will be allocated for the huge transaction. On subsequent runs, the undo pages will be reused.
            • Due to MDEV-12288, the pages backed by t1.ibd will be modified twice: first by the initial INSERT, and then by purge, which will reset the DB_TRX_ID.

            I made mysql-test/var a symlink to a directory on a Linux ext4 file system residing on an Intel Optane 960 NVMe SSD and ran the test:

            for opt in '' --mysqld=--skip-innodb-{log-optimize-ddl,doublewrite}
            do ./mtr --repeat=10 $opt --mysqld=--innodb-page-size=4k innodb.writebench
            done
            

            Here are the results:

            10.5 d46db415cef910c65b0e550a9d00fe188cfd8910

            configuration                                   first round  minimum  maximum  average
            baseline (doublewrite buffer enabled)           36175ms      26611ms  28623ms  27673ms
            --skip-log-optimize-ddl (optimization enabled)  35855ms      26415ms  28976ms  27508ms
            --skip-innodb-doublewrite                       33331ms      24392ms  25865ms  25166ms

            Because the difference is so small, I repeated the test. On the second run of the experiment, the average times of the 9 subsequent iterations were 27975ms, 26353ms, and 25786ms for the three configurations, respectively. So, the patch does seem to give some benefit. I got similar numbers on /dev/shm (only slightly faster in the skip-innodb-doublewrite case).

            Based on some additional instrumentation, it turned out that this test case is causing pages to be flushed twice, both before and after the purge:

            diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
            index aa44a1c5e4c..3534f846211 100644
            --- a/storage/innobase/buf/buf0flu.cc
            +++ b/storage/innobase/buf/buf0flu.cc
            @@ -1043,6 +1043,8 @@ buf_flush_write_block_low(
             	const bool use_doublewrite = !bpage->init_on_flush
             		&& space->use_doublewrite();
             
            +	ib::info() << "flush " << use_doublewrite << bpage->id;
            +
             	if (!use_doublewrite) {
             		ulint	type = IORequest::WRITE | IORequest::DO_NOT_WAKE;
             
            diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
            index e6bddd32438..5a5682236ed 100644
            --- a/storage/innobase/handler/ha_innodb.cc
            +++ b/storage/innobase/handler/ha_innodb.cc
            @@ -15462,6 +15462,7 @@ ha_innobase::external_lock(
             				DBUG_RETURN(HA_ERR_NO_SUCH_TABLE);
             			}
             
            +			ib::info() << "flush " << m_prebuilt->table->name;
             			row_quiesce_table_start(m_prebuilt->table, trx);
             
             			/* Use the transaction instance to track UNLOCK
            

            I revised the test case accordingly (explicitly waiting for purge to complete before initiating the page flush) and reverted the instrumentation patch before running the tests.

            --source include/have_innodb.inc
            --source include/have_sequence.inc
             
            SET @saved_frequency = @@GLOBAL.innodb_purge_rseg_truncate_frequency;
            SET GLOBAL innodb_purge_rseg_truncate_frequency=1;
             
            CREATE TABLE t1 (
             a BIGINT PRIMARY KEY,
             b CHAR(255) NOT NULL DEFAULT '',
             c CHAR(255) NOT NULL DEFAULT '',
             d CHAR(255) NOT NULL DEFAULT '',
             e CHAR(255) NOT NULL DEFAULT '',
             f CHAR(255) NOT NULL DEFAULT '',
             g CHAR(255) NOT NULL DEFAULT '',
             h CHAR(255) NOT NULL DEFAULT ''
            ) ENGINE=InnoDB;
            INSERT INTO t1 (a) SELECT seq FROM seq_1_to_1000000;
            --source include/wait_all_purged.inc
             
            FLUSH TABLES t1 FOR EXPORT;
            UNLOCK TABLES;
            DROP TABLE t1;
            SET GLOBAL innodb_purge_rseg_truncate_frequency=@saved_frequency;
            

            I also ensured that the buffer pool is large enough:

            for opt in '' --mysqld=--skip-innodb-{log-optimize-ddl,doublewrite}
            do ./mtr --repeat=5 $opt --mysqld=--innodb-buffer-pool-size=32m \
            --mysqld=--skip-innodb-buffer-pool-{load-at-startup,dump-at-shutdown} \
            --mysqld=--innodb-page-size=4k innodb.writebench
            done
            

            With this test, we are seeing a similar difference:

            10.5 d46db415cef910c65b0e550a9d00fe188cfd8910

            configuration                                   first round  minimum  maximum  average
            baseline (doublewrite buffer enabled)           68613ms      38739ms  39035ms  38871ms
            --skip-log-optimize-ddl (optimization enabled)  67784ms      35784ms  38682ms  37828ms
            --skip-innodb-doublewrite                       64150ms      33085ms  34142ms  33524ms

            Apparently, the test time is dominated by something other than file writes. Finally, I gathered some counters to confirm that the time savings are due to I/O. I changed the interesting part of the test to:

            INSERT INTO t1 (a) SELECT seq FROM seq_1_to_1000;
            --source include/wait_all_purged.inc
             
            SHOW STATUS LIKE 'innodb_d%_writes';
            FLUSH TABLES t1 FOR EXPORT;
            SHOW STATUS LIKE 'innodb_d%_writes';
            

            (Yes, I also reduced the number of rows to make the test run faster.) Running each configuration with --repeat=2 produced the following numbers:

            configuration                                   innodb_data_writes+innodb_dblwr_writes
            baseline (doublewrite buffer enabled)           7+0, 1024+508, 1032+508, 2049+1016
            --skip-log-optimize-ddl (optimization enabled)  7+0, 516+0, 524+0, 1033+0
            --skip-innodb-doublewrite                       7+0, 516+0, 524+0, 1033+0

            The numbers are cumulative from the start of the server instance. So, in this artificial example, the doublewrite buffer is being skipped altogether thanks to our optimization. The arithmetic is consistent: after the first FLUSH, the baseline performed 1024-7=1017 data writes against 516-7=509 with the optimization, which matches each page being written both to the doublewrite buffer (508 writes) and to its final location. (Apparently we do not get to flush any pages in the system tablespace, because the test runs for such a short time that no log checkpoint is executed.)

            The reason why disabling the doublewrite buffer altogether improves performance even further could be that writes can then be combined. This can be improved further in MDEV-16526.


            People

              Assignee: marko Marko Mäkelä
              Reporter: marko Marko Mäkelä
              Votes: 0
              Watchers: 4
