MariaDB Server: MDEV-12353

Efficient InnoDB redo log record format

    Description

      InnoDB crash recovery performance can be improved somewhat without changing the file format: MDEV-12699 removed unnecessary reads of pages that can be initialized via the redo log, and MDEV-19586 will make recovery read the pages in a more sequential order.

      We should fix some fundamental issues that exist with the current InnoDB redo log record format:

      • Records do not contain their length. When buffering records, we must painstakingly parse entire records in order to determine the length. This idea was mentioned in MySQL Bug#82937.
      • For B-tree operations, we are writing a lot of redundant data for mlog_parse_index(). We should use a lower-level format and lower-level apply functions. MySQL Bug#82176 merely speeds up the code around mlog_parse_index().
      • If a mini-transaction is writing multiple records to a page, the page identifier is being repeated for every record. We should omit the page identifier if multiple consecutive records are modifying the same page.

      In this task, we will only improve the redo log record format. Format changes to the redo log blocks and files will be covered by MDEV-14425.
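Two of the ideas listed above (records that carry their own length, and omitting the page identifier when consecutive records modify the same page) can be illustrated with a minimal sketch. The byte layout, the flag bit and the function names below are hypothetical and chosen only for illustration; this is not the record format that MariaDB implements.

```python
import struct

# Hypothetical layout, NOT the actual MariaDB redo log format:
#   1 byte   flags (bit 0 set = same page as the previous record)
#   2 bytes  length of everything that follows in this record
#   8 bytes  page identifier (space id, page number); omitted if bit 0 is set
#   N bytes  payload
HEADER = struct.Struct(">BH")
PAGE_ID = struct.Struct(">II")
SAME_PAGE = 0x01


def encode(records):
    """Encode (space_id, page_no, payload) tuples into one byte string."""
    out = bytearray()
    previous = None
    for space_id, page_no, payload in records:
        page = (space_id, page_no)
        same = page == previous
        # Write the page identifier only when the page changes.
        body = payload if same else PAGE_ID.pack(*page) + payload
        out += HEADER.pack(SAME_PAGE if same else 0, len(body)) + body
        previous = page
    return bytes(out)


def record_sizes(log):
    """Find record boundaries from the length field alone, without
    interpreting any payload."""
    sizes, pos = [], 0
    while pos < len(log):
        _flags, length = HEADER.unpack_from(log, pos)
        sizes.append(HEADER.size + length)
        pos += HEADER.size + length
    return sizes


def decode(log):
    """Recover the (space_id, page_no, payload) tuples."""
    records, pos, page = [], 0, None
    while pos < len(log):
        flags, length = HEADER.unpack_from(log, pos)
        body = log[pos + HEADER.size:pos + HEADER.size + length]
        if not flags & SAME_PAGE:
            page = PAGE_ID.unpack_from(body)
            body = body[PAGE_ID.size:]
        records.append((*page, body))
        pos += HEADER.size + length
    return records


if __name__ == "__main__":
    mini_transaction = [(5, 3, b"abc"), (5, 3, b"de"), (5, 4, b"f")]
    log = encode(mini_transaction)
    print(record_sizes(log), len(log))
    assert decode(log) == mini_transaction
```

In this toy encoding, a reader that only buffers records can skip from header to header, and the second record in the example saves the 8 bytes of page identifier because it touches the same page as the first.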

          Activity

            The following test demonstrates that we are doing fine with in-place UPDATE:

            --source include/have_innodb.inc
            CREATE TABLE t2 (a BIGINT) ENGINE=InnoDB;
            delimiter |;
            CREATE PROCEDURE p(low BIGINT, high BIGINT)
            begin
              SHOW STATUS LIKE 'innodb_lsn_current';
              INSERT INTO t2 VALUES(low);
              WHILE low<high DO
                UPDATE t2 SET a=low;
                SET low = low + 1;
              END WHILE;
              SHOW STATUS LIKE 'innodb_lsn_current';
            end|
            delimiter ;|
            # Runs 131,072 (2^17) UPDATE statements; high is the maximum signed BIGINT value.
            CALL p(9223372036854644735,9223372036854775807);
            DROP PROCEDURE p;
            DROP TABLE t2;
            

            ./mtr --mysqld=--innodb-force-recovery=2 main.ibu
            

            revision  LSN1    LSN2-LSN1   time
            before    61,792  27,098,532  3.92s
            after     52,361  22,173,911  3.41s

            Let us repeat the same in a single transaction (less overhead to manage undo log pages, or to flush the redo log):

            ./mtr --mysqld=--skip-autocommit --mysqld=--innodb-force-recovery=2 main.ibu
            

            revision  LSN1    LSN2-LSN1   time
            before    61,792  11,512,365  1.49s
            after     52,361  8,299,770   1.38s

            In both cases, the compact physical log format outperforms the old format, both in terms of redo log written and time elapsed.
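To put the two measurements side by side, the relative savings can be computed directly from the figures reported above. This is only a small arithmetic sketch; the dictionary keys and the helper name `reduction_percent` are made up for the example.

```python
# (redo log bytes written, i.e. LSN2-LSN1, and elapsed seconds) as reported above
RUNS = {
    "autocommit": {"before": (27_098_532, 3.92), "after": (22_173_911, 3.41)},
    "single transaction": {"before": (11_512_365, 1.49), "after": (8_299_770, 1.38)},
}


def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    for name, run in RUNS.items():
        (bytes_before, time_before), (bytes_after, time_after) = run["before"], run["after"]
        print(f"{name}: "
              f"{reduction_percent(bytes_before, bytes_after):.1f}% less redo log, "
              f"{reduction_percent(time_before, time_after):.1f}% less time")
```

By this calculation the redo log volume shrinks by roughly 18% with autocommit and roughly 28% in a single transaction, while the elapsed time improves by a smaller margin in both runs.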

            Posted by Marko Mäkelä (marko).

            As far as I can tell, page_cur_insert_rec_low() dominates the perf top output for small (index) updates in the test case for MDEV-21534 (basically, the sysbench update_index workload). It even beats log_write_up_to() in most scenarios in the baseline.

            Posted by Vladislav Vaintroub (wlad).

            Before MDEV-21725 is completed, page reorganize operations will essentially copy the entire page payload to the redo log by invoking page_cur_insert_rec_low() for every record, instead of only writing a minimal amount of log to cover the physical changes to the page.

            Updating an indexed record will cause an insert, a deferred delete (purge) and potentially a large number of page reorganize operations. Also, page splits and merges will cause deletions of record ranges, which have not been optimized yet either. (We write redundant log records for updating some page header fields multiple times, for example.) I think that in MDEV-21725, we should log the deletions of record ranges in the same way we would log a page reorganize (and combine the deletion with a reorganization). That should cure the observed regression.

            Posted by Marko Mäkelä (marko).

            I now believe that MDEV-21724 has an even larger impact on performance. The amount of redo log written for an INSERT of an index record has almost doubled for records with small numbers of fields. To address that, we must introduce custom log records for those operations.

            Posted by Marko Mäkelä (marko).

            I collected statistics on the amount of redo log written for the following SQL:

            --source include/have_innodb.inc
            CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=InnoDB;
            INSERT INTO t1 VALUES (1),(2);
            DROP TABLE t1;
            

            I used the following commands in GDB to identify the mini-transactions:

            break ha_innobase::write_row
            run
            break mtr_t::finish_write()
            command 2
            up 2
            up
            end
            continue
            continue
            …
            

            Here are the mini-transactions (excluding purge and log checkpoints):

            bytes(old)  bytes(new)  bytes(revised)  source                           operation
            48          59          48              trx_undo_report_row_operation()  INSERT (1) undo log
            36          47          27              row_ins_clust_index_entry_low()  INSERT (1) b-tree
            14          22          14              trx_undo_report_row_operation()  INSERT (2) undo log
            30          52          24              row_ins_clust_index_entry_low()  INSERT (2) b-tree
            88          49          49              trx_commit_low()                 COMMIT (INSERT)
            107         108         97              trx_undo_report_row_operation()  DROP TABLE
            20          17          17              row_upd_del_mark_clust_rec()
            8           8           8               row_upd_clust_step()
            63          72          64              trx_undo_report_row_operation()
            20          17          17              row_upd_del_mark_clust_rec()
            55          64          56              trx_undo_report_row_operation()
            20          17          17              row_upd_del_mark_clust_rec()
            52          61          53              trx_undo_report_row_operation()
            20          17          17              row_upd_del_mark_clust_rec()
            17          10          10              row_upd_sec_index_entry()
            36          46          37              trx_undo_report_row_operation()
            21          18          18              row_upd_del_mark_clust_rec()
            36          45          37              trx_undo_report_row_operation()
            21          18          18              row_upd_del_mark_clust_rec()
            19          18          18              fil_delete_tablespace()
            88          49          49              trx_commit_low()                 COMMIT (DROP TABLE)

            The third column shows the impact of introducing higher-level log records that correspond to MLOG_UNDO_INSERT and roughly correspond to MLOG_UNDO_INIT. Due to a different encoding, we sometimes lose 1 byte in our UNDO_APPEND replacement of MLOG_UNDO_INSERT.

            The high-level log records for inserting index tree records will be introduced in MDEV-21724. The deletion of records is not covered by this micro-benchmark, but it was improved earlier.

            For updates, the new encoding is outperforming the old encoding, as expected.
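The per-mini-transaction figures above are easier to compare once they are summed. The sketch below copies the byte counts from the table and totals them overall and per source function; the names `ROWS`, `totals` and `totals_by_source` are made up for this example.

```python
from collections import defaultdict

# (bytes_old, bytes_new, bytes_revised, source), copied from the table above
ROWS = [
    (48, 59, 48, "trx_undo_report_row_operation()"),
    (36, 47, 27, "row_ins_clust_index_entry_low()"),
    (14, 22, 14, "trx_undo_report_row_operation()"),
    (30, 52, 24, "row_ins_clust_index_entry_low()"),
    (88, 49, 49, "trx_commit_low()"),
    (107, 108, 97, "trx_undo_report_row_operation()"),
    (20, 17, 17, "row_upd_del_mark_clust_rec()"),
    (8, 8, 8, "row_upd_clust_step()"),
    (63, 72, 64, "trx_undo_report_row_operation()"),
    (20, 17, 17, "row_upd_del_mark_clust_rec()"),
    (55, 64, 56, "trx_undo_report_row_operation()"),
    (20, 17, 17, "row_upd_del_mark_clust_rec()"),
    (52, 61, 53, "trx_undo_report_row_operation()"),
    (20, 17, 17, "row_upd_del_mark_clust_rec()"),
    (17, 10, 10, "row_upd_sec_index_entry()"),
    (36, 46, 37, "trx_undo_report_row_operation()"),
    (21, 18, 18, "row_upd_del_mark_clust_rec()"),
    (36, 45, 37, "trx_undo_report_row_operation()"),
    (21, 18, 18, "row_upd_del_mark_clust_rec()"),
    (19, 18, 18, "fil_delete_tablespace()"),
    (88, 49, 49, "trx_commit_low()"),
]


def totals(rows):
    """Return (old, new, revised) byte totals over all rows."""
    return tuple(sum(row[i] for row in rows) for i in range(3))


def totals_by_source(rows):
    """Return {source: [old, new, revised]} byte totals."""
    result = defaultdict(lambda: [0, 0, 0])
    for old, new, revised, source in rows:
        for i, value in enumerate((old, new, revised)):
            result[source][i] += value
    return dict(result)


if __name__ == "__main__":
    print("total (old, new, revised):", totals(ROWS))
    for source, (old, new, revised) in totals_by_source(ROWS).items():
        print(f"{source:33} {old:4} {new:4} {revised:4}")
```

For this micro-benchmark the totals come to 819 bytes (old), 814 bytes (new) and 695 bytes (revised), and the per-source sums show that the remaining growth relative to the old format is concentrated in the undo log records.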

            Posted by Marko Mäkelä (marko), edited.
