  1. MariaDB Server
  2. MDEV-12353

Efficient InnoDB redo log record format

Details

    Description

      InnoDB crash recovery performance can be improved somewhat without changing the file format. MDEV-12699 removed unnecessary reads of pages that can be initialized via the redo log. MDEV-19586 will make recovery read the pages in a more sequential order.

      We should fix some fundamental issues that exist with the current InnoDB redo log record format:

      • Records do not contain their length. When buffering records, we must painstakingly parse entire records in order to determine the length. The idea of storing the length was mentioned in MySQL Bug#82937.
      • For B-tree operations, we are writing a lot of redundant data for mlog_parse_index(). We should use a lower-level format and lower-level apply functions. MySQL Bug#82176 merely speeds up the code around mlog_parse_index().
      • If a mini-transaction is writing multiple records to a page, the page identifier is being repeated for every record. We should omit the page identifier if multiple consecutive records are modifying the same page.

      In this task, we will only improve the redo log record format. Format changes to the redo log blocks and files will be covered by MDEV-14425.

          Activity

            marko Marko Mäkelä created issue -
            marko Marko Mäkelä made changes -
            Field Original Value New Value
            julien.fritsch Julien Fritsch made changes -
            Comment [ A comment with security level 'Developers' was removed. ]

            Related to this, in MariaDB 10.3, I removed or replaced the following redo log record types:

            • Remove MLOG_UNDO_ERASE_END
            • Replace MLOG_UNDO_INSERT with MLOG_WRITE_STRING, MLOG_2BYTES
            • Replace MLOG_UNDO_INIT with MLOG_2BYTES, MLOG_4BYTES

            Simplifying the redo logging of basic B-tree operations would bring much more benefit.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä added a comment - - edited

            I had to put back the MLOG_UNDO_INSERT and MLOG_UNDO_INIT redo log record types, because the lower-level records are occupying more space in the redo log, and thus measurably slowing down the server. This is a problem with the redo log record format: if a mini-transaction contains multiple log records for the same page, each record will repeat the tablespace ID and page number.

            marko Marko Mäkelä made changes -
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.1 [ 16100 ]
            Fix Version/s 10.3 [ 22126 ]
            marko Marko Mäkelä made changes -
            Affects Version/s 10.3 [ 22126 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Fix Version/s 10.4 [ 22408 ]
            NRE Projects RM_105_CANDIDATE
            Affects Version/s 10.4 [ 22408 ]
            elenst Elena Stepanova made changes -
            Fix Version/s 10.5 [ 23123 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Affects Version/s 10.2 [ 14601 ]
            Affects Version/s 10.0 [ 16000 ]
            Affects Version/s 10.1 [ 16100 ]
            Affects Version/s 10.3 [ 22126 ]
            Affects Version/s 10.4 [ 22408 ]
            Issue Type Bug [ 1 ] Task [ 3 ]
            marko Marko Mäkelä made changes -
            Description (original value): The InnoDB crash recovery performance could be improved. Some ideas (file format changes cannot be done in GA versions):
            # Read the to-be-recovered pages sorted by page number. Currently, recv_apply_hashed_log_recs() picks a ‘random page number’ from recv_sys->addr_hash and then reads a number of pages starting from that number.
            # Use a simpler redo log record format, and reduce the number of operations that require mlog_write_index(). For example, use MLOG_1BYTE instead of MLOG_REC_SEC_DELETE_MARK. This should also reduce the redo log volume.
            # Look at some contributed patches, such as [MySQL Bug#82937|https://bugs.mysql.com/bug.php?id=82937] (format change), [MySQL Bug#82176|https://bugs.mysql.com/bug.php?id=82176].
            Description (new value): the text shown under Description at the top of this issue.
            Summary (original value): Speed up InnoDB crash recovery
            Summary (new value): Efficient InnoDB redo log record format
            marko Marko Mäkelä made changes -
            Comment [ While fixing MDEV-17680, it occurred to me that in {{recv_apply_hashed_log_recs()}} we could easily filter out any log records that precede {{MLOG_INIT_FILE_PAGE2}}. If such a record is present for a page, we can simply create the page and apply the log records. There is absolutely no need to read the page, and possibly unnecessarily fail if the to-be-ignored page was corrupted. This should be fixed in MDEV-12699. ]

            I think that the records of a mini-transaction could be a stream of records that is always terminated by a NUL byte. The NUL byte would also act as padding in log blocks. We would remove the MLOG_SINGLE_REC_FLAG that currently identifies single-record mini-transactions.

            Each record would consist of the following parts, in this order:

            • The first byte of the record would contain a record type, flags, and a part of the length.
            • The optional second byte of the record would contain more of the length. (Not needed for short records.)
            • Optional (omitted when the "same page" flag is set): encoded tablespace identifier and page number. (The flag cannot be set on the first record of a mini-transaction.)
            • Optional (depending on the type code): byte offset on the page.
            • Finally, the payload bytes of the record.

            At the minimum, we would seem to need the following record types:

            • INIT: corresponds to MLOG_INIT_FILE_PAGE2
            • LOAD: corresponds to MLOG_INDEX_LOAD
            • WRITE: replaces MLOG_nBYTES, MLOG_WRITE_STRING
            • MEMSET: corresponds to the 10.4 MLOG_MEMSET record
            • MEMMOVE: used as a building block for logging page reorganize
            • INDEX_INIT: initialize a B-tree or R-tree page

            For ROW_FORMAT=COMPRESSED pages, we would mostly write WRITE records that would refer to the compressed page frames. Currently the MLOG_ records refer to the uncompressed page frames.

            We would seem to need 3 bits for the redo log record type. MLOG_FILE_ records and MLOG_CHECKPOINT can be represented by setting the "same page" flag at the start of the mini-transaction. After the first record within a mini-transaction that has this flag clear, there could not be any non-page redo log records.

            This would leave 4 bits for record length in the first byte. Values 1‥15 would represent lengths of 1 to 15 bytes. If the total length of the record is longer than 15 bytes, then the value 0 would be used to indicate that 1 to 3 length bytes will follow.
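
            As a minimal sketch of this bit layout, the first byte could be packed and unpacked as below. The bit positions (flag in the most significant bit, then 3 type bits, then 4 length bits) are inferred from the example bytes given later in this comment; the function names are hypothetical and not taken from the InnoDB source.

```python
SAME_PAGE = 0x80  # assumed: the "same page" flag is the most significant bit


def pack_first_byte(same_page: bool, type_code: int, length: int) -> int:
    """Pack the flag (1 bit), type code (3 bits) and short length (4 bits)."""
    if not 1 <= type_code <= 7:
        # Type code 0 is reserved: 0x00 ends a mini-transaction, 0x80 is special.
        raise ValueError("type code must be in 1..7")
    if not 0 <= length <= 15:
        # Length 0 means that separate length bytes follow (not modelled here).
        raise ValueError("short length must fit in 4 bits")
    return (SAME_PAGE if same_page else 0) | (type_code << 4) | length


def unpack_first_byte(first: int):
    """Return (same_page, type_code, length) for a first byte."""
    return bool(first & SAME_PAGE), (first >> 4) & 7, first & 15
```

            With this layout, a WRITE (type code 3) of 5 following bytes to the same page as the preceding record packs to 0xb5.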

            The encoded tablespace identifier and page number could use variable-length encoding, instead of always using 4+4 bytes.

            For encoding the byte offset for WRITE and MEMSET operations, we will use a variable-length encoding of 1‥3 bytes, instead of always logging 2 bytes. I would not use any delta coding where the byte offsets would be relative to preceding operations in the same mini-transaction. Keeping it simple allows recovery to group together redo log records for the same page from multiple independent mini-transactions.
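
            The comment does not specify the variable-length encoding itself. Purely as an illustration, and as an assumption rather than the proposed format, a prefix-coded scheme of 1‥3 bytes could look like this (the names encode_varint and decode_varint are hypothetical):

```python
def encode_varint(value: int) -> bytes:
    """Encode 0 <= value < 2**21 in 1 to 3 bytes (hypothetical scheme)."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value < 0x80:                      # 0xxxxxxx
        return bytes([value])
    if value < 0x4000:                    # 10xxxxxx xxxxxxxx
        return bytes([0x80 | (value >> 8), value & 0xFF])
    if value < 0x200000:                  # 110xxxxx xxxxxxxx xxxxxxxx
        return bytes([0xC0 | (value >> 16), (value >> 8) & 0xFF, value & 0xFF])
    raise ValueError("value does not fit in 3 bytes in this sketch")


def decode_varint(buf: bytes, pos: int = 0):
    """Return (value, next_pos) for the integer encoded at buf[pos]."""
    first = buf[pos]
    if first < 0x80:
        return first, pos + 1
    if first < 0xC0:
        return ((first & 0x3F) << 8) | buf[pos + 1], pos + 2
    if first < 0xE0:
        return ((first & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2], pos + 3
    raise ValueError("longer encodings are not part of this sketch")
```

            Under this assumed scheme, byte offsets below 0x80 (which covers the page header fields near the start of a page) take a single byte, and any offset within a 64KiB page fits in 3 bytes.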

            As mentioned at the start of this comment, the type byte 0 would be special, marking the end of a mini-transaction. We could use the corresponding flagged value 0x80 for something special, such as a future extension when more type codes are needed, or for encoding rarely needed redo log records.

            Examples:

            • INIT could be logged as 0x12 0x34 0x56, meaning "type code 1 (INIT), 2 bytes to follow" and "tablespace ID 0x34", "page number 0x56".
            • WRITE could be logged as 0x36 0x40 0x57 0x60 0x12 0x34 0x56, meaning "type code 3 (WRITE), 6 bytes to follow" and "tablespace ID 0x40", "page number 0x57", "byte offset 0x60", data 0x12,0x34,0x56.
            • A subsequent WRITE to the same page could be logged 0xb5 0x7f 0x23 0x34 0x56 0x78, meaning "same page, type code 3 (WRITE), 5 bytes to follow", "byte offset 0x7f", bytes 0x23,0x34,0x56,0x78.
            • The end of the mini-transaction would be indicated by a NUL byte.
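
            The example stream above can be decoded with a short sketch like the following. It assumes that the tablespace ID, page number and byte offset each occupy one byte (true for the small values in the examples), that only WRITE carries a byte offset, and that the type codes are 1 = INIT, 2 = LOAD, 3 = WRITE as in the examples. The function name parse_mtr is hypothetical, and extended length bytes are not handled.

```python
TYPE_NAMES = {1: "INIT", 2: "LOAD", 3: "WRITE"}  # codes as used in the examples
HAS_OFFSET = {3}  # assumption: of these types, only WRITE carries a byte offset


def parse_mtr(log: bytes):
    """Parse one mini-transaction; return (records, bytes_consumed)."""
    records = []
    pos = 0
    page = None  # (tablespace ID, page number) of the most recent record
    while True:
        first = log[pos]
        pos += 1
        if first == 0:  # NUL byte: end of the mini-transaction
            return records, pos
        same_page = bool(first & 0x80)
        type_code = (first >> 4) & 7
        length = first & 15
        if length == 0:
            raise NotImplementedError("extended length is not covered here")
        body = log[pos:pos + length]
        if len(body) != length:
            raise ValueError("truncated record")
        pos += length
        i = 0
        if not same_page:
            page = (body[0], body[1])  # assumed one byte each
            i = 2
        elif page is None:
            raise ValueError("'same page' flag on the first record")
        offset = None
        if type_code in HAS_OFFSET:
            offset = body[i]
            i += 1
        name = TYPE_NAMES.get(type_code, type_code)
        records.append((name, page, offset, bytes(body[i:])))


EXAMPLE = bytes([0x12, 0x34, 0x56,
                 0x36, 0x40, 0x57, 0x60, 0x12, 0x34, 0x56,
                 0xB5, 0x7F, 0x23, 0x34, 0x56, 0x78,
                 0x00])
```

            Calling parse_mtr(EXAMPLE) yields three records; the third inherits the page (0x40, 0x57) from the second, so the page identifier is logged only once for the two WRITE records.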

            kaamos, you expressed interest in this work in the 2019 New York Unconference. I would like to know your opinion about this.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä added a comment - - edited

            As explained in MDEV-19747, I hope that we can remove innodb_log_optimize_ddl and MLOG_INDEX_LOAD and related code. If the redo log volume is considerably reduced due to this format change, there should be no need to disable redo logging when ALTER TABLE is rebuilding tables or creating indexes. (There could be a separate global setting for disabling redo logging altogether, to speed up database initialization when crash recovery is not expected to work.)

            For MDEV-15528, MariaDB Server 10.4 introduced the record MLOG_INIT_FREE_PAGE that would allow us to punch holes or scrub pages after they have been freed.

            If MDEV-14425 introduces a separate log file for checkpoint and file metadata, we will not need MLOG_FILE_ or MLOG_CHECKPOINT records in the actual redo log files that contain changes that are to be applied to the data files. Thus, the ‘same page’ flag could never be set on the first record of a mini-transaction, and this could be used for future extension of the code space.

            This would leave us with the following format:

            First byte: ‘same page’ flag, record type, and a part of length.
            Record type:

            • WRITE: replaces MLOG_nBYTES, MLOG_WRITE_STRING
            • MEMSET: corresponds to the 10.4 MLOG_MEMSET record
            • MEMMOVE: used as a building block for logging page reorganize
            • INIT: corresponds to MLOG_INIT_FILE_PAGE2
            • INDEX_INIT: initializing a B-tree or R-tree page
            • FREE: corresponds to MLOG_INIT_FREE_PAGE (MDEV-15528)
            • RESERVED: reserved for future use (its presence prevents crash-downgrade)
            • OPTION: optional record that may be safely ignored; examples:
              • LSN of the previous change to a page (‘same page’ flag identifies the page)
              • MDEV-18976 page checksum at the start of the mini-transaction (‘same page’ flag identifies the page)
              • binlog record

            Note: The ‘same page’ flag can never be set on the INIT record, because the INIT record should only be issued for a freshly allocated page, and a single mini-transaction would not free and then allocate the same page. The ‘same page’ flag is also unlikely (but not impossible) to be set on the FREE record.

            Next, I will try to estimate the logged size of some operations when using these lower-level records. The most interesting ones are btr_page_reorganize() (which would use MEMMOVE) and page_move_rec_list_start() (which would use WRITE and MEMSET). The mode MTR_LOG_SHORT_INSERTS would be removed.


            Note: for index pages of ROW_FORMAT≠REDUNDANT, the old format writes MLOG_COMP_ records, which include index field lengths. These lengths will be omitted in the new format.

            Inserting records:

            Old format (MLOG_REC_INSERT and MLOG_LIST_END_COPY_CREATED), in page_cur_insert_rec_write_log():

            • MLOG_REC_INSERT(space_id,page_number,preceding_record_offset,inserted_record_offset). Omitted for MTR_LOG_SHORT_INSERTS. Instead, one MLOG_LIST_END_COPY_CREATED record will encapsulate multiple inserts.
            • record size (including header) and a flag whether either the size or the header size differs from the preceding record
            • If the size differs: 1 byte of info_bits of the to-be-inserted record, followed by header size and the first differing byte in the header
            • data bytes of the record (and optionally header)

            Note: When the changes are applied by page_cur_parse_insert_rec(), the page header and footer will be updated by page_cur_rec_insert() without them having been mentioned in the log. Applying the record can cause a crash if the previous page contents are inconsistent with what was logged, say, if an incorrect version of the page is available.

            New format:

            • MEMMOVE (offset,len,old_offset): copy some of the preceding record header and prefix of contents
            • WRITE (offset,len,data): write the part of the record that differs from the preceding record

            The ‘same page’ flag will be set on all but the first record for the page. Thus, the (space_id,page_number) will be logged only once.
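
            The MEMMOVE-plus-WRITE idea can be sketched as follows. This is a simplified model, not InnoDB code: records are treated as plain byte strings without the real record header layout, the tuple shapes and the function names apply and log_insert are hypothetical, and the min_move threshold is an arbitrary assumption.

```python
def apply(page: bytearray, rec) -> None:
    """Apply one low-level record to a page image."""
    kind = rec[0]
    if kind == "WRITE":
        _, offset, data = rec
        assert offset + len(data) <= len(page)
        page[offset:offset + len(data)] = data
    elif kind == "MEMMOVE":
        _, offset, length, old_offset = rec
        assert max(offset, old_offset) + length <= len(page)
        # The right-hand side is copied first, so overlapping ranges are safe.
        page[offset:offset + length] = page[old_offset:old_offset + length]
    elif kind == "MEMSET":
        _, offset, length, fill = rec
        assert offset + length <= len(page)
        page[offset:offset + length] = bytes([fill]) * length
    else:
        raise ValueError(f"unknown record type {kind!r}")


def log_insert(page, prev_offset, new_offset, new_rec, min_move=4):
    """Log the insert of new_rec at new_offset, reusing the preceding record."""
    prev = page[prev_offset:prev_offset + len(new_rec)]
    prefix = 0
    while prefix < len(prev) and prev[prefix] == new_rec[prefix]:
        prefix += 1
    recs = []
    if prefix >= min_move:
        # Copy the shared prefix from the preceding record instead of logging it.
        recs.append(("MEMMOVE", new_offset, prefix, prev_offset))
    else:
        prefix = 0  # too short to be worth a separate record
    if prefix < len(new_rec):
        recs.append(("WRITE", new_offset + prefix, bytes(new_rec[prefix:])))
    return recs
```

            For two adjacent records that share a long key prefix, only the differing tail is logged as data; the shared part costs a fixed-size MEMMOVE record regardless of its length.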

            In the new format, we must explicitly update the page header and footer as well:

            • WRITE (offset,len,data) of PAGE_LAST_INSERT and possibly PAGE_FREE, PAGE_GARBAGE, PAGE_DIRECTION_B, PAGE_N_DIRECTION
            • WRITE (offset,len,data) of PAGE_N_RECS and possibly PAGE_MAX_TRX_ID (this was logged separately earlier)
            • If a page directory slot is added: WRITE (offset, 1, data) of n_owned and MEMMOVE and WRITE (offset, 2, data) to add a page directory slot

            At the minimum, we must update 1 or 2 bytes of PAGE_LAST_INSERT and PAGE_N_RECS. The byte offset of both will fit in 1 byte, so these records will occupy 2 bytes plus the data length, that is, 2*(2+1) = 6 bytes minimum.

            For inserting multiple records, we need to update the page header and footer only once.

            Reorganizing a page: btr_page_reorganize_low()

            Old format:

            • MLOG_PAGE_REORGANIZE(space_id,page_number)

            Note: the index field lengths in MLOG_COMP_PAGE_REORGANIZE and MLOG_ZIP_PAGE_REORGANIZE can be very long.

            New format:

            • Compare the old and reorganized page (for ROW_FORMAT=COMPRESSED, the compressed frame)
            • WRITE the modified part of the page header
            • MEMSET the unused portion of the page
            • MEMMOVE and WRITE the record payload and page footer

            For reorganizing a page, we will obviously generate more log than earlier. Reorganizing pages should be a rather rare operation, so a possible size increase should be acceptable. Applying the records will be much simpler and faster.
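
            The "compare the old and reorganized page" step could be sketched as a byte-wise diff that emits WRITE records. This sketch only produces WRITE records (a real implementation would also use MEMMOVE and MEMSET as listed above), and the function name diff_to_writes and the max_gap heuristic are assumptions. Nearby differences are merged because every record carries its own type and offset bytes.

```python
def diff_to_writes(old: bytes, new: bytes, max_gap: int = 2):
    """Return a list of (offset, data) writes that turn old into new."""
    assert len(old) == len(new)
    writes = []
    n = len(new)
    i = 0
    while i < n:
        if old[i] == new[i]:
            i += 1
            continue
        start = i
        end = i + 1  # one past the last differing byte of this run
        j = end
        while j < n:
            if old[j] != new[j]:
                end = j + 1
                j += 1
            elif j - end < max_gap:
                j += 1  # tolerate a short run of unchanged bytes
            else:
                break
        writes.append((start, bytes(new[start:end])))
        i = end
    return writes
```

            Applying the returned writes in order to a copy of the old image reproduces the new image, which is exactly what recovery would do with the logged records.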

            Creating an index page: page_empty(), page_create()

            • Optionally: INIT the page (not part of page_empty())
            • INDEX_INIT will create most of the page header and trailer and zero-fill the payload area
            • Optionally: WRITE to set FIL_PAGE_TYPE, PAGE_LEVEL, PAGE_MAX_TRX_ID, because INDEX_INIT will not touch them

            The log volume should not be larger than with the old logging.

            Writing an undo log record:

            Old format:

            • MLOG_UNDO_INSERT(space_id, page_number, length, data)

            This is 1+1‥5+1‥5+2+length bytes, that is, 5‥13+length bytes.

            New format:

            • WRITE (space_id,page_number,TRX_UNDO_PAGE_HDR+TRX_UNDO_PAGE_FREE, 2, data)
            • WRITE (offset, length + 4, data)

            Size: 1+1+1‥5+1‥5+2 bytes for the first record, 1+1‥3+1‥3+length+4 bytes for the second (assuming that length + 4 > 15), or total 13‥25+length bytes.

            The overhead is 8‥12 bytes. Hopefully it will be more than compensated when logging record insertion (omitting index field length information).
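
            The size estimates above can be reproduced with a few lines of arithmetic. What each summand stands for is my reading of the formulas (for example, that the second "1" of the first new-format record is its byte offset), so treat the parameter names as assumptions.

```python
def old_undo_insert_size(length, space_bytes, page_bytes):
    """MLOG_UNDO_INSERT: type byte, space ID, page number, 2 bytes, data."""
    return 1 + space_bytes + page_bytes + 2 + length


def new_undo_insert_size(length, space_bytes, page_bytes, len_bytes, offset_bytes):
    """Two WRITE records: the 2-byte header field, then the undo record itself."""
    # First WRITE: type byte, 1-byte offset, space ID, page number, 2 data bytes.
    first = 1 + 1 + space_bytes + page_bytes + 2
    # Second WRITE ('same page'): type byte, length bytes, offset, length + 4 data.
    second = 1 + len_bytes + offset_bytes + length + 4
    return first + second


# Smallest and largest fixed costs, excluding the undo record data itself.
old_min, old_max = old_undo_insert_size(0, 1, 1), old_undo_insert_size(0, 5, 5)
new_min = new_undo_insert_size(0, 1, 1, 1, 1)
new_max = new_undo_insert_size(0, 5, 5, 3, 3)
print(old_min, old_max, new_min, new_max, new_min - old_min, new_max - old_max)
```

            The fixed costs come out as 5‥13 bytes for the old format and 13‥25 bytes for the new one, which gives the stated overhead of 8‥12 bytes.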

            marko Marko Mäkelä added a comment -
            Elkin Andrei Elkin made changes -
            Assignee Marko Mäkelä [ marko ] Andrei Elkin [ elkin ]
            Elkin Andrei Elkin made changes -
            Assignee Andrei Elkin [ elkin ] Marko Mäkelä [ marko ]

            The function page_copy_rec_list_end_no_locks() should be extended with an option for logging page reorganize operations. Otherwise, page reorganize would have to iterate over the record lists twice: first, to copy the records, and then, to write log for crash recovery. Reorganize can emit the smallest possible number of MEMMOVE records followed by WRITE records (to adjust the page header, next-record links and the footer) and a MEMSET to clear the unused area.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            marko Marko Mäkelä added a comment - - edited

            Deleting records

            Some overhead will be introduced for record deletion. In particular, the old MLOG_LIST_END_DELETE and MLOG_LIST_START_DELETE records only log the reference record. (Their MLOG_COMP_ variants write the index field lengths as well, so in that case there could be savings.)

            In the new low-level format, we must log each deleted record separately:

            • MEMMOVE (space_id,page_number,offset,2,old_offset) to make the preceding record point to the succeeding record
            • MEMMOVE (offset,2,old_offset) to make the record point to the start of the PAGE_GARBAGE list
            • WRITE to update various page header and trailer fields (this can be done once for deleting multiple records)
            • optionally, MEMSET to clean the payload area of the record in the PAGE_GARBAGE list

            Currently, page_create_empty() optimizes the case where the entire page becomes empty as a result of a deletion.

            We could optimize one more special case of deletion to reduce both redo log volume and the frequency of page reorganize operations: When deleting the last inserted record(s) such that the maximum heap number will be decremented, we could free the space altogether instead of putting the record to the PAGE_GARBAGE list for potential future same-or-smaller-size reallocation. In this way, the logging would be as follows:

            • MEMMOVE (space_id,page_number,offset,2,old_offset) to make the preceding record point to the successor of the being-deleted record(s)
            • MEMSET to zero out the being-deleted records
            • WRITE of page header and footer fields

            Updating records

            In the clustered index (which stores the data ordered by PRIMARY KEY), records can be ‘updated in place’ when neither the size nor the PRIMARY KEY of the record is changing. In secondary indexes, ‘update in place’ is very rare, usually only happening when the case of a case-insensitive PRIMARY KEY is changing.

            If an ‘update in place’ is not possible, InnoDB will execute delete and insert in the same page and possibly split the page. All this is covered by operations described above.

            For ‘update in place’, we previously wrote a MLOG_REC_UPDATE_IN_PLACE record, which includes some information that can be useless:

            • flags for undo logging and locking (basically, whether the following DB_TRX_ID,DB_ROLL_PTR fields should be ignored)
            • possibly ignored position of DB_TRX_ID column (1‥3 bytes)
            • possibly ignored value of DB_ROLL_PTR (7 bytes)
            • possibly ignored value of DB_TRX_ID (1‥7 bytes)
            • start offset of the record (2 bytes)
            • MLOG_COMP_REC_UPDATE_IN_PLACE for ROW_FORMAT≠REDUNDANT will encode index field lengths as well.

            In contrast to this, the new format would only write WRITE records, possibly optimized to MEMMOVE if the page already contains the to-be-written value somewhere else. MEMMOVE could save space not only for frequently occurring user column values, but also for logging DB_TRX_ID or part of DB_ROLL_PTR when the same transaction is updating multiple records in the same clustered index leaf page. Significant savings should be expected for logging updates.
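
            The choice between WRITE and MEMMOVE for an in-place update could be sketched like this. It is a simplified model under stated assumptions: the page is a flat byte array, the search is a plain substring search over the whole page, and the function name log_update and the min_move threshold are hypothetical.

```python
def log_update(page: bytearray, offset: int, value: bytes, min_move: int = 4):
    """Return the record that stores value at offset, or None if unchanged."""
    if page[offset:offset + len(value)] == value:
        return None  # nothing to log
    if len(value) >= min_move:
        # Look for an existing copy of the value elsewhere in the page image.
        src = page.find(value)
        if src != -1:
            return ("MEMMOVE", offset, len(value), src)
    return ("WRITE", offset, bytes(value))
```

            For example, once one record in a leaf page carries a given 6-byte DB_TRX_ID, later updates by the same transaction in that page can refer to the existing copy instead of logging the 6 bytes again.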


            When it comes to data file creation, we should implement strict write-ahead logging. Even with MDEV-18128 fixed, InnoDB is currently doing it wrong:

            1. create the file
            2. preallocate the file, zero-initialized
            3. write a dummy page 0
            4. write a MLOG_FILE_ record with the file name
            5. at some later point of time, write the actual contents of the data file from the buffer pool

            This is performing the change first, and logging it only after the fact. Currently, recovery never creates or initializes data files.

            If we did it properly, we could simply reduce it to the following:

            1. write log records to create the file and initialize page 0 with final contents (in a single mini-transaction)
            2. create the file
            3. preallocate the file, zero-initialized
            4. at some later point of time, write the actual contents of the data file from the buffer pool

            Recovery would create the file if needed, and there would be no issue whatsoever if recovery encountered an empty file or a file filled with zeroes.
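The proposed ordering can be sketched as follows. The record names `FILE_CREATE` and `INIT_PAGE`, the helpers `log_create` and `recover`, and the 4096-byte page size are assumptions made for illustration, not the actual InnoDB record types. The point is that once the log records exist, recovery can start from a missing file or from a zero-filled file and reach the same page 0.

```python
import os
import tempfile

PAGE_SIZE = 4096  # assumed page size for this sketch


def log_create(log: list, path: str, page0: bytes) -> None:
    """Step 1: log the file creation and the final contents of page 0."""
    log.append(("FILE_CREATE", path))
    log.append(("INIT_PAGE", path, 0, page0))


def recover(log: list) -> None:
    """Replay the log; create the file if it does not exist yet."""
    for rec in log:
        if rec[0] == "FILE_CREATE":
            open(rec[1], "ab").close()  # creates the file only if it is missing
        elif rec[0] == "INIT_PAGE":
            _, path, page_no, data = rec
            with open(path, "r+b") as f:
                f.seek(page_no * PAGE_SIZE)
                f.write(data)


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "t1.ibd")
    page0 = b"HDR" + bytes(PAGE_SIZE - 3)
    log = []
    log_create(log, path, page0)

    # Crash before the file was created: recovery creates it.
    recover(log)
    with open(path, "rb") as f:
        recovered_missing = f.read()

    # Crash after preallocation: the file exists but is all zeroes.
    with open(path, "wb") as f:
        f.write(bytes(4 * PAGE_SIZE))
    recover(log)
    with open(path, "rb") as f:
        recovered_zeroed = f.read()
```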

            serg Sergei Golubchik made changes -
            Priority Major [ 3 ] Critical [ 2 ]

            To prepare for this, I merged mtr_t::Impl and mtr_t::Command to mtr_t and removed unused or redundant data fields of mtr_t::Command.
            Further changes will be made to replace various mlog_write_ functions with member functions of mtr_t.

            mleich Matthias Leich made changes -
            Attachment simp_page_rec_set_n_owned.test [ 50212 ]
            mleich Matthias Leich made changes -
            Attachment prt [ 50214 ]

            I am still debugging changes of the new, efficient redo log record format. Several recovery tests are still failing, but some do pass.
            I rebased the bb-10.5-MDEV-12353 branch to the current 10.5, with preparatory commits that replace all high-level InnoDB redo log record types with lower-level ones (and temporarily introduce a record MLOG_ZIP_WRITE_STRING for physical logging on ROW_FORMAT=COMPRESSED pages). This will actually increase the redo log volume and could decrease performance.

            Once I have debugged the new redo log encoding (which I did not push yet), I will remove the code to recover from the old-format redo log. An upgrade after a crash of an earlier server would not be supported.


            The branch now includes an implementation of the new format, with the exception of the same_page flag. That is, even if a mini-transaction is writing multiple subsequent records for the same page, it will encode the page identifier in each record.
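To illustrate what the same_page flag saves, here is a toy encoder. The layout is an assumption: a 1-byte flag followed by fixed 4-byte space and page numbers. The actual record format is not specified in this comment, and the function name `encode_mtr` is hypothetical.

```python
def encode_mtr(records, use_same_page=True):
    """Encode (space_id, page_no, payload) records of one mini-transaction.

    The page identifier is omitted when a record modifies the same page as
    the record immediately before it.
    """
    out = []
    prev = None
    for space_id, page_no, payload in records:
        page_id = (space_id, page_no)
        if use_same_page and page_id == prev:
            out.append(b"\x01" + payload)  # same_page set: no page identifier
        else:
            out.append(b"\x00"
                       + space_id.to_bytes(4, "big")
                       + page_no.to_bytes(4, "big")
                       + payload)
        prev = page_id
    return out


mtr = [(5, 3, b"aa"), (5, 3, b"bb"), (5, 4, b"cc")]
with_flag = encode_mtr(mtr)
without_flag = encode_mtr(mtr, use_same_page=False)
sizes_with_flag = [len(r) for r in with_flag]
sizes_without_flag = [len(r) for r in without_flag]
```

Only consecutive records for the same page benefit; the third record targets a different page and carries its full identifier again.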


            I have now implemented the same_page encoding. In some cases, we are writing more redo log than earlier, and I must start assessing and fixing those. For inserting records, we can optimize some long WRITE records by emitting MEMMOVE for copying the preceding record, and then only writing the last bytes that differ between the records.
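The insert optimization mentioned above could look roughly like this. It is a simplified sketch: the helper names, the record tuples, and the `min_copy` threshold are hypothetical, and it only reuses the common prefix shared with the preceding record.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def encode_insert(page, prev_offset, prev_len, offset, new_rec, min_copy=4):
    """Log an insert as MEMMOVE of the shared prefix plus WRITE of the rest."""
    prev = bytes(page[prev_offset:prev_offset + prev_len])
    n = common_prefix_len(prev, new_rec)
    if n < min_copy:
        return [("WRITE", offset, new_rec)]
    log = [("MEMMOVE", offset, n, prev_offset)]
    if n < len(new_rec):
        log.append(("WRITE", offset + n, new_rec[n:]))
    return log


page = bytearray(64)
prev_rec = b"A" * 12 + b"0001"
new_rec = b"A" * 12 + b"0002"
page[16:32] = prev_rec

log = encode_insert(page, 16, len(prev_rec), 32, new_rec)

# Replay the log to check that it reproduces the inserted record.
for rec in log:
    if rec[0] == "MEMMOVE":
        _, dst, n, src = rec
        page[dst:dst + n] = page[src:src + n]
    else:
        _, dst, data = rec
        page[dst:dst + len(data)] = data
```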


            I optimized page_cur_insert_rec_low() and BtrBulk so that they write fewer WRITE records and try to copy data from the preceding record with MEMMOVE. A crude benchmark in MDEV-19747 indicates that the amount of redo log written is comparable to the old format. But we have not used all optimization potential yet; btr_page_reorganize() could be optimized further.

            I tested a variation of the microbenchmark that I originally developed for MDEV-19747:

            --source include/have_innodb.inc
            --source include/have_sequence.inc
             
            show status like 'innodb_lsn_current';
            SET profiling = 1;
            CREATE TABLE t1 (
             a BIGINT PRIMARY KEY,
             b CHAR(255) NOT NULL DEFAULT '',
             c CHAR(255) NOT NULL DEFAULT '',
             d CHAR(255) NOT NULL DEFAULT '',
             e CHAR(255) NOT NULL DEFAULT '',
             f CHAR(255) NOT NULL DEFAULT '',
             g CHAR(255) NOT NULL DEFAULT '',
             h CHAR(255) NOT NULL DEFAULT ''
            ) ENGINE=InnoDB;
            INSERT INTO t1 (a) SELECT seq FROM seq_1_to_500000;
            SHOW profiles;
            show status like 'innodb_lsn_current';
             
            --let $shutdown_timeout=0
            --source include/restart_mysqld.inc
            DROP TABLE t1;
            

            I executed it as follows:

            ./mtr --mysqld=--innodb-{page-size=4k,buffer-pool-size=64m,log-file-size=512m,force-recovery=2} innodb.recovery
            

            The test execution time seems comparable between the two branches. There is bound to be some variation, because log checkpoints and page flushing are nondeterministic. Up to the CREATE TABLE, the MDEV-12353 branch is writing slightly less log. At the end of the test, the LSN is slightly bigger for that branch. The new recovery code seems to be slightly faster, as expected.

            I specified innodb_force_recovery=2 in order to shut down the purge of transaction history. (MDEV-12288 resetting the DB_TRX_ID after the INSERT would be nondeterministic.)

            A debugging session with this test showed that the physical redo log recovery code is only invoking malloc() for adding log snippets to recv_sys.pages and for some operations on file names. For the old redo log format, we would be invoking malloc() at least in mlog_parse_index().

            This exercise revealed a recent performance regression which I fixed.


            Single-threaded performance is affected somewhat. Here is what I got for testing 10.5 immediately before/after MDEV-12353, built with clang 9.0.1 -O2 -march=native -mtune=native on an Intel Xeon E5-2630 (Haswell microarchitecture):

            ./mtr main.sum_distinct-big
            

            version                    Debug     RelWithDebInfo
            10.5 ac51bcfd8d3 (before)  152.600s  43.994s
            10.5 f8a9f906679 (after)   147.608s  51.003s

            If we omit the MyISAM part of the test and use CREATE TEMPORARY TABLE t2 or persistent CREATE TABLE t2 for the InnoDB test, the difference becomes as follows:

            version                    temporary  persistent
            10.5 ac51bcfd8d3 (before)  12.458s    21.831s
            10.5 f8a9f906679 (after)   12.808s    30.727s

            There seems to be some performance regression that I had overlooked earlier. This ought to be due to the INSERT logging that we will be optimizing further in MDEV-21724.

            marko Marko Mäkelä made changes -
            issue.field.resolutiondate 2020-02-14 05:01:19.0 2020-02-14 05:01:19.48
            marko Marko Mäkelä made changes -
            Fix Version/s 10.5.1 [ 24029 ]
            Fix Version/s 10.5 [ 23123 ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]

            The regression is due to increased redo log volume, and it does not affect small INSERT or UPDATE. It will be addressed in MDEV-21724 and MDEV-21725.


            The following test demonstrates that we are doing fine with in-place UPDATE:

            --source include/have_innodb.inc
            CREATE TABLE t2 (a BIGINT) ENGINE=InnoDB;
            delimiter |;
            CREATE PROCEDURE p(low BIGINT, high BIGINT)
            begin
              SHOW STATUS LIKE 'innodb_lsn_current';
              INSERT INTO t2 VALUES(low);
              WHILE low<high DO
                UPDATE t2 SET a=low;
                SET low = low + 1;
              END WHILE;
              SHOW STATUS LIKE 'innodb_lsn_current';
            end|
            delimiter ;|
            CALL p(9223372036854644735,9223372036854775807);
            DROP PROCEDURE p;
            DROP TABLE t2;
            

            ./mtr --mysqld=--innodb-force-recovery=2 main.ibu
            

            revision  LSN1    LSN2-LSN1   time
            before    61,792  27,098,532  3.92s
            after     52,361  22,173,911  3.41s

            Let us repeat the same in a single transaction (less overhead to manage undo log pages, or to flush the redo log):

            ./mtr --mysqld=--skip-autocommit --mysqld=--innodb-force-recovery=2 main.ibu
            

            revision  LSN1    LSN2-LSN1   time
            before    61,792  11,512,365  1.49s
            after     52,361  8,299,770   1.38s

            In both cases, the compact physical log format outperforms the old format, both in terms of redo log written and time elapsed.
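The figures above can be put into perspective with some arithmetic. This is only a rough sketch: LSN2-LSN1 also covers the initial INSERT and any commit or undo page overhead, so the "bytes per UPDATE" values are approximations, and the dictionary layout and the `summarize` helper are assumptions made for this illustration.

```python
# Number of loop iterations in p(low, high): the loop runs while low < high.
ITERATIONS = 9223372036854775807 - 9223372036854644735

# (LSN2-LSN1 in bytes, elapsed seconds), copied from the two result tables.
runs = {
    "autocommit": {"before": (27_098_532, 3.92), "after": (22_173_911, 3.41)},
    "single transaction": {"before": (11_512_365, 1.49), "after": (8_299_770, 1.38)},
}


def summarize(before, after):
    """Compare one run before and after the redo log format change."""
    (b_bytes, b_time), (a_bytes, a_time) = before, after
    return {
        "bytes_per_update_before": b_bytes / ITERATIONS,
        "bytes_per_update_after": a_bytes / ITERATIONS,
        "log_saved_pct": 100 * (b_bytes - a_bytes) / b_bytes,
        "time_saved_pct": 100 * (b_time - a_time) / b_time,
    }


summary = {name: summarize(r["before"], r["after"]) for name, r in runs.items()}
for name, s in summary.items():
    print(name, {k: round(v, 1) for k, v in s.items()})
```

With these inputs, the new format writes about 18% less log with autocommit and about 28% less in a single transaction.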

            marko Marko Mäkelä made changes -
            Fix Version/s 10.5.2 [ 24030 ]
            Fix Version/s 10.5.1 [ 24029 ]

            As far as I can tell, page_cur_insert_rec_low() dominates the perf top output for small (index) updates in the test case for MDEV-21534 (basically, the sysbench update_index workload). It even beats log_write_up_to() in most scenarios in the baseline.


            Before MDEV-21725 is completed, page reorganize operations will essentially copy the entire page payload to the redo log by invoking page_cur_insert_rec_low() for every record, instead of only writing a minimal amount of log to cover the physical changes to the page.

            Updating an indexed record will cause an insert, deferred delete (purge) and potentially a large number of page reorganize operations. Also, page splits and merges will cause deletions of record ranges, which have not been optimized yet either. (We write redundant log records for updating some page header fields multiple times, for example.) I think that in MDEV-21725, we should log the deletions of record ranges in the same way we would log a page reorganize (and combine the deletion with a reorganization). That should cure the observed regression.


            I now believe that MDEV-21724 has an even larger impact on performance. The amount of redo log written for an INSERT of an index record has almost doubled for records with small numbers of fields. To address that, we must introduce custom log records for those operations.


            I collected statistics on the amount of redo log written for the following SQL:

            --source include/have_innodb.inc
            CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=InnoDB;
            INSERT INTO t1 VALUES (1),(2);
            DROP TABLE t1;
            

            I used the following commands in GDB to identify the mini-transactions:

            break ha_innobase::write_row
            run
            break mtr_t::finish_write()
            command 2
            up 2
            up
            end
            continue
            continue
            …
            

            Here are the mini-transactions (excluding purge and log checkpoints):

            bytes(old)  bytes(new)  bytes(revised)  source                            operation
            48          59          48              trx_undo_report_row_operation()   INSERT (1) undo log
            36          47          27              row_ins_clust_index_entry_low()   INSERT (1) b-tree
            14          22          14              trx_undo_report_row_operation()   INSERT (2) undo log
            30          52          24              row_ins_clust_index_entry_low()   INSERT (2) b-tree
            88          49          49              trx_commit_low()                  COMMIT (INSERT)
            107         108         97              trx_undo_report_row_operation()   DROP TABLE
            20          17          17              row_upd_del_mark_clust_rec()
            8           8           8               row_upd_clust_step()
            63          72          64              trx_undo_report_row_operation()
            20          17          17              row_upd_del_mark_clust_rec()
            55          64          56              trx_undo_report_row_operation()
            20          17          17              row_upd_del_mark_clust_rec()
            52          61          53              trx_undo_report_row_operation()
            20          17          17              row_upd_del_mark_clust_rec()
            17          10          10              row_upd_sec_index_entry()
            36          46          37              trx_undo_report_row_operation()
            21          18          18              row_upd_del_mark_clust_rec()
            36          45          37              trx_undo_report_row_operation()
            21          18          18              row_upd_del_mark_clust_rec()
            19          18          18              fil_delete_tablespace()
            88          49          49              trx_commit_low()                  COMMIT (DROP TABLE)

            The third column shows the impact of introducing higher-level log records that correspond to MLOG_UNDO_INSERT and roughly correspond to MLOG_UNDO_INIT. Due to a different encoding, we sometimes lose 1 byte in our UNDO_APPEND replacement of MLOG_UNDO_INSERT.

            The high-level log records for inserting index tree records will be introduced in MDEV-21724. The deletion of records is not covered by this micro-benchmark, but it was improved earlier.

            For updates, the new encoding is outperforming the old encoding, as expected.
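As a quick cross-check of the table of mini-transactions, the three byte columns can be summed per group. The row data below is copied from that table; the grouping of the first four rows as "INSERT" and the helper name `totals` are choices made for this sketch.

```python
# (bytes old, bytes new, bytes revised, source) per mini-transaction
ROWS = [
    (48, 59, 48, "trx_undo_report_row_operation"),   # INSERT (1) undo log
    (36, 47, 27, "row_ins_clust_index_entry_low"),   # INSERT (1) b-tree
    (14, 22, 14, "trx_undo_report_row_operation"),   # INSERT (2) undo log
    (30, 52, 24, "row_ins_clust_index_entry_low"),   # INSERT (2) b-tree
    (88, 49, 49, "trx_commit_low"),                  # COMMIT (INSERT)
    (107, 108, 97, "trx_undo_report_row_operation"), # DROP TABLE
    (20, 17, 17, "row_upd_del_mark_clust_rec"),
    (8, 8, 8, "row_upd_clust_step"),
    (63, 72, 64, "trx_undo_report_row_operation"),
    (20, 17, 17, "row_upd_del_mark_clust_rec"),
    (55, 64, 56, "trx_undo_report_row_operation"),
    (20, 17, 17, "row_upd_del_mark_clust_rec"),
    (52, 61, 53, "trx_undo_report_row_operation"),
    (20, 17, 17, "row_upd_del_mark_clust_rec"),
    (17, 10, 10, "row_upd_sec_index_entry"),
    (36, 46, 37, "trx_undo_report_row_operation"),
    (21, 18, 18, "row_upd_del_mark_clust_rec"),
    (36, 45, 37, "trx_undo_report_row_operation"),
    (21, 18, 18, "row_upd_del_mark_clust_rec"),
    (19, 18, 18, "fil_delete_tablespace"),
    (88, 49, 49, "trx_commit_low"),                  # COMMIT (DROP TABLE)
]


def totals(rows):
    """Sum the old, new and revised byte counts of the given rows."""
    return tuple(sum(row[i] for row in rows) for i in range(3))


insert_totals = totals(ROWS[:4])
all_totals = totals(ROWS)
print("INSERT mini-transactions (old, new, revised):", insert_totals)
print("all mini-transactions    (old, new, revised):", all_totals)
```

The four INSERT mini-transactions are where the unrevised new format is clearly larger than the old one, which matches the explanation that inserts still need dedicated log records.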

            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            Labels performance recovery ServiceNow performance recovery
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            Labels ServiceNow performance recovery 76qDvLB8Gju6Hs7nk3VY3EX42G795W5z performance recovery
            serg Sergei Golubchik made changes -
            Labels 76qDvLB8Gju6Hs7nk3VY3EX42G795W5z performance recovery performance recovery
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 80109 ] MariaDB v4 [ 133192 ]
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 201658 167030
            Zendesk active tickets 201658

            People

              marko Marko Mäkelä
              marko Marko Mäkelä