[MDEV-12353] Efficient InnoDB redo log record format - Jira

Details

Type: Task
Status: Closed
Priority: Critical
Resolution: Fixed
Fix Version/s: 10.5.2
Component/s: Storage Engine - InnoDB
Labels:
- performance
- recovery

Description

The InnoDB crash recovery performance can be improved a little while not changing the file format. ~~MDEV-12699~~ removed unnecessary reads of pages that can be initialized via the redo log. ~~MDEV-19586~~ will make recovery read the pages in more sequential order.

We should fix some fundamental issues that exist with the current InnoDB redo log record format:

Records do not contain their length. When buffering records, we must painstakingly parse entire records in order to determine the length. This idea was mentioned in MySQL Bug#82937.
For B-tree operations, we are writing a lot of redundant data for mlog_parse_index(). We should use a lower-level format and lower-level apply functions. MySQL Bug#82176 merely speeds up the code around mlog_parse_index().
If a mini-transaction is writing multiple records to a page, the page identifier is being repeated for every record. We should omit the page identifier if multiple consecutive records are modifying the same page.

In this task, we will only improve the redo log record format. Format changes to the redo log blocks and files will be covered by ~~MDEV-14425~~.

Attachments


prt (60 kB, 2020-01-06 18:43)
simp_page_rec_set_n_owned.test (0.7 kB, 2020-01-06 18:43)
simp_page_rec_set_n_owned-master.opt (0.1 kB, 2020-01-06 18:43)

Issue Links

causes

MDEV-21724 Optimize page_cur_insert_rec_low() redo logging

Closed

MDEV-21725 Optimize btr_page_reorganize_low() redo logging

Stalled

MDEV-21727 Optimize redo logging for ROW_FORMAT=COMPRESSED

Open

MDEV-21744 Assertion `!rec_offs_nth_sql_null(offsets, n)' failed in btr_cur_upd_rec_in_place

Closed

MDEV-21748 ASAN use-after-poison in PageBulk::insertPage on table rebuild

Closed

MDEV-21749 storage/innobase/page/page0cur.cc:1306: rec_t* page_cur_insert_rec_low(const page_cur_t*, dict_index_t*, const rec_t*, offset_t*, mtr_t*): Assertion `rdm - rd + bd <= insert_buf + rec_size' failed.

Closed

MDEV-21779 Assertion `!memcmp(insert_buf, rec - extra_size, rec_size)' failed in page_cur_insert_rec_low

Closed

MDEV-21850 ASAN use-after-poison in page_cur_insert_rec_low

Closed

MDEV-21892 Assertion `index != clust_index || row_get_rec_trx_id(rec, index, offsets)' failed in row_search_mvcc

Closed

MDEV-21893 Assertion `log_block_get_start_lsn(lsn, log_block_get_hdr_no(buf)) == lsn' failed upon attempt to start on encrypted datadir from previous versions

Closed

MDEV-21945 Assertion w==OPT failed in trx_purge_add_undo_to_history()

Closed

MDEV-21949 key rotation for innodb_encrypt_log is not working in 10.5

Closed

MDEV-22107 Restore accidentally orphaned MTR_MEMO_MODIFY

Closed

MDEV-22108 Crash recovery fails with [ERROR] InnoDB: Malformed log record

Closed

MDEV-22242 B-trees can become extremely skewed

Closed

MDEV-23806 Undo page corruption on recovery

Closed

MDEV-24196 WITH_UBSAN runtime error: member call on null pointer of type 'struct log_phys_t'

Closed

MDEV-24412 Mariadb 10.5: InnoDB: Upgrade after a crash is not supported

Closed

MDEV-24652 mtr fails while reusing the cached undo log block

Closed

MDEV-24848 Assertion `rlen < llen' failed in log_phys_t::apply_status log_phys_t::apply upon MariaBackup prepare

Closed

MDEV-27059 page_zip_dir_insert() may corrupt ROW_FORMAT=COMPRESSED tables

Closed

MDEV-27444 Perform backup prepare using mariabbackup 10.4 version when performing rolling upgrade on joiner node with 10.5

Stalled

MDEV-27486 Refuse Galera SST if major version of donor and joiner are different

Stalled

MDEV-28731 Race condition on log checkpoint

Closed

is blocked by

MDEV-19747 Deprecate and ignore innodb_log_optimize_ddl

Closed

MDEV-21174 Refactor mlog_write_ulint, mlog_memset, mlog_write_string

Closed

MDEV-21351 Replace recv_sys.heap with list of buf_block_t*

Closed

MDEV-21674 purge_sys.stop() no longer waits for purge workers to complete

Closed

relates to

MDEV-12699 Improve crash recovery of corrupted data pages

Closed

MDEV-14425 Change the InnoDB redo log format to reduce write amplification

Closed

MDEV-18976 Implement a CHECKSUM redo log record for improved validation

Closed

MDEV-19586 Replace recv_sys_t::addr_hash with a std::map

Closed

MDEV-20562 btr_cur_open_at_rnd_pos() fails to return error for corrupted page

Closed

MDEV-20584 Clearing delete-mark on R-tree pages does not appear crash-safe

Closed

MDEV-20636 Potential SPATIAL INDEX corruption with ROW_FORMAT=COMPRESSED

Open

MDEV-21024 InnoDB is issuing redundant writes to redo log

Closed

MDEV-22126 Rename confusing constant mtr_t::OPT

Closed

MDEV-23136 InnoDB init fail after upgrade from 10.4 to 10.5

Closed

MDEV-23986 [ERROR] [FATAL] InnoDB: Page ... name ... page_type ... key_version 1 lsn ... compressed_len ...

Closed

MDEV-27437 Galera snapshot transfer fails to upgrade between some major versions

Closed

MDEV-28256 Galera SST fails with 'InnoDB: Upgrade after a crash is not supported'

Closed

MDEV-29153 2022-07-22 18:12:56 0x7f3319fff700 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.4.25/storage/innobase/page/page0cur.cc line 11

Closed

MDEV-32144 Debug assertion failure w == MAYBE_NOP in mtr_t::memcpy(), trx_undo_write_xid()

Closed

MDEV-32445 InnoDB may corrupt its log before upgrading it on startup

Closed

MDEV-33274 The test encryption.innodb-redo-nokeys often fails

Closed

MDEV-15274 innodb_gis.rtree_recovery failed in buildbot with error on check table

Closed

MDEV-19747 Deprecate and ignore innodb_log_optimize_ddl

Closed

MDEV-20608 innodb_log_optimize_ddl=OFF may omit some redo log

Closed

MDEV-21899 INSERT into a secondary index with zero-data-length key is not crash-safe

Closed

MDEV-22097 Not applying DELETE_ROW_FORMAT_REDUNDANT due to corruption

Closed

MDEV-23474 InnoDB fails to start with Invalid log block checksum after unsetting innodb_log_checksums dynamically

Closed

MDEV-24719 backport MDEV-24705 from 10.5 to 10.2, 10.3, 10.4

Closed

MDEV-26322 Last binlog file and position are "empty" in mariabackup --prepare output

Closed

MDEV-31042 Certain tables are missing in innodb storage engine of a TSE enabled database after an incremental backup restore

Closed


(19 causes, 4 is blocked by, 26 relates to, 6 mentioned in)

Activity


Marko Mäkelä created issue - 2017-03-24 08:35

Marko Mäkelä made changes - 2017-09-17 13:55

Field	Original Value	New Value
Link		This issue relates to ~~MDEV-13830~~ [ ~~MDEV-13830~~ ]

Julien Fritsch made changes - 2017-09-22 07:14

Comment

[ A comment with security level 'Developers' was removed. ]

Marko Mäkelä added a comment - 2017-12-19 16:38

Related to this, in MariaDB 10.3, I removed the following redo log record types:

Remove MLOG_UNDO_ERASE_END
Replace MLOG_UNDO_INSERT with MLOG_WRITE_STRING, MLOG_2BYTES
Replace MLOG_UNDO_INIT with MLOG_2BYTES, MLOG_4BYTES

Simplifying the redo logging of basic B-tree operations would bring much more benefit.

Marko Mäkelä made changes - 2018-06-07 09:25

Link

This issue relates to ~~MDEV-14425~~ [ ~~MDEV-14425~~ ]

Marko Mäkelä added a comment - 2018-06-07 09:25 - edited

I had to put back the MLOG_UNDO_INSERT and MLOG_UNDO_INIT redo log record types, because the lower-level records are occupying more space in the redo log, and thus measurably slowing down the server. This is a problem with the redo log record format: if a mini-transaction contains multiple log records for the same page, each record will repeat the tablespace ID and page number.


Marko Mäkelä made changes - 2018-06-07 09:25

Fix Version/s		10.4 [ 22408 ]
Fix Version/s	10.2 [ 14601 ]
Fix Version/s	10.1 [ 16100 ]
Fix Version/s	10.3 [ 22126 ]

Marko Mäkelä made changes - 2018-06-07 09:26

Affects Version/s

10.3 [ 22126 ]

Marko Mäkelä made changes - 2018-11-20 14:14

Link

This issue relates to ~~MDEV-12699~~ [ ~~MDEV-12699~~ ]

Marko Mäkelä made changes - 2019-03-28 11:20

Fix Version/s	10.4 [ 22408 ]
NRE Projects		RM_105_CANDIDATE
Affects Version/s		10.4 [ 22408 ]

Elena Stepanova made changes - 2019-04-28 15:39

Fix Version/s

10.5 [ 23123 ]

Marko Mäkelä made changes - 2019-05-24 12:51

Link

This issue relates to ~~MDEV-19586~~ [ ~~MDEV-19586~~ ]

Marko Mäkelä made changes - 2019-05-24 12:54

Affects Version/s	10.2 [ 14601 ]
Affects Version/s	10.0 [ 16000 ]
Affects Version/s	10.1 [ 16100 ]
Affects Version/s	10.3 [ 22126 ]
Affects Version/s	10.4 [ 22408 ]
Issue Type	Bug [ 1 ]	Task [ 3 ]

Marko Mäkelä made changes - 2019-05-24 13:09

Description (original value):
The InnoDB crash recovery performance could be improved. Some ideas (file format changes cannot be done in GA versions):
1. Read the to-be-recovered pages sorted by page number. Currently, recv_apply_hashed_log_recs() picks a ‘random page number’ from recv_sys->addr_hash and then reads a number of pages starting from that number.
2. Use a simpler redo log record format, and reduce the number of operations that require mlog_write_index(). For example, use MLOG_1BYTE instead of MLOG_REC_SEC_DELETE_MARK. This should also reduce the redo log volume.
3. Look at some contributed patches, such as MySQL Bug#82937 (https://bugs.mysql.com/bug.php?id=82937) (format change), MySQL Bug#82176 (https://bugs.mysql.com/bug.php?id=82176).

Description (new value):
The InnoDB crash recovery performance can be improved a little while not changing the file format. ~~MDEV-12699~~ removed unnecessary reads of pages that can be initialized via the redo log. ~~MDEV-19586~~ will make recovery read the pages in more sequential order.
We should fix some fundamental issues that exist with the current InnoDB redo log record format:
Records do not contain their length. When buffering records, we must painstakingly parse entire records in order to determine the length. This idea was mentioned in MySQL Bug#82937 (https://bugs.mysql.com/bug.php?id=82937).
For B-tree operations, we are writing a lot of redundant data for mlog_parse_index(). We should use a lower-level format and lower-level apply functions. MySQL Bug#82176 (https://bugs.mysql.com/bug.php?id=82176) merely speeds up the code around mlog_parse_index().
If a mini-transaction is writing multiple records to a page, the page identifier is being repeated for every record. We should omit the page identifier if multiple consecutive records are modifying the same page.
In this task, we will only improve the redo log record format. Format changes to the redo log blocks and files will be covered by ~~MDEV-14425~~.
Summary	Speed up InnoDB crash recovery	Efficient InnoDB redo log record format

Marko Mäkelä made changes - 2019-05-24 13:40

Comment

[ While fixing ~~MDEV-17680~~, it occurred to me that in {{recv_apply_hashed_log_recs()}} we could easily filter out any log records that precede {{MLOG_INIT_FILE_PAGE2}}. If such a record is present for a page, we can simply create the page and apply the log records. There is absolutely no need to read the page, and possibly unnecessarily fail if the to-be-ignored page was corrupted. This should be fixed in ~~MDEV-12699~~. ]

Marko Mäkelä added a comment - 2019-05-24 14:49

I think that the records of a mini-transaction could be a stream of records that is always terminated by a NUL byte. The NUL byte would also act as padding in log blocks. We would remove the MLOG_SINGLE_REC_FLAG that currently identifies single-record mini-transactions.

The first byte of the record would contain a record type, flags, and a part of length.
The optional second byte of the record will contain more length. (Not needed for short records.)
Optional (unless a flag "same page" is set): encoded tablespace identifier and page number. (The flag cannot be set on the first record of a mini-transaction.)
Optional (depending on the type code): byte offset on the page
Finally, the payload bytes of the record.

At the minimum, we would seem to need the following record types:

INIT: corresponds to MLOG_INIT_FILE_PAGE2
LOAD: corresponds to MLOG_INDEX_LOAD
WRITE: replaces MLOG_nBYTES, MLOG_WRITE_STRING
MEMSET: corresponds to the 10.4 MLOG_MEMSET record
MEMMOVE: used as a building block for logging page reorganize
INDEX_INIT: initialize a B-tree or R-tree page

For ROW_FORMAT=COMPRESSED pages, we would mostly write WRITE records that would refer to the compressed page frames. Currently the MLOG_ records refer to the uncompressed page frames.

We would seem to need 3 bits for the redo log record type. MLOG_FILE_ records and MLOG_CHECKPOINT can be represented by setting the "same page" flag at the start of the mini-transaction. After the first record within a mini-transaction that has this flag clear, there could not be any non-page redo log records.

This would leave 4 bits for record length in the first byte. Values 1‥15 would represent lengths of 1 to 15 bytes. If the total length of the record is longer than 15 bytes, then the value 0 would be used to indicate that 1 to 3 length bytes will follow.
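
A minimal sketch of how such a first byte could be packed and unpacked. The bit positions (flag in bit 7, type in bits 6 to 4, length in bits 3 to 0) are an assumption inferred from the example byte values in this comment, and the function names are hypothetical.

```python
SAME_PAGE = 0x80  # assumed: bit 7 marks "same page as the previous record"

def pack_first_byte(rec_type: int, length: int, same_page: bool = False) -> int:
    """Pack the flag, a 3-bit type code and a 4-bit length into one byte."""
    if not 0 <= rec_type <= 7:
        raise ValueError("record type must fit in 3 bits")
    if length < 1:
        raise ValueError("a record has at least 1 byte following")
    # Lengths 1..15 are stored directly; 0 means extra length bytes follow.
    nibble = length if length <= 15 else 0
    return (SAME_PAGE if same_page else 0) | (rec_type << 4) | nibble

def unpack_first_byte(b: int):
    """Return (same_page, rec_type, length_nibble) for a first byte."""
    return bool(b & SAME_PAGE), (b >> 4) & 7, b & 0x0F
```

For instance, `pack_first_byte(3, 5, same_page=True)` yields 0xb5, and a record longer than 15 bytes gets a zero length nibble so that the parser knows to read the additional length bytes.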

The encoded tablespace identifier and page number could use variable-length encoding, instead of always using 4+4 bytes.

For encoding the byte offset for WRITE and MEMSET operations, we will use a variable-length encoding of 1‥3 bytes, instead of always logging 2 bytes. I would not use any delta coding where the byte offsets would be relative to preceding operations in the same mini-transaction. Keeping it simple allows recovery to group together redo log records for the same page from multiple independent mini-transactions.
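
One possible shape for such a 1‥3 byte encoding is sketched here. The prefix scheme (0xxxxxxx, 10xxxxxx, 110xxxxx) is purely an illustrative assumption and is not claimed to be the encoding that was eventually implemented; the function names are hypothetical.

```python
def encode_offset(value: int) -> bytes:
    """Encode a non-negative integer in 1 to 3 bytes (hypothetical scheme)."""
    if value < 0:
        raise ValueError("offset must be non-negative")
    if value < 0x80:                      # 7 bits: 0xxxxxxx
        return bytes([value])
    if value < 0x4000:                    # 14 bits: 10xxxxxx xxxxxxxx
        return bytes([0x80 | (value >> 8), value & 0xFF])
    if value < 0x200000:                  # 21 bits: 110xxxxx + 2 bytes
        return bytes([0xC0 | (value >> 16), (value >> 8) & 0xFF, value & 0xFF])
    raise ValueError("offset does not fit in 3 bytes")

def decode_offset(buf: bytes, pos: int = 0):
    """Decode one value starting at pos; return (value, next_pos)."""
    first = buf[pos]
    if first < 0x80:
        return first, pos + 1
    if first < 0xC0:
        return ((first & 0x3F) << 8) | buf[pos + 1], pos + 2
    if first < 0xE0:
        value = ((first & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2]
        return value, pos + 3
    raise ValueError("unsupported prefix")
```

Because each value is absolute rather than a delta, a decoded record can be applied without knowing what preceded it in the mini-transaction, which is the property this paragraph argues for.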

As mentioned at the start of this comment, the type byte 0 would be special, marking the end of a mini-transaction. We could use the corresponding flagged value 0x80 for something special, such as a future extension when more type codes are needed, or for encoding rarely needed redo log records.

Examples:

INIT could be logged as 0x12 0x34 0x56, meaning "type code 1 (INIT), 2 bytes to follow" and "tablespace ID 0x34", "page number 0x56".
WRITE could be logged as 0x36 0x40 0x57 0x60 0x12 0x34 0x56, meaning "type code 3 (WRITE), 6 bytes to follow" and "tablespace ID 0x40", "page number 0x57", "byte offset 0x60", data 0x12,0x34,0x56.
A subsequent WRITE to the same page could be logged 0xb5 0x7f 0x23 0x34 0x56 0x78, meaning "same page, type code 3 (WRITE), 5 bytes to follow", "byte offset 0x7f", bytes 0x23,0x34,0x56,0x78.
The end of the mini-transaction would be indicated by a NUL byte.
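
The example byte stream can be walked with a small parser. This is only a sketch under simplifying assumptions: the tablespace ID, page number and byte offset each occupy a single byte (as in the examples), only lengths 1‥15 are handled, and the name `parse_mtr` is hypothetical.

```python
INIT, WRITE = 1, 3  # type codes as used in the examples

def parse_mtr(buf: bytes):
    """Parse records up to the terminating NUL byte; return a list of dicts."""
    records, pos, page = [], 0, None
    while buf[pos] != 0:                      # NUL ends the mini-transaction
        first = buf[pos]
        same_page = bool(first & 0x80)
        rec_type, length = (first >> 4) & 7, first & 0x0F
        if length == 0:
            raise NotImplementedError("extended length bytes are not sketched")
        body = buf[pos + 1 : pos + 1 + length]
        if len(body) != length:
            raise ValueError("truncated record")
        if same_page:
            if page is None:
                raise ValueError("'same page' flag on the first record")
        else:
            # Assumed: 1-byte tablespace ID followed by 1-byte page number.
            page, body = (body[0], body[1]), body[2:]
        rec = {"type": rec_type, "page": page}
        if rec_type == WRITE:
            rec["offset"], rec["data"] = body[0], bytes(body[1:])
        records.append(rec)
        pos += 1 + length
    return records

example = bytes.fromhex("123456" "36405760123456" "b57f23345678" "00")
for rec in parse_mtr(example):
    print(rec)
```

Since every record states its own length, the loop can skip from record to record without understanding the payload of each type, which addresses the first issue listed in the Description.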

kaamos, you expressed interest in this work in the 2019 New York Unconference. I would like to know your opinion about this.


Marko Mäkelä made changes - 2019-06-11 08:57

Link

This issue relates to TODO-1874 [ TODO-1874 ]

Marko Mäkelä added a comment - 2019-06-14 05:32 - edited

As explained in ~~MDEV-19747~~, I hope that we can remove innodb_log_optimize_ddl and MLOG_INDEX_LOAD and related code. If the redo log volume is considerably reduced due to this format change, there should be no need to disable redo logging when ALTER TABLE is rebuilding tables or creating indexes. (There could be a separate global setting for disabling redo logging altogether, to speed up database initialization when crash recovery is not expected to work.)

For ~~MDEV-15528~~, MariaDB Server 10.4 introduced the record MLOG_INIT_FREE_PAGE that would allow us to punch holes or scrub pages after they have been freed.

If ~~MDEV-14425~~ introduces a separate log file for checkpoint and file metadata, we will not need MLOG_FILE_ or MLOG_CHECKPOINT records in the actual redo log files that contain changes that are to be applied to the data files. Thus, the ‘same page’ flag could never be set on the first record of a mini-transaction, and this could be used for future extension of the code space.

This would leave us with the following format:

First byte: ‘same page’ flag, record type, and a part of length.
Record type:

WRITE: replaces MLOG_nBYTES, MLOG_WRITE_STRING
MEMSET: corresponds to the 10.4 MLOG_MEMSET record
MEMMOVE: used as a building block for logging page reorganize
INIT: corresponds to MLOG_INIT_FILE_PAGE2
INDEX_INIT: initializing a B-tree or R-tree page
FREE: corresponds to MLOG_INIT_FREE_PAGE (~~MDEV-15528~~)
RESERVED: reserved for future use (its presence prevents crash-downgrade)
OPTION: optional record that may be safely ignored; examples:
- LSN of the previous change to a page (‘same page’ flag identifies the page)
- ~~MDEV-18976~~ page checksum at the start of the mini-transaction (‘same page’ flag identifies the page)
- binlog record

Note: The ‘same page’ flag can never be set on the INIT record, because the INIT record should only be issued for a freshly allocated page, and a single mini-transaction would not free and then allocate the same page. The ‘same page’ flag is also unlikely (but not impossible) to be set on the FREE record.

Next, I will try to estimate the logged size of some operations when using these lower-level records. The most interesting ones are btr_page_reorganize() (which would use MEMMOVE) and page_move_rec_list_start() (which would use WRITE and MEMSET). The mode MTR_LOG_SHORT_INSERTS would be removed.


Marko Mäkelä added a comment - 2019-06-14 09:53

Note: In the old format, MLOG_COMP_ records are written for index pages of ROW_FORMAT≠REDUNDANT, and these records include the index field lengths. The new format will omit that information.

Inserting records:

Old format (MLOG_REC_INSERT and MLOG_LIST_END_COPY_CREATED), in page_cur_insert_rec_write_log():

MLOG_REC_INSERT(space_id,page_number,preceding_record_offset,inserted_record_offset). Omitted for MTR_LOG_SHORT_INSERTS. Instead, one MLOG_LIST_END_COPY_CREATED record will encapsulate multiple inserts.
record size (including header) and a flag whether either the size or the header size differs from the preceding record
If the size differs: 1 byte of info_bits of the to-be-inserted record, followed by header size and the first differing byte in the header
data bytes of the record (and optionally header)

Note: When the changes are applied by page_cur_parse_insert_rec(), the page header and footer will be updated by page_cur_rec_insert() without them having been mentioned in the log. Applying the record can cause a crash if the previous page contents are inconsistent with what was logged, say, if an incorrect version of the page is available.

New format:

MEMMOVE (offset,len,old_offset): copy some of the preceding record header and prefix of contents
WRITE (offset,len,data): write the part of the record that differs from the preceding record

The ‘same page’ flag will be set on all but the first record for the page. Thus, the (space_id,page_number) will be logged only once.
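
How applying these low-level records could look can be sketched on a plain byte buffer. The function names, the 64-byte "page" and the offsets used here are hypothetical and chosen only for illustration.

```python
def _check(page: bytearray, offset: int, length: int) -> None:
    # Reject ranges that would fall outside the page (or grow the buffer).
    if offset < 0 or length < 0 or offset + length > len(page):
        raise ValueError("range is outside the page")

def apply_write(page: bytearray, offset: int, data: bytes) -> None:
    """WRITE (offset,len,data): overwrite len bytes at offset."""
    _check(page, offset, len(data))
    page[offset:offset + len(data)] = data

def apply_memmove(page: bytearray, offset: int, length: int, old_offset: int) -> None:
    """MEMMOVE (offset,len,old_offset): copy bytes within the same page."""
    _check(page, offset, length)
    _check(page, old_offset, length)
    # Taking a bytes() snapshot makes overlapping ranges behave like memmove().
    page[offset:offset + length] = bytes(page[old_offset:old_offset + length])

def apply_memset(page: bytearray, offset: int, length: int, fill: int = 0) -> None:
    """MEMSET: fill len bytes at offset with one byte value."""
    _check(page, offset, length)
    page[offset:offset + length] = bytes([fill]) * length

# Insert a record that shares a 5-byte prefix with the preceding record.
page = bytearray(64)
apply_write(page, 8, b"ABCDEFGH")   # the preceding record
apply_memmove(page, 16, 5, 8)       # reuse the common prefix "ABCDE"
apply_write(page, 21, b"xyz")       # log only the differing suffix
```

In this toy case only 3 payload bytes are logged for an 8-byte record; the rest is reconstructed from data that is already on the page.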

In the new format, we must explicitly update the page header and footer as well:

WRITE (offset,len,data) of PAGE_LAST_INSERT and possibly PAGE_FREE, PAGE_GARBAGE, PAGE_DIRECTION_B, PAGE_N_DIRECTION
WRITE (offset,len,data) of PAGE_N_RECS and possibly PAGE_MAX_TRX_ID (this was logged separately earlier)
If a page directory slot is added: WRITE (offset, 1, data) of n_owned and MEMMOVE and WRITE (offset, 2, data) to add a page directory slot

At the minimum, we must update 1 or 2 bytes of PAGE_LAST_INSERT and PAGE_N_RECS. The byte offset of both will fit in 1 byte, so these records will occupy 2 bytes plus the data length, that is, 2*(2+1) = 6 bytes minimum.

For inserting multiple records, we need to update the page header and footer only once.

Reorganizing a page: `btr_page_reorganize_low()`

Old format:

MLOG_PAGE_REORGANIZE(space_id,page_number)

Note: the index field lengths in MLOG_COMP_PAGE_REORGANIZE and MLOG_ZIP_PAGE_REORGANIZE can be very long.

New format:

Compare the old and reorganized page (for ROW_FORMAT=COMPRESSED, the compressed frame)
WRITE the modified part of the page header
MEMSET the unused portion of the page
MEMMOVE and WRITE the record payload and page footer

For reorganizing a page, we will obviously generate more log than earlier. Reorganizing pages should be a rather rare operation, so a possible size increase should be acceptable. Applying the records will be much simpler and faster.
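
The "compare the old and reorganized page" step might be sketched as a byte-wise diff that emits one WRITE per run of changed bytes. This is a deliberately simplified assumption: it emits only WRITE records and does not try to detect moved ranges (MEMMOVE) or zero-filled areas (MEMSET), which the plan above relies on to keep the log small. The name `diff_to_writes` is hypothetical.

```python
def diff_to_writes(old: bytes, new: bytes):
    """Return a list of (offset, data) WRITE operations turning old into new."""
    if len(old) != len(new):
        raise ValueError("page images must have the same size")
    writes, start = [], None
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            if start is None:
                start = i                 # a run of differing bytes begins
        elif start is not None:
            writes.append((start, bytes(new[start:i])))
            start = None                  # the run ended at i
    if start is not None:                 # a run reaching the end of the page
        writes.append((start, bytes(new[start:])))
    return writes

old_page = bytes(16)
new_page = bytearray(old_page)
new_page[2:4] = b"\x01\x02"
new_page[10] = 9
print(diff_to_writes(old_page, bytes(new_page)))
```

Replaying the returned operations on a copy of the old image reproduces the new image, which is what recovery needs regardless of how the page was reorganized.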

Creating an index page: `page_empty()`, `page_create()`

Optionally: INIT the page (not part of page_empty())
INDEX_INIT will create most of the page header and trailer and zero-fill the payload area
Optionally: WRITE to set FIL_PAGE_TYPE, PAGE_LEVEL, PAGE_MAX_TRX_ID, because INDEX_INIT will not touch them

The log volume should not be larger than with the old logging.

Writing an undo log record:

Old format:

MLOG_UNDO_INSERT(space_id, page_number, length, data)

This is 1+1‥5+1‥5+2+length bytes, that is, 5‥13+length bytes.

New format:

WRITE (space_id,page_number,TRX_UNDO_PAGE_HDR+TRX_UNDO_PAGE_FREE, 2, data)
WRITE (offset, length + 4, data)

Size: 1+1+1‥5+1‥5+2 bytes for the first record, 1+1‥3+1‥3+length+4 bytes for the second (assuming that length + 4 > 15), or total 13‥25+length bytes.

The overhead is 8‥12 bytes. Hopefully it will be more than compensated when logging record insertion (omitting index field length information).
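
The size arithmetic above can be rechecked with a short calculation. The meaning attached to each term in the comments is an interpretation, not something stated in the estimate itself, and the function names are hypothetical.

```python
def old_undo_insert_size(length: int):
    """(min, max) bytes of MLOG_UNDO_INSERT: 1 + 1..5 + 1..5 + 2 + length."""
    return (1 + 1 + 1 + 2 + length, 1 + 5 + 5 + 2 + length)

def new_undo_insert_size(length: int):
    """(min, max) bytes of the two WRITE records that replace it."""
    # First WRITE: 1 + 1 + 1..5 + 1..5 + 2 (assumed: first byte, byte offset,
    # tablespace ID, page number, 2 data bytes).
    first = (1 + 1 + 1 + 1 + 2, 1 + 1 + 5 + 5 + 2)
    # Second WRITE: 1 + 1..3 + 1..3 + length + 4 (assumed: first byte,
    # extra length bytes, byte offset, payload), valid when length + 4 > 15.
    second = (1 + 1 + 1 + length + 4, 1 + 3 + 3 + length + 4)
    return (first[0] + second[0], first[1] + second[1])

length = 100
old_min, old_max = old_undo_insert_size(length)
new_min, new_max = new_undo_insert_size(length)
print(old_min, old_max, new_min, new_max)
print("overhead:", new_min - old_min, new_max - old_max)
```

The difference between the two estimates does not depend on `length`, so the 8‥12 byte overhead is a fixed cost per undo log record.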


Andrei Elkin made changes - 2019-06-14 12:28

Assignee

Marko Mäkelä [ marko ]

Andrei Elkin [ elkin ]

Andrei Elkin made changes - 2019-06-14 12:29

Assignee

Andrei Elkin [ elkin ]

Marko Mäkelä [ marko ]

Marko Mäkelä added a comment - 2019-06-16 19:57

The function page_copy_rec_list_end_no_locks() should be extended with an option for logging page reorganize operations. Otherwise, page reorganize would have to iterate over the record lists twice: first, to copy the records, and then, to write log for crash recovery. Reorganize can emit the smallest possible number of MEMMOVE records followed by WRITE records (to adjust the page header, next-record links and the footer) and a MEMSET to clear the unused area.

Marko Mäkelä made changes - 2019-06-17 14:35

Status

Open [ 1 ]

In Progress [ 3 ]

Marko Mäkelä added a comment - 2019-06-19 13:19 - edited

Deleting records

Some overhead will be introduced for record deletion. In particular, MLOG_LIST_END_DELETE and MLOG_LIST_START_DELETE will only record the reference record. (Their MLOG_COMP_ variants will write the index field lengths as well, so in that case there could be savings.)

In the new low-level format, we must log each deleted record separately:

MEMMOVE (space_id,page_number,offset,2,old_offset) to make the preceding record point to the succeeding record
MEMMOVE (offset,2,old_offset) to make the record point to the start of the PAGE_GARBAGE list
WRITE to update various page header and trailer fields (this can be done once for deleting multiple records)
optionally, MEMSET to clean the payload area of the record in the PAGE_GARBAGE list
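The deletion steps listed above can be sketched on a toy page. This is a simplified model under stated assumptions, not InnoDB code: the page is a plain byte array, next-record pointers are absolute 2-byte big-endian offsets, `PAGE_FREE_OFF` is a made-up header offset that holds the head of the garbage list, and only that one header field is updated.

```python
# Toy model, not InnoDB code. All offsets and the record layout are assumptions.

PAGE_FREE_OFF = 0  # hypothetical header field: head of the PAGE_GARBAGE list


def apply_write(page, offset, data):
    """WRITE: store literal bytes at offset."""
    page[offset:offset + len(data)] = data


def apply_memmove(page, offset, length, source):
    """MEMMOVE: copy length bytes within the page from source to offset."""
    page[offset:offset + length] = page[source:source + length]


def apply_memset(page, offset, length, byte=0):
    """MEMSET: fill length bytes at offset with a single byte value."""
    page[offset:offset + length] = bytes([byte]) * length


def read_u16(page, offset):
    return int.from_bytes(page[offset:offset + 2], "big")


def delete_record(page, prev, rec, payload_len):
    """Build and apply the toy log records that unlink rec, which follows prev.

    Each toy record starts with its 2-byte next pointer, followed by payload.
    """
    log = [
        ("MEMMOVE", prev, 2, rec),                          # prev.next = rec.next
        ("MEMMOVE", rec, 2, PAGE_FREE_OFF),                 # rec.next = old garbage head
        ("WRITE", PAGE_FREE_OFF, rec.to_bytes(2, "big")),   # garbage head = rec
        ("MEMSET", rec + 2, payload_len, 0),                # optional: scrub the payload
    ]
    apply = {"WRITE": apply_write, "MEMMOVE": apply_memmove, "MEMSET": apply_memset}
    for kind, *args in log:
        apply[kind](page, *args)
    return log
```

The order of the first two records matters in this model: the predecessor must copy the deleted record's next pointer before that pointer is overwritten with the garbage list head.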

Currently, page_create_empty() optimizes the case where the entire page becomes empty as a result of a deletion.

We could optimize one more special case of deletion to reduce both redo log volume and the frequency of page reorganize operations: When deleting the last inserted record(s) such that the maximum heap number will be decremented, we could free the space altogether instead of putting the record to the PAGE_GARBAGE list for potential future same-or-smaller-size reallocation. In this way, the logging would be as follows:

MEMMOVE (space_id,page_number,offset,2,old_offset) to make the preceding record point to the successor of the being-deleted record(s)
MEMSET to zero out the being-deleted records
WRITE of page header and footer fields

Updating records

In the clustered index (which stores the data ordered by PRIMARY KEY), records can be ‘updated in place’ when neither the size nor the PRIMARY KEY of the record is changing. In secondary indexes, ‘update in place’ is very rare, usually only happening when the case of a case-insensitive PRIMARY KEY is changing.

If an ‘update in place’ is not possible, InnoDB will execute delete and insert in the same page and possibly split the page. All this is covered by operations described above.

For ‘update in place’, we previously wrote a MLOG_REC_UPDATE_IN_PLACE record, which includes some information that can be useless:

flags for undo logging and locking (basically, whether the following DB_TRX_ID,DB_ROLL_PTR fields should be ignored)
possibly ignored position of DB_TRX_ID column (1‥3 bytes)
possibly ignored value of DB_ROLL_PTR (7 bytes)
possibly ignored value of DB_TRX_ID (1‥7 bytes)
start offset of the record (2 bytes)
MLOG_COMP_REC_UPDATE_IN_PLACE for ROW_FORMAT≠REDUNDANT will encode index field lengths as well.

In contrast to this, the new format would only write WRITE records, possibly optimized to MEMMOVE if the page already contains the to-be-written value somewhere else. MEMMOVE could save space not only for frequently occurring user column values, but also for logging DB_TRX_ID or part of DB_ROLL_PTR when the same transaction is updating multiple records in the same clustered index leaf page. Significant savings should be expected for logging updates.
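The choice between WRITE and MEMMOVE for an in-place update could look roughly like the following sketch. The function name, the tuple layout of the log records and the `min_len` threshold are all assumptions made for illustration; the real code would have to weigh the encoded sizes of both record types.

```python
def log_update(page, offset, data, min_len=4):
    """Return a toy log record that changes page[offset:offset+len(data)] to data.

    Emits ("MEMMOVE", offset, length, source) when the bytes already exist
    somewhere in the page, else ("WRITE", offset, data). Returns None when the
    page already holds the value. min_len is a made-up threshold below which
    a MEMMOVE is assumed not to pay off.
    """
    data = bytes(data)
    length = len(data)
    if bytes(page[offset:offset + length]) == data:
        return None  # nothing changes, so nothing needs to be logged
    # Search the current page image for an existing copy of the value.
    source = bytes(page).find(data) if length >= min_len else -1
    if source >= 0:
        return ("MEMMOVE", offset, length, source)
    return ("WRITE", offset, data)
```

For example, when one transaction updates several records in the same leaf page, the second and later updates would find the transaction identifier already present in the page and could be logged as a short MEMMOVE instead of repeating the bytes.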

Marko Mäkelä added a comment - 2019-08-15 10:17

When it comes to data file creation, we should implement strict write-ahead logging. Even with ~~MDEV-18128~~ fixed, InnoDB is currently doing it wrong:

create the file
preallocate the file, zero-initialized
write a dummy page 0
write a MLOG_FILE_ record with the file name
at some later point of time, write the actual contents of the data file from the buffer pool

This is performing the change first, and logging it only after the fact. Currently, recovery never creates or initializes data files.

If we did it properly, we could simply reduce it to the following:

write log records to create the file and initialize page 0 with final contents (in a single mini-transaction)
create the file
preallocate the file, zero-initialized
at some later point of time, write the actual contents of the data file from the buffer pool

Recovery would create the file if needed, and there would be no issue whatsoever if recovery encountered an empty file or a file filled with zeroes.
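The proposed ordering can be illustrated with a toy simulation. The record names `FILE_CREATE` and `INIT_PAGE`, the in-memory list standing in for the redo log, and both function names are assumptions for this sketch only; it merely shows why logging first makes a missing or zero-filled file harmless at recovery.

```python
import os
import tempfile


def create_tablespace(log, path, page0, crash_before_create=False):
    """Log first, then touch the file system (the proposed order)."""
    log.append(("FILE_CREATE", path))
    log.append(("INIT_PAGE", path, 0, page0))
    if crash_before_create:
        return  # simulated crash: the log is durable, the file does not exist
    with open(path, "wb") as f:
        f.truncate(len(page0))  # preallocate, zero-initialized


def recover(log):
    """Replay the toy log: create missing files and rewrite page contents."""
    for rec in log:
        if rec[0] == "FILE_CREATE":
            if not os.path.exists(rec[1]):
                open(rec[1], "wb").close()
        elif rec[0] == "INIT_PAGE":
            _, path, page_no, data = rec
            with open(path, "r+b") as f:
                f.seek(page_no * len(data))
                f.write(data)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        log = []
        path = os.path.join(d, "t1.ibd")
        create_tablespace(log, path, b"\x01" * 16, crash_before_create=True)
        recover(log)
        print(os.path.getsize(path))
```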

Sergei Golubchik made changes - 2019-08-15 11:39

Priority

Major [ 3 ]

Critical [ 2 ]

Marko Mäkelä made changes - 2019-09-11 12:07

Link

This issue relates to ~~MDEV-20562~~ [ ~~MDEV-20562~~ ]

Marko Mäkelä made changes - 2019-09-13 10:43

Link

This issue relates to ~~MDEV-20584~~ [ ~~MDEV-20584~~ ]

Marko Mäkelä made changes - 2019-09-17 07:58

Link

This issue relates to ~~MDEV-20608~~ [ ~~MDEV-20608~~ ]

Marko Mäkelä made changes - 2019-09-17 08:20

Link

This issue relates to ~~MDEV-19747~~ [ ~~MDEV-19747~~ ]

Marko Mäkelä made changes - 2019-11-11 11:24

Link

This issue relates to ~~MDEV-21024~~ [ ~~MDEV-21024~~ ]

Marko Mäkelä added a comment - 2019-11-12 14:20

To prepare for this, I merged mtr_t::Impl and mtr_t::Command to mtr_t and removed unused or redundant data fields of mtr_t::Command.
Further changes will be made to replace various mlog_write_ functions with member functions of mtr_t.

Marko Mäkelä made changes - 2019-11-28 17:44

Link

This issue is blocked by ~~MDEV-21174~~ [ ~~MDEV-21174~~ ]

Marko Mäkelä made changes - 2020-01-02 11:50

Link

This issue is blocked by ~~MDEV-21351~~ [ ~~MDEV-21351~~ ]

Matthias Leich made changes - 2020-01-06 18:43

Attachment

simp_page_rec_set_n_owned.test [ 50212 ]

Matthias Leich made changes - 2020-01-06 18:43

Attachment

simp_page_rec_set_n_owned-master.opt [ 50213 ]

Matthias Leich made changes - 2020-01-06 18:43

Attachment

prt [ 50214 ]

Marko Mäkelä added a comment - 2020-01-09 15:48

I am still debugging changes of the new, efficient redo log record format. Several recovery tests are still failing, but some do pass.
I rebased the bb-10.5-~~MDEV-12353~~ branch to the current 10.5, with preparatory commits that replace all high-level InnoDB redo log record types with lower-level ones (and temporarily introduce a record MLOG_ZIP_WRITE_STRING for physical logging on ROW_FORMAT=COMPRESSED pages). This will actually increase the redo log volume and could decrease performance.

Once I have debugged the new redo log encoding (which I did not push yet), I will remove the code to recover from the old-format redo log. An upgrade after a crash of an earlier server would not be supported.

Marko Mäkelä added a comment - 2020-01-22 15:55

The branch now includes an implementation of the new format, with the exception of the same_page flag. That is, even if a mini-transaction is writing multiple subsequent records for the same page, it will encode the page identifier in each record.

Marko Mäkelä made changes - 2020-01-24 15:21

Link

This issue relates to ~~MDEV-13830~~ [ ~~MDEV-13830~~ ]

Marko Mäkelä added a comment - 2020-01-28 11:03

I have now implemented the same_page encoding. In some cases, we are writing more redo log than earlier, and I must start assessing and fixing those. For inserting records, we can optimize some long WRITE records by emitting MEMMOVE for copying the preceding record, and then only writing the last bytes that differ between the records.
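The idea behind the same_page encoding can be sketched with a toy record format. The layout below (one flag byte, an 8-byte page identifier, a 1-byte payload length) is an assumption chosen for readability and is not the on-disk layout; it only shows how omitting the repeated page identifier shrinks consecutive records for the same page.

```python
def encode_mtr(records):
    """Encode (space_id, page_no, payload) tuples of one mini-transaction.

    Toy format (an assumption): a flag byte, then an 8-byte page identifier
    unless the flag says "same page as the previous record", then a 1-byte
    payload length and the payload (so payloads must be under 256 bytes).
    """
    out = bytearray()
    prev = None
    for space_id, page_no, payload in records:
        page_id = (space_id, page_no)
        same_page = page_id == prev
        out.append(1 if same_page else 0)
        if not same_page:
            out += space_id.to_bytes(4, "big") + page_no.to_bytes(4, "big")
        out.append(len(payload))
        out += payload
        prev = page_id
    return bytes(out)


def decode_mtr(buf):
    """Inverse of encode_mtr: the page identifier carries over when omitted."""
    records, pos, page_id = [], 0, None
    while pos < len(buf):
        same_page = buf[pos] == 1
        pos += 1
        if not same_page:
            page_id = (int.from_bytes(buf[pos:pos + 4], "big"),
                       int.from_bytes(buf[pos + 4:pos + 8], "big"))
            pos += 8
        n = buf[pos]
        pos += 1
        records.append((page_id[0], page_id[1], bytes(buf[pos:pos + n])))
        pos += n
    return records
```

In this toy format, every record after the first one for a given page saves the 8 bytes of the page identifier.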

Marko Mäkelä made changes - 2020-02-06 10:39

Link

This issue is blocked by ~~MDEV-21674~~ [ ~~MDEV-21674~~ ]

Marko Mäkelä added a comment - 2020-02-11 15:28

I optimized page_cur_insert_rec_low() and BtrBulk so that they write fewer WRITE records and try to copy data from the preceding record with MEMMOVE. A crude benchmark in ~~MDEV-19747~~ indicates that the amount of redo log written is comparable to the old format. But we have not used all optimization potential yet; btr_page_reorganize() could be optimized further.
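Copying from the preceding record could be sketched as follows. This is a simplified model, not the logic of page_cur_insert_rec_low(): the function name and tuple layout are hypothetical, and only a common prefix is exploited, followed by a single WRITE of the differing tail.

```python
def log_insert(prev_rec, new_rec, dest, prev_offset):
    """Toy log for storing new_rec at dest, given the preceding record.

    prev_rec is the byte image of the preceding record, which is assumed to
    start at prev_offset in the same page. The longest common prefix is
    copied with one MEMMOVE; only the remaining tail is logged as a WRITE.
    """
    n = 0
    limit = min(len(prev_rec), len(new_rec))
    while n < limit and prev_rec[n] == new_rec[n]:
        n += 1
    log = []
    if n:
        log.append(("MEMMOVE", dest, n, prev_offset))
    if n < len(new_rec):
        log.append(("WRITE", dest + n, bytes(new_rec[n:])))
    return log
```

With records that share long runs of identical bytes, the literal data carried in the log shrinks to the bytes that actually differ.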

I tested a variation of the microbenchmark that I originally developed for ~~MDEV-19747~~:

--source include/have_innodb.inc
--source include/have_sequence.inc
show status like 'innodb_lsn_current';
SET profiling = 1;
CREATE TABLE t1 (
 a BIGINT PRIMARY KEY,
 b CHAR(255) NOT NULL DEFAULT '',
 c CHAR(255) NOT NULL DEFAULT '',
 d CHAR(255) NOT NULL DEFAULT '',
 e CHAR(255) NOT NULL DEFAULT '',
 f CHAR(255) NOT NULL DEFAULT '',
 g CHAR(255) NOT NULL DEFAULT '',
 h CHAR(255) NOT NULL DEFAULT ''
) ENGINE=InnoDB;
INSERT INTO t1 (a) SELECT seq FROM seq_1_to_500000;
SHOW profiles;
show status like 'innodb_lsn_current';
--let $shutdown_timeout=0
--source include/restart_mysqld.inc
DROP TABLE t1;

I executed it as follows:

./mtr --mysqld=--innodb-{page-size=4k,buffer-pool-size=64m,log-file-size=512m,force-recovery=2} innodb.recovery

The test execution time seems comparable between the two branches. There is bound to be some variation, because log checkpoints and page flushing are nondeterministic. Up to the CREATE TABLE, the ~~MDEV-12353~~ branch is writing slightly less log. At the end of the test, the LSN is slightly bigger for that branch. The new recovery code seems to be slightly faster, as expected.

I specified innodb_force_recovery=2 in order to shut down the purge of transaction history. (~~MDEV-12288~~ resetting the DB_TRX_ID after the INSERT would be nondeterministic.)

A debugging session with this test showed that the physical redo log recovery code is only invoking malloc() for adding log snippets to recv_sys.pages and for some operations on file names. For the old redo log format, we would be invoking malloc() at least in mlog_parse_index().

This exercise revealed a recent performance regression which I fixed.

Marko Mäkelä made changes - 2020-02-11 16:28

Link

This issue is blocked by ~~MDEV-19747~~ [ ~~MDEV-19747~~ ]

Marko Mäkelä made changes - 2020-02-13 09:26

Link

This issue causes ~~MDEV-21724~~ [ ~~MDEV-21724~~ ]

Marko Mäkelä made changes - 2020-02-13 09:29

Link

This issue causes MDEV-21725 [ MDEV-21725 ]

Marko Mäkelä made changes - 2020-02-13 09:52

Link

This issue causes MDEV-21727 [ MDEV-21727 ]

Marko Mäkelä added a comment - 2020-02-14 05:01

Single-threaded performance is affected somewhat. Here is what I got for testing 10.5 immediately before/after ~~MDEV-12353~~, built with clang 9.0.1 -O2 -march=native -mtune=native on an Intel Xeon E5-2630 (Haswell microarchitecture):

./mtr main.sum_distinct-big

version	Debug	RelWithDebInfo
10.5 ac51bcfd8d3 (before)	152.600s	43.994s
10.5 f8a9f906679 (after)	147.608s	51.003s

If we omit the MyISAM part of the test and use CREATE TEMPORARY TABLE t2 or persistent CREATE TABLE t2 for the InnoDB test, the difference becomes as follows:

version	temporary	persistent
10.5 ac51bcfd8d3 (before)	12.458s	21.831s
10.5 f8a9f906679 (after)	12.808s	30.727s

There seems to be some performance regression that I had overlooked earlier. This ought to be due to the INSERT logging that we will be optimizing further in ~~MDEV-21724~~.

Marko Mäkelä made changes - 2020-02-14 05:01

issue.field.resolutiondate

2020-02-14 05:01:19.0

2020-02-14 05:01:19.48

Marko Mäkelä made changes - 2020-02-14 05:01

Fix Version/s		10.5.1 [ 24029 ]
Fix Version/s	10.5 [ 23123 ]
Resolution		Fixed [ 1 ]
Status	In Progress [ 3 ]	Closed [ 6 ]

Marko Mäkelä added a comment - 2020-02-14 09:19

The regression is due to increased redo log volume, and it does not affect small INSERT or UPDATE. It will be addressed in ~~MDEV-21724~~ and MDEV-21725.

Marko Mäkelä added a comment - 2020-02-14 09:36

The following test demonstrates that we are doing fine with in-place UPDATE:

--source include/have_innodb.inc
CREATE TABLE t2 (a BIGINT) ENGINE=InnoDB;
delimiter |;
CREATE PROCEDURE p(low BIGINT, high BIGINT)
begin
  SHOW STATUS LIKE 'innodb_lsn_current';
  INSERT INTO t2 VALUES(low);
  WHILE low<high DO
    UPDATE t2 SET a=low;
    SET low = low + 1;
  END WHILE;
  SHOW STATUS LIKE 'innodb_lsn_current';
end|
delimiter ;|
CALL p(9223372036854644735,9223372036854775807);
DROP PROCEDURE p;
DROP TABLE t2;

./mtr --mysqld=--innodb-force-recovery=2 main.ibu

revision	LSN1	LSN2-LSN1	time
before	61,792	27,098,532	3.92s
after	52,361	22,173,911	3.41s

Let us repeat the same in a single transaction (less overhead to manage undo log pages, or to flush the redo log):

./mtr --mysqld=--skip-autocommit --mysqld=--innodb-force-recovery=2 main.ibu

revision	LSN1	LSN2-LSN1	time
before	61,792	11,512,365	1.49s
after	52,361	8,299,770	1.38s

In both cases, the compact physical log format outperforms the old format, both in terms of redo log written and time elapsed.
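The relative savings implied by the two tables can be computed as follows. The figures are copied from the tables above; the labels "autocommit" and "single transaction" and the helper name are only used here for illustration.

```python
# (before, after) pairs copied from the tables above:
# "log" is LSN2-LSN1 in bytes, "time" is the elapsed time in seconds.
runs = {
    "autocommit": {"log": (27_098_532, 22_173_911), "time": (3.92, 3.41)},
    "single transaction": {"log": (11_512_365, 8_299_770), "time": (1.49, 1.38)},
}


def reduction(before, after):
    """Relative saving of `after` compared to `before`, in percent."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    for name, run in runs.items():
        print(f"{name}: log -{reduction(*run['log']):.1f}%, "
              f"time -{reduction(*run['time']):.1f}%")
```

By this calculation the new format writes roughly 18% less redo log in the autocommit run and roughly 28% less in the single-transaction run.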

Marko Mäkelä made changes - 2020-02-14 13:25

Fix Version/s		10.5.2 [ 24030 ]
Fix Version/s	10.5.1 [ 24029 ]

Vladislav Vaintroub added a comment - 2020-02-14 15:57

As far as I can tell, page_cur_insert_rec_low() dominates the perf top output for small (index) updates in the test case for ~~MDEV-21534~~ (basically, the update_index sysbench workload). It even beats log_write_up_to() in most scenarios in the baseline.

Marko Mäkelä added a comment - 2020-02-16 12:26

Before MDEV-21725 is completed, page reorganize operations will essentially copy the entire page payload to the redo log by invoking page_cur_insert_rec_low() for every record, instead of only writing a minimal amount of log to cover the physical changes to the page.

Updating an indexed record will cause an insert, deferred delete (purge) and potentially a large number of page reorganize operations. Also, page splits and merges will cause deletions of record ranges, which have not been optimized yet either. (We write redundant log records for updating some page header fields multiple times, for example.) I think that in MDEV-21725, we should log the deletions of record ranges in the same way we would log a page reorganize (and combine the deletion with a reorganization). That should cure the observed regression.

Marko Mäkelä made changes - 2020-02-17 10:00

Link

This issue causes ~~MDEV-21744~~ [ ~~MDEV-21744~~ ]

Marko Mäkelä made changes - 2020-02-17 17:36

Link

This issue causes ~~MDEV-21749~~ [ ~~MDEV-21749~~ ]

Marko Mäkelä made changes - 2020-02-17 18:14

Link

This issue causes ~~MDEV-21751~~ [ ~~MDEV-21751~~ ]

Marko Mäkelä added a comment - 2020-02-18 06:47

I now believe that ~~MDEV-21724~~ has an even larger impact on performance. The amount of redo log written for an INSERT of an index record has almost doubled for records with small numbers of fields. To address that, we must introduce custom log records for those operations.

Marko Mäkelä made changes - 2020-02-18 09:38

Link

This issue causes ~~MDEV-21751~~ [ ~~MDEV-21751~~ ]

Marko Mäkelä added a comment - 2020-02-19 06:50 - edited

I collected statistics on the amount of redo log written for the following SQL:

--source include/have_innodb.inc
CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1),(2);
DROP TABLE t1;

I used the following commands in GDB to identify the mini-transactions:

break ha_innobase::write_row
run
break mtr_t::finish_write()
command 2
up 2
up
end
continue
continue
…

Here are the mini-transactions (excluding purge and log checkpoints):

bytes(old)	bytes(new)	bytes(revised)	source	operation
48	59	48	trx_undo_report_row_operation()	INSERT (1) undo log
36	47	27	row_ins_clust_index_entry_low()	INSERT (1) b-tree
14	22	14	trx_undo_report_row_operation()	INSERT (2) undo log
30	52	24	row_ins_clust_index_entry_low()	INSERT (2) b-tree
88	49	49	trx_commit_low()	COMMIT (INSERT)
107	108	97	trx_undo_report_row_operation()	DROP TABLE
20	17	17	row_upd_del_mark_clust_rec()
8	8	8	row_upd_clust_step()
63	72	64	trx_undo_report_row_operation()
20	17	17	row_upd_del_mark_clust_rec()
55	64	56	trx_undo_report_row_operation()
20	17	17	row_upd_del_mark_clust_rec()
52	61	53	trx_undo_report_row_operation()
20	17	17	row_upd_del_mark_clust_rec()
17	10	10	row_upd_sec_index_entry()
36	46	37	trx_undo_report_row_operation()
21	18	18	row_upd_del_mark_clust_rec()
36	45	37	trx_undo_report_row_operation()
21	18	18	row_upd_del_mark_clust_rec()
19	18	18	fil_delete_tablespace()
88	49	49	trx_commit_low()	COMMIT (DROP TABLE)

The third column shows the impact of introducing higher-level log records that correspond to MLOG_UNDO_INSERT and roughly correspond to MLOG_UNDO_INIT. Due to a different encoding, we sometimes lose 1 byte in our UNDO_APPEND replacement of MLOG_UNDO_INSERT.

The high-level log records for inserting index tree records will be introduced in ~~MDEV-21724~~. The deletion of records is not covered by this micro-benchmark, but it was improved earlier.

For updates, the new encoding is outperforming the old encoding, as expected.
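To compare the three byte columns of the mini-transaction table as a whole, the rows can be summed with a short script. The numbers are copied from the table; the variable names are only for this sketch.

```python
# Rows copied from the table above: (old, new, revised) bytes per mini-transaction.
rows = [
    (48, 59, 48), (36, 47, 27), (14, 22, 14), (30, 52, 24), (88, 49, 49),
    (107, 108, 97), (20, 17, 17), (8, 8, 8), (63, 72, 64), (20, 17, 17),
    (55, 64, 56), (20, 17, 17), (52, 61, 53), (20, 17, 17), (17, 10, 10),
    (36, 46, 37), (21, 18, 18), (36, 45, 37), (21, 18, 18), (19, 18, 18),
    (88, 49, 49),
]

# zip(*rows) transposes the row tuples into three columns.
old_total, new_total, revised_total = (sum(col) for col in zip(*rows))

if __name__ == "__main__":
    print(old_total, new_total, revised_total)
```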

Marko Mäkelä made changes - 2020-03-09 09:41

Link

This issue causes ~~MDEV-21892~~ [ ~~MDEV-21892~~ ]

Marko Mäkelä made changes - 2020-03-09 09:44

Link

This issue causes ~~MDEV-21893~~ [ ~~MDEV-21893~~ ]

Marko Mäkelä made changes - 2020-03-09 16:56

Link

This issue relates to ~~MDEV-18976~~ [ ~~MDEV-18976~~ ]

Marko Mäkelä made changes - 2020-03-09 17:49

Link

This issue relates to ~~MDEV-21899~~ [ ~~MDEV-21899~~ ]

Marko Mäkelä made changes - 2020-03-10 07:09

Link

This issue causes ~~MDEV-21748~~ [ ~~MDEV-21748~~ ]

Marko Mäkelä made changes - 2020-03-15 18:21

Link

This issue causes ~~MDEV-21945~~ [ ~~MDEV-21945~~ ]

Marko Mäkelä made changes - 2020-03-16 08:14

Link

This issue causes ~~MDEV-21949~~ [ ~~MDEV-21949~~ ]

Marko Mäkelä made changes - 2020-03-26 14:11

Link

This issue causes ~~MDEV-21850~~ [ ~~MDEV-21850~~ ]

Marko Mäkelä made changes - 2020-03-26 15:41

Link

This issue causes ~~MDEV-21779~~ [ ~~MDEV-21779~~ ]

Marko Mäkelä made changes - 2020-03-31 16:53

Link

This issue causes ~~MDEV-22097~~ [ ~~MDEV-22097~~ ]

Marko Mäkelä made changes - 2020-04-01 14:58

Link

This issue causes ~~MDEV-22107~~ [ ~~MDEV-22107~~ ]

Marko Mäkelä made changes - 2020-04-02 15:09

Link

This issue causes ~~MDEV-22108~~ [ ~~MDEV-22108~~ ]

Marko Mäkelä made changes - 2020-04-02 16:34

Link

This issue relates to ~~MDEV-22126~~ [ ~~MDEV-22126~~ ]

Marko Mäkelä made changes - 2020-04-08 05:39

Link

This issue relates to ~~MDEV-22097~~ [ ~~MDEV-22097~~ ]

Marko Mäkelä made changes - 2020-04-08 05:39

Link

This issue causes ~~MDEV-22097~~ [ ~~MDEV-22097~~ ]

Marko Mäkelä made changes - 2020-04-14 15:52

Link

This issue causes ~~MDEV-22242~~ [ ~~MDEV-22242~~ ]

Marko Mäkelä made changes - 2020-07-10 07:13

Link

This issue relates to ~~MDEV-23136~~ [ ~~MDEV-23136~~ ]

Marko Mäkelä made changes - 2020-08-14 06:55

Link

This issue relates to ~~MDEV-23474~~ [ ~~MDEV-23474~~ ]

Marko Mäkelä made changes - 2020-09-24 11:09

Link

This issue causes ~~MDEV-23806~~ [ ~~MDEV-23806~~ ]

Marko Mäkelä made changes - 2020-10-20 13:34

Link

This issue relates to ~~MDEV-23986~~ [ ~~MDEV-23986~~ ]

Marko Mäkelä made changes - 2020-11-11 13:39

Link

This issue causes ~~MDEV-24196~~ [ ~~MDEV-24196~~ ]

Marko Mäkelä made changes - 2020-11-18 11:32

Link

This issue relates to MDEV-20636 [ MDEV-20636 ]

Thirunarayanan Balathandayuthapani made changes - 2021-01-22 09:24

Link

This issue causes ~~MDEV-24652~~ [ ~~MDEV-24652~~ ]

Marko Mäkelä made changes - 2021-01-28 13:56

Link

This issue relates to ~~MDEV-24719~~ [ ~~MDEV-24719~~ ]

Marko Mäkelä made changes - 2021-02-15 10:31

Link

This issue causes ~~MDEV-24848~~ [ ~~MDEV-24848~~ ]

Rob Schwyzer (Inactive) made changes - 2021-06-23 22:52

Labels

performance recovery

ServiceNow performance recovery

Rob Schwyzer (Inactive) made changes - 2021-07-02 19:59

Labels

ServiceNow performance recovery

76qDvLB8Gju6Hs7nk3VY3EX42G795W5z performance recovery

Sergei Golubchik made changes - 2021-08-13 22:34

Labels

76qDvLB8Gju6Hs7nk3VY3EX42G795W5z performance recovery

performance recovery

Marko Mäkelä made changes - 2021-08-26 15:55

Link

This issue relates to ~~MDEV-26322~~ [ ~~MDEV-26322~~ ]

Rob Schwyzer (Inactive) made changes - 2021-09-30 16:38

Remote Link

This issue links to "Page (Confluence)" [ 32008 ]

Rob Schwyzer (Inactive) made changes - 2021-09-30 16:41

Remote Link

This issue links to "Page (Confluence)" [ 32020 ]

Rob Schwyzer (Inactive) made changes - 2021-10-07 16:39

Remote Link

This issue links to "Page (Confluence)" [ 32113 ]

Rob Schwyzer (Inactive) made changes - 2021-10-14 16:47

Remote Link

This issue links to "Page (Confluence)" [ 32225 ]

Rob Schwyzer (Inactive) made changes - 2021-10-21 16:52

Remote Link

This issue links to "Page (Confluence)" [ 32258 ]

Rob Schwyzer (Inactive) made changes - 2021-10-28 15:50

Remote Link

This issue links to "Page (Confluence)" [ 32307 ]

Rob Schwyzer (Inactive) made changes - 2021-11-03 22:46

Remote Link

This issue links to "Page (MariaDB Confluence)" [ 32326 ]

Rob Schwyzer (Inactive) made changes - 2021-11-04 19:14

Remote Link

This issue links to "Page (Confluence)" [ 32008 ]

Marko Mäkelä made changes - 2021-11-16 14:54

Link

This issue causes ~~MDEV-27059~~ [ ~~MDEV-27059~~ ]

Sergei Golubchik made changes - 2021-12-06 21:23

Workflow

MariaDB v3 [ 80109 ]

MariaDB v4 [ 133192 ]

Jan Lindström (Inactive) made changes - 2022-01-13 08:12

Link

This issue causes MDEV-27444 [ MDEV-27444 ]

Jan Lindström (Inactive) made changes - 2022-01-14 07:44

Link

This issue causes MDEV-27486 [ MDEV-27486 ]

Marko Mäkelä made changes - 2022-01-14 15:04

Link

This issue relates to ~~MDEV-27437~~ [ ~~MDEV-27437~~ ]

Marko Mäkelä made changes - 2022-04-07 12:36

Link

This issue relates to ~~MDEV-28256~~ [ ~~MDEV-28256~~ ]

Marko Mäkelä made changes - 2022-06-02 14:32

Link

This issue causes ~~MDEV-28731~~ [ ~~MDEV-28731~~ ]

Marko Mäkelä made changes - 2022-07-22 16:40

Link

This issue relates to ~~MDEV-29153~~ [ ~~MDEV-29153~~ ]

Marko Mäkelä made changes - 2022-11-25 08:06

Link

This issue causes ~~MDEV-24412~~ [ ~~MDEV-24412~~ ]

Aleksey Midenkov made changes - 2023-09-04 11:50

Link

This issue relates to ~~MDEV-31042~~ [ ~~MDEV-31042~~ ]

Marko Mäkelä made changes - 2023-09-11 08:46

Link

This issue relates to ~~MDEV-32144~~ [ ~~MDEV-32144~~ ]

Marko Mäkelä made changes - 2023-10-11 09:18

Link

This issue relates to ~~MDEV-32445~~ [ ~~MDEV-32445~~ ]

Marko Mäkelä made changes - 2023-12-11 14:01

Link

This issue relates to ~~MDEV-15274~~ [ ~~MDEV-15274~~ ]

Marko Mäkelä made changes - 2024-01-18 08:42

Link

This issue relates to ~~MDEV-33274~~ [ ~~MDEV-33274~~ ]

Rob Schwyzer (Inactive) made changes - 2024-04-04 19:53

Remote Link

This issue links to "Page (MariaDB Confluence)" [ 36698 ]

Rob Schwyzer (Inactive) made changes - 2024-04-05 18:33

Remote Link

This issue links to "Page (MariaDB Confluence)" [ 36698 ]

Jira Automation (IT) made changes - 2024-07-04 08:35

Zendesk Related Tickets		201658 167030
Zendesk active tickets		201658

People

Assignee: Marko Mäkelä

Reporter: Marko Mäkelä

Votes: 4

Watchers: 15

Dates

Created: 2017-03-24 08:35

Updated: 2024-09-06 10:00

Resolved: 2020-02-14 05:01
