[MDEV-12353] Efficient InnoDB redo log record format Created: 2017-03-24 Updated: 2024-01-18 Resolved: 2020-02-14 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Fix Version/s: | 10.5.2 |
| Type: | Task | Priority: | Critical |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 4 |
| Labels: | performance, recovery | ||
| Attachments: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
The InnoDB crash recovery performance can be improved a little while not changing the file format. We should fix some fundamental issues that exist with the current InnoDB redo log record format:
In this task, we will only improve the redo log record format. Format changes to the redo log blocks and files will be covered by |
| Comments |
| Comment by Marko Mäkelä [ 2017-12-19 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Related to this, in MariaDB 10.3, I removed the following redo log record types:
Simplifying the redo logging of basic B-tree operations would bring much more benefit. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-06-07 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I had to put back the MLOG_UNDO_INSERT and MLOG_UNDO_INIT redo log record types, because the lower-level records are occupying more space in the redo log, and thus measurably slowing down the server. This is a problem with the redo log record format: if a mini-transaction contains multiple log records for the same page, each record will repeat the tablespace ID and page number. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-05-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I think that the records of a mini-transaction could be a stream of records that is always terminated by a NUL byte. The NUL byte would also act as padding in log blocks. We would remove the MLOG_SINGLE_REC_FLAG that currently identifies single-record mini-transactions. The first byte of the record would contain a record type, flags, and a part of length. At the minimum, we would seem to need the following record types:
For ROW_FORMAT=COMPRESSED pages, we would mostly write WRITE records that would refer to the compressed page frames. Currently the MLOG_ records refer to the uncompressed page frames. We would seem to need 3 bits for the redo log record type. MLOG_FILE_ records and MLOG_CHECKPOINT can be represented by setting the "same page" flag at the start of the mini-transaction. After the first record within a mini-transaction that has this flag clear, there could not be any non-page redo log records. This would leave 4 bits for record length in the first byte. Values 1‥15 would represent lengths of 1 to 15 bytes. If the total length of the record is longer than 15 bytes, then the value 0 would be used to indicate that 1 to 3 length bytes will follow. The encoded tablespace identifier and page number could use variable-length encoding, instead of always using 4+4 bytes. For encoding the byte offset for WRITE and MEMSET operations, we will use a variable-length encoding of 1‥3 bytes, instead of always logging 2 bytes. I would not use any delta coding where the byte offsets would be relative to preceding operations in the same mini-transaction. Keeping it simple allows recovery to group together redo log records for the same page from multiple independent mini-transactions. As mentioned at the start of this comment, the type byte 0 would be special, marking the end of a mini-transaction. We could use the corresponding flagged value 0x80 for something special, such as a future extension when more type codes are needed, or for encoding rarely needed redo log records. Examples:
kaamos, you expressed interest in this work in the 2019 New York Unconference. I would like to know your opinion about this. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-06-14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
As explained in For If This would leave us with the following format: First byte: ‘same page’ flag, record type, and a part of length.
Note: The ‘same page’ flag can never be set on the INIT record, because the INIT record should only be issued for a freshly allocated page, and a single mini-transaction would not free and then allocate the same page. The ‘same page’ flag is also unlikely (but not impossible) to be set on the FREE record. Next, I will try to estimate the logged size of some operations when using these lower-level records. The most interesting ones are btr_page_reorganize() (which would use MEMMOVE) and page_move_rec_list_start() (which would use WRITE and MEMSET). The mode MTR_LOG_SHORT_INSERTS would be removed. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-06-14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Note: for index pages of ROW_FORMAT≠REDUNDANT, MLOG_COMP_ records will be written, with index field lengths. That will be omitted in the new format. Inserting records:Old format (MLOG_REC_INSERT and MLOG_LIST_END_COPY_CREATED), in page_cur_insert_rec_write_log():
Note: When the changes are applied by page_cur_parse_insert_rec(), the page header and footer will be updated by page_cur_rec_insert() without them having been mentioned in the log. Applying the record can cause a crash if the previous page contents is inconsistent with what was logged, say, if an incorrect version of the page is available. New format:
The ‘same page’ flag will be set on all but the first record for the page. Thus, the (space_id,page_number) will be logged only once. In the new format, we must explicitly update the page header and footer as well:
At the minimum, we must update 1 or 2 bytes of PAGE_LAST_INSERT and PAGE_N_RECS. The byte offset of both will fit in 1 byte, so these records will occupy 2 bytes plus the data length, that is, 2*(2+1) = 6 bytes minimum. For inserting multiple records, we need to update the page header and footer only once. Reorganizing a page: btr_page_reorganize_low()Old format:
Note: the index field lengths in MLOG_COMP_PAGE_REORGANIZE and MLOG_ZIP_PAGE_REORGANIZE can be very long. New format:
For reorganizing a page, we will obviously generate more log than earlier. Reorganizing pages should be a rather rare operation, so a possible size increase should be acceptable. Applying the records will be much simpler and faster. Creating an index page: page_empty(), page_create()
The log volume should not be larger than with the old logging. Writing an undo log record:Old format:
This is 1+1‥5+1‥5+2+length bytes, that is, 5‥13+length bytes. New format:
Size: 1+1+1‥5+1‥5+2 bytes for the first record, 1+1‥3+1‥3+length+4 bytes for the second (assuming that length + 4 > 15), or total 13‥25+length bytes. The overhead is 8‥12 bytes. Hopefully it will be more than compensated when logging record insertion (omitting index field length information). | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-06-16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The function page_copy_rec_list_end_no_locks() should be extended with an option for logging page reorganize operations. Otherwise, page reorganize would have to iterate over the record lists twice: first, to copy the records, and then, to write log for crash recovery. Reorganize can emit the smallest possible number of MEMMOVE records followed by WRITE records (to adjust the page header, next-record links and the footer) and a MEMSET to clear the unused area. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-06-19 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Deleting recordsSome overhead will be introduced for record deletion. In particular, MLOG_LIST_END_DELETE and MLOG_LIST_START_DELETE will only record the reference record. (Their MLOG_COMP_ variants will write the index field lengths as well, so in that case there could be savings.) In the new low-level format, we must log each deleted record separately:
Currently, page_create_empty() optimizes the case where the entire page becomes empty as a result of a deletion. We could optimize one more special case of deletion to reduce both redo log volume and the frequency of page reorganize operations: When deleting the last inserted record(s) such that the maximum heap number will be decremented, we could free the space altogether instead of putting the record to the PAGE_GARBAGE list for potential future same-or-smaller-size reallocation. In this way, the logging would be as follows:
Updating recordsIn the clustered index (which stores the data ordered by PRIMARY KEY), records can be ‘updated in place’ when neither the size nor the PRIMARY KEY of the record is not changing. In secondary indexes, ‘update in place’ is very rare, usually only happening when the case of a case-insensitive PRIMARY KEY is changing. If an ‘update in place’ is not possible, InnoDB will execute delete and insert in the same page and possibly split the page. All this is covered by operations described above. For ‘update in place’, we previously wrote a MLOG_REC_UPDATE_IN_PLACE record, which includes some information that can be useless:
In contrast to this, the new format would only write WRITE records, possibly optimized to MEMMOVE if the page already contains the to-be-written value somewhere else. MEMMOVE could save space not only for frequently occurring user column values, but also for logging | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-08-15 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
When it comes to data file creation, we should implement strict write-ahead logging. Even with
This is performing the change first, and logging it only after the fact. Currently, recovery never creates or initializes data files. If we did it properly, we could simply reduce it to the following:
Recovery would create the file if needed, and there would be no issue whatsoever if recovery would encounter an empty file or a file filled with zeroes. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-11-12 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
To prepare for this, I merged mtr_t::Impl and mtr_t::Command to mtr_t and removed unused or redundant data fields of mtr_t::Command. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-01-09 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I am still debugging changes of the new, efficient redo log record format. Several recovery tests are still failing, but some do pass. Once I have debugged the new redo log encoding (which I did not push yet), I will remove the code to recover from the old-format redo log. An upgrade after a crash of an earlier server would not be supported. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-01-22 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The branch now includes an implementation of the new format, with the exception of the same_page flag. That is, even if a mini-transaction is writing multiple subsequent records for the same page, it will encode the page identifier in each record. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-01-28 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I have now implemented the same_page encoding. In some cases, we are writing more redo log than earlier, and I must start assessing and fixing those. For inserting records, we can optimize some long WRITE records by emitting MEMMOVE for copying the preceding record, and then only writing the last bytes that differ between the records. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-02-11 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I optimized page_cur_insert_rec_low() and BtrBulk so that they write fewer WRITE records and try to copy data from the preceding record with MEMMOVE. A crude benchmark in I tested a variation of the microbenchmark that I originally developed for
I executed it as follows:
The test execution time seems comparable between the two branches. There is bound to be some variation, because log checkpoints and page flushing are nondeterministic. Up to the CREATE TABLE, the I specified innodb_force_recovery=2 in order to shut down the purge of transaction history. ( A debugging session with this test showed that the physical redo log recovery code is only invoking malloc() for adding log snippets to recv_sys.pages and related to some operations on file names. For the old redo log format, we would be invoking malloc() at least in mlog_parse_index(). This exercise revealed a recent performance regression which I fixed. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-02-14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Single-threaded performance is affected somewhat. Here is what I got for testing 10.5 immediately before/after
If we omit the MyISAM part of the test and use CREATE TEMPORARY TABLE t2 or persistent CREATE TABLE t2 for the InnoDB test, the difference becomes as follows:
There seems to be some performance regression that I had overlooked earlier. This ought to be due to the INSERT logging that we will be optimizing further in | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-02-14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The regression is due to increased redo log volume, and it does not affect small INSERT or UPDATE. It will be addressed in | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-02-14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The following test demonstrates that we are doing fine with in-place UPDATE:
Let us repeat the same in a single transaction (less overhead to manage undo log pages, or to flush the redo log):
In both cases, the compact physical log format outperforms the old format, both in terms of redo log written and time elapsed. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2020-02-14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
as far as I can tell, the page_cur_insert_rec_low() dominates the perf top for small (index) updates for the test case for | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-02-16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Before MDEV-21725 is completed, page reorganize operations will essentially copy the entire page payload to the redo log by invoking page_cur_insert_rec_low() for every record, instead of only writing a minimal amount of log to cover the physical changes to the page. Updating an indexed record will cause an insert, deferred delete (purge) and potentially a large number of page reorganize operations. Also, page splits and merges will cause deletions of record ranges, which has not been optimized yet either. (We write redundant log records for updating some page header fields multiple times, for example.) I think that in MDEV-21725, we should log the deletions of record ranges in the same way we would log a page reorganize (and combine the deletion with a reorganization). That should cure the observed regression. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-02-18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I now believe that | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-02-19 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I collected statistics on the amount of redo log written for the following SQL:
I used the following commands in GDB to identify the mini-transactions:
Here are the mini-transactions (excluding purge and log checkpoints):
The third column shows the impact of introducing higher-level log records that correspond to MLOG_UNDO_INSERT and roughly correspond to MLOG_UNDO_INIT. Due to a different encoding, we sometimes lose 1 byte in our UNDO_APPEND replacement of MLOG_UNDO_INSERT. The high-level log records for inserting index tree records will be introduced in For updates, the new encoding is outperforming the old encoding, as expected. |