[MDEV-18976] Implement a CHECKSUM redo log record for improved validation

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Fixed
Fix Version/s: 10.6.9, 10.7.5, 10.8.4, 10.9.2
Component/s: mariabackup, Storage Engine - InnoDB
Labels:

Description

The InnoDB redo log mostly uses physical addressing (byte offsets within a page). While MDEV-12353 introduced a physical log format, some operations, such as inserting a record (MDEV-21724), were optimized with special records that use some amount of logical addressing (depending on the existing contents of the data page).

If a page that is read during redo log apply is older than what the redo log expects (if some page writes were missed for any reason), then most redo log apply operations would happily corrupt the page further. The corruption might sometimes be caught when an insert operation is being applied.

We should introduce an option to generate new MDEV-12353 OPTION records (subtype CHECKSUM) at mtr_t::commit(). For every page that was modified by the mini-transaction, we would compute and write to the log a CRC-32C checksum of the uncompressed and unencrypted page contents, as well as the previous modification LSN of the page. The checksum would not necessarily be the same as the page checksum in data files. This checksum might not cover the FIL_PAGE_LSN field on the page, because that field would not be updated until page flushing takes place.
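As a rough illustration of the idea, the following sketch computes a CRC-32C over a page image while skipping the 8-byte FIL_PAGE_LSN field. It is not the MariaDB implementation: the function names, the constant names, and the choice of exactly which bytes to skip are assumptions made for this example, and a real server would use a hardware-accelerated CRC-32C rather than this bitwise loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
// 'crc' is the running state: start from 0xFFFFFFFF, invert when done.
inline uint32_t crc32c_update(uint32_t crc, const uint8_t* data, size_t len) {
  for (size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; bit++)
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  }
  return crc;
}

inline uint32_t crc32c(const uint8_t* data, size_t len) {
  return ~crc32c_update(0xFFFFFFFFu, data, len);
}

// Assumption for this sketch: FIL_PAGE_LSN is an 8-byte field at byte
// offset 16 of the page, and it is the only region left out.
constexpr size_t kFilPageLsn = 16;
constexpr size_t kLsnSize = 8;

// Hypothetical helper: checksum of the page contents, excluding FIL_PAGE_LSN.
inline uint32_t page_checksum_without_lsn(const std::vector<uint8_t>& page) {
  assert(page.size() >= kFilPageLsn + kLsnSize);
  uint32_t crc = crc32c_update(0xFFFFFFFFu, page.data(), kFilPageLsn);
  crc = crc32c_update(crc, page.data() + kFilPageLsn + kLsnSize,
                      page.size() - kFilPageLsn - kLsnSize);
  return ~crc;
}
```

With this shape, two page images that differ only in FIL_PAGE_LSN produce the same checksum, while a difference anywhere else changes it.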

When ‘applying’ a CHECKSUM record, recovery (or mariabackup --prepare) would compute the corresponding checksum of the page and compare it to the one that is written to the log record. It would also compare the FIL_PAGE_LSN on the page to the one in the CHECKSUM record.
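The comparison described above could look roughly like the following. This is only a sketch: ChecksumRecord, CheckResult, and check_page() are hypothetical names, the record layout is invented for illustration, and the caller is assumed to have already computed the page checksum and read FIL_PAGE_LSN from the page.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical in-memory form of a CHECKSUM record (illustrative fields only).
struct ChecksumRecord {
  uint64_t prev_page_lsn;  // previous modification LSN of the page
  uint32_t crc32c;         // checksum written to the log at mtr_t::commit()
};

enum class CheckResult { Ok, LsnMismatch, ChecksumMismatch };

// Compare the state of the page found during recovery against the record.
// page_lsn:     FIL_PAGE_LSN read from the page
// computed_crc: checksum recomputed from the page contents
inline CheckResult check_page(const ChecksumRecord& rec, uint64_t page_lsn,
                              uint32_t computed_crc) {
  if (page_lsn != rec.prev_page_lsn)
    return CheckResult::LsnMismatch;       // page is older or newer than expected
  if (computed_crc != rec.crc32c)
    return CheckResult::ChecksumMismatch;  // contents differ from what was logged
  return CheckResult::Ok;
}
```

What recovery should do on a mismatch (warn, refuse to start, or something else) is a separate policy decision that this sketch leaves open.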

Issue Links

causes

MDEV-29383 Assertion mysql_mutex_assert_owner(&log_sys.flush_order_mutex) failed in mtr_t::commit() (Closed)

is blocked by

MDEV-13542 Crashing on a corrupted page is unhelpful (Closed)
MDEV-24142 rw_lock_t has unnecessarily complex wait logic (Closed)
MDEV-24612 innodb hangs if it's initialization is broken before encryption threads are started (Closed)

relates to

MDEV-28840 innodb_undo_log_truncate is not crash-safe (Closed)
MDEV-29438 Recovery or backup of instant ALTER TABLE is incorrect (Closed)
MDEV-35796 OPT_PAGE_CHECKSUM is ignored if innodb_encrypt_log=ON (Stalled)
MDEV-12353 Efficient InnoDB redo log record format (Closed)
MDEV-12699 Improve crash recovery of corrupted data pages (Closed)
MDEV-24705 add check that LSN of the last skipped log record equals to FIL_PAGE_LSN field (Closed)
MDEV-30404 Inconsistent updates of PAGE_MAX_TRX_ID on ROW_FORMAT=COMPRESSED pages (Closed)

Activity

Vladislav Lesin added a comment - 2020-12-31 18:26

FIL_PAGE_LSN is set in mtr_t::commit() when ReleaseBlocks::operator()(mtr_memo_slot_t* slot) is called. But in recv_recover_page() it's set when all hashed records are applied:

static void recv_recover_page(buf_block_t* block, mtr_t& mtr,
                              const recv_sys_t::map::iterator& p,
                              fil_space_t* space = NULL,
                              mlog_init_t::init* init = NULL)
{
  ...
  for (const log_rec_t* recv : p->second.log) {
    ...
    log_phys_t::apply_status a= l->apply(*block,
                                         p->second.last_offset);
    ...
  }
  ...
  if (start_lsn) {
    ut_ad(end_lsn >= start_lsn);
    mach_write_to_8(FIL_PAGE_LSN + frame, end_lsn);
    if (UNIV_LIKELY(frame == block->frame)) {
      mach_write_to_8(srv_page_size
                      - FIL_PAGE_END_LSN_OLD_CHKSUM
                      + frame, end_lsn);
    }
    ...
    mysql_mutex_lock(&log_sys.flush_order_mutex);
    buf_flush_note_modification(block, start_lsn, end_lsn);
    mysql_mutex_unlock(&log_sys.flush_order_mutex);
  }
  ...
  mtr.commit();
  ...
}

So we can't currently compare FIL_PAGE_LSN at the time a CHECKSUM record is applied.
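The ordering issue can be shown with a toy model. This is not InnoDB code: ToyPage, ToyRecord, and toy_recover() are invented for this example. It mimics only the shape of recv_recover_page(), where the LSN in the page header is written once, after the loop over the buffered records.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct ToyPage { uint64_t header_lsn; int payload; };  // stand-in for a page
struct ToyRecord { uint64_t lsn; int delta; };         // stand-in for a log record

// Apply all records to the page. Returns the header LSN that each record
// observed at the moment it was applied.
inline std::vector<uint64_t> toy_recover(ToyPage& page,
                                         const std::vector<ToyRecord>& log) {
  std::vector<uint64_t> seen;
  uint64_t end_lsn = 0;
  for (const ToyRecord& r : log) {
    seen.push_back(page.header_lsn);  // still the pre-recovery value
    page.payload += r.delta;
    end_lsn = r.lsn;
  }
  if (end_lsn)
    page.header_lsn = end_lsn;  // updated only after the whole batch
  return seen;
}
```

Every record in the batch sees the same stale header LSN, so a record in the middle of the batch cannot use the header to check the LSN of the modification that preceded it.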

Vladislav Lesin added a comment - 2021-01-01 12:25 - edited

Pushed the bb-10.5-MDEV-18976-redolog-crc branch. The FIL_PAGE_LSN check is not implemented, for the reason described above.

There is no clear description of what must happen if the CRC does not match, so I implemented a version where a warning is written to the error log.

I did not find any way to test it other than injecting debug code. My initial intention was to change a page's FIL_PAGE_LSN so that recovery would start with the OPTION CHECKSUM record, in which case the page CRC would not match, but I did not find a way to pass or compute the LSN of the CHECKSUM record in an mtr test.

Marko Mäkelä added a comment - 2022-05-31 06:48

The recursive page latches during page allocation and BLOB operations are making it challenging to implement this. I got an https://rr-project.org trace of a checksum mismatch. For the recovery of a mini-transaction that deletes a BLOB page by page during the ROLLBACK of an INSERT, there will be a checksum mismatch for an allocation bitmap page, because that page had already been modified in a ‘parent’ mini-transaction that did not write its log yet. The parent mini-transaction holds exclusive latches on the clustered index leaf page as well as the allocation bitmap page. I think that we must specially flag those sub-mini-transactions, so that checksum records will only be written in the parent mini-transaction.
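One way to picture the proposed flagging is the toy model below. It is purely illustrative: ToyMtr and its members are hypothetical and do not correspond to mtr_t. A mini-transaction with a non-null parent emits no checksum records at commit and instead hands its modified pages to the parent, which emits checksums for all of them when it commits.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Toy stand-in for a mini-transaction (not the real mtr_t).
struct ToyMtr {
  ToyMtr* parent = nullptr;            // hypothetical flag: non-null = sub-mini-transaction
  std::set<uint32_t> modified_pages;   // page numbers modified so far

  void modify(uint32_t page_no) { modified_pages.insert(page_no); }

  // Returns the pages for which this commit would write checksum records.
  std::vector<uint32_t> commit() {
    if (parent) {
      // Defer: the parent may hold unlogged changes to the same pages.
      parent->modified_pages.insert(modified_pages.begin(), modified_pages.end());
      modified_pages.clear();
      return {};
    }
    std::vector<uint32_t> out(modified_pages.begin(), modified_pages.end());
    modified_pages.clear();
    return out;
  }
};
```

In the scenario above, the allocation bitmap page would be checksummed only once, after the parent's own changes to it have been logged as well.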

There are also some challenges to implement this for ROW_FORMAT=COMPRESSED pages, because there are two copies of the page in the buffer pool.

Marko Mäkelä added a comment - 2022-06-02 11:35

There is an anomaly where OPT_PAGE_CHECKSUM records could be emitted after a FREE_PAGE record. In this case, recovery would fail, because a checksum would be computed on a page that was freed during recovery.

This can happen during DROP INDEX (or DROP TABLE or any table-rebuilding DDL in the system tablespace). To prevent the anomaly, we must revise the logic of MDEV-15528 and mark as freed any buffer pool pages that were modified by the mini-transaction before being freed in the mini-transaction. That in turn will depend on sux_lock_t::x_lock_upgraded() that was introduced in MDEV-24142. Furthermore, this will depend on some other changes that were made as part of MDEV-13542.
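The intended effect can be sketched as a filter at commit time. This is a toy model under stated assumptions, not the actual change: ToyCommitState and pages_to_checksum() are hypothetical, and the real logic operates on buffer pool blocks and latches rather than on sets of page numbers.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Toy view of what one mini-transaction did (hypothetical structure).
struct ToyCommitState {
  std::set<uint32_t> modified;  // pages modified by the mini-transaction
  std::set<uint32_t> freed;     // pages it subsequently marked as freed
};

// Pages that should still get a checksum record at commit:
// anything modified that was not freed in the same mini-transaction.
inline std::vector<uint32_t> pages_to_checksum(const ToyCommitState& s) {
  std::vector<uint32_t> out;
  for (uint32_t page_no : s.modified)
    if (!s.freed.count(page_no))
      out.push_back(page_no);
  return out;
}
```

With such a filter, no checksum record would follow a FREE_PAGE record for the same page within one mini-transaction.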
