[MDEV-18976] Implement a CHECKSUM redo log record for improved validation Created: 2019-03-20  Updated: 2023-01-26  Resolved: 2022-06-06

Status: Closed
Project: MariaDB Server
Component/s: mariabackup, Storage Engine - InnoDB
Fix Version/s: 10.6.9, 10.7.5, 10.8.4, 10.9.2

Type: Task
Priority: Major
Reporter: Marko Mäkelä
Assignee: Marko Mäkelä
Resolution: Fixed
Votes: 3
Labels: backup, corruption, recovery

Issue Links:
Blocks
is blocked by MDEV-13542 Crashing on a corrupted page is unhel... Closed
is blocked by MDEV-24142 rw_lock_t has unnecessarily complex w... Closed
is blocked by MDEV-24612 innodb hangs if it's initialization i... Closed
Problem/Incident
causes MDEV-29383 Assertion mysql_mutex_assert_owner(&l... Closed
Relates
relates to MDEV-28840 innodb_undo_log_truncate is not crash... Closed
relates to MDEV-29438 Recovery or backup of instant ALTER T... Closed
relates to MDEV-12353 Efficient InnoDB redo log record format Closed
relates to MDEV-12699 Improve crash recovery of corrupted d... Closed
relates to MDEV-24705 add check that LSN of the last skippe... Closed
relates to MDEV-30404 Inconsistent updates of PAGE_MAX_TRX_... Closed

 Description   

The InnoDB redo log mostly uses physical addressing (byte offsets within a page). While MDEV-12353 introduced a physical log format, some operations, such as inserting a record (MDEV-21724), were optimized with special records that use some amount of logical addressing (depending on the existing contents of the data page).

If a page that is read during redo log apply is older than what the redo log expects (if some page writes were missed for any reason), then most redo log apply operations would happily corrupt the page further. The corruption might sometimes be caught when an insert operation is being applied.

We should introduce an option to generate new MDEV-12353 OPTION records (subtype CHECKSUM) at mtr_t::commit(). For every page that was modified by the mini-transaction, we would compute and write to the log a CRC-32C checksum of the uncompressed and unencrypted page contents, as well as the previous modification LSN of the page. The checksum would not necessarily be the same as the page checksum in data files. This checksum might not cover the FIL_PAGE_LSN field on the page, because that field would not be updated until page flushing takes place.
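To make the proposal concrete, here is a minimal sketch of a CRC-32C over a page frame that skips the FIL_PAGE_LSN field. It is an illustration under stated assumptions, not the code that was eventually pushed: `page_log_checksum()` is a hypothetical helper, the bitwise CRC loop stands in for whatever accelerated routine the server would use, and the copy of the LSN in the page trailer is not treated specially.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// FIL_PAGE_LSN is the 8-byte field at byte offset 16 of the page header.
constexpr std::size_t FIL_PAGE_LSN = 16;
constexpr std::size_t LSN_SIZE = 8;

// Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78).
// Takes and returns the raw running state, without the final inversion.
inline uint32_t crc32c_update(uint32_t crc, const uint8_t *data,
                              std::size_t len)
{
  for (std::size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; bit++)
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  }
  return crc;
}

// Plain CRC-32C of a buffer (initial value and final XOR of 0xFFFFFFFF).
inline uint32_t crc32c(const uint8_t *data, std::size_t len)
{
  return crc32c_update(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}

// Hypothetical helper (not an InnoDB function): checksum of a page frame
// that covers everything except the FIL_PAGE_LSN field.
inline uint32_t page_log_checksum(const uint8_t *frame, std::size_t page_size)
{
  uint32_t crc = 0xFFFFFFFFu;
  crc = crc32c_update(crc, frame, FIL_PAGE_LSN);
  crc = crc32c_update(crc, frame + FIL_PAGE_LSN + LSN_SIZE,
                      page_size - FIL_PAGE_LSN - LSN_SIZE);
  return crc ^ 0xFFFFFFFFu;
}
```

With this helper, two frames that differ only in FIL_PAGE_LSN produce the same value, while a change to any other byte produces a different one.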

When ‘applying’ a CHECKSUM record, recovery (or mariabackup --prepare) would compute the corresponding checksum of the page and compare it to the one that is written to the log record. It would also compare the FIL_PAGE_LSN on the page to the one in the CHECKSUM record.
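The comparison described above could look roughly like the following sketch. The `ChecksumRecord` layout, the `ApplyCheck` enum and `check_checksum_record()` are hypothetical names chosen for illustration; the actual record encoding is not specified here. Checking the LSN first is an assumption of this sketch: a stale or too-new page would usually fail the checksum as well, and the LSN mismatch is the more informative diagnosis.

```cpp
#include <cstdint>

// Hypothetical in-memory form of a parsed CHECKSUM record.
struct ChecksumRecord {
  uint64_t page_lsn;  // FIL_PAGE_LSN that the page is expected to carry
  uint32_t crc;       // CRC-32C that the page contents are expected to have
};

enum class ApplyCheck { OK, LSN_MISMATCH, CRC_MISMATCH };

// Compare what recovery finds on the page against the logged values.
// page_lsn and page_crc are computed by the caller from the page frame.
inline ApplyCheck check_checksum_record(const ChecksumRecord &rec,
                                        uint64_t page_lsn, uint32_t page_crc)
{
  if (page_lsn != rec.page_lsn)
    return ApplyCheck::LSN_MISMATCH;  // page is older or newer than expected
  if (page_crc != rec.crc)
    return ApplyCheck::CRC_MISMATCH;  // same LSN, different contents
  return ApplyCheck::OK;
}
```

What the caller does with a mismatch (warn, refuse to start, or abort the backup) is a separate policy decision that this sketch leaves open.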



 Comments   
Comment by Vladislav Lesin [ 2020-12-31 ]

FIL_PAGE_LSN is set in mtr_t::commit() when ReleaseBlocks::operator()(mtr_memo_slot_t* slot) is called. But in recv_recover_page() it's set when all hashed records are applied:

static void recv_recover_page(buf_block_t* block, mtr_t& mtr,
            const recv_sys_t::map::iterator& p,
            fil_space_t* space = NULL,
            mlog_init_t::init* init = NULL)
{
...
  for (const log_rec_t* recv : p->second.log) {
...
    log_phys_t::apply_status a= l->apply(*block,
                 p->second.last_offset);
...
  }
...
  if (start_lsn) {
    ut_ad(end_lsn >= start_lsn);
    mach_write_to_8(FIL_PAGE_LSN + frame, end_lsn);
    if (UNIV_LIKELY(frame == block->frame)) {
      mach_write_to_8(srv_page_size
          - FIL_PAGE_END_LSN_OLD_CHKSUM
          + frame, end_lsn);
    }
...
    mysql_mutex_lock(&log_sys.flush_order_mutex);
    buf_flush_note_modification(block, start_lsn, end_lsn);
    mysql_mutex_unlock(&log_sys.flush_order_mutex);
  }
...
  mtr.commit();
...
}

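So we currently cannot compare FIL_PAGE_LSN at the time a CHECKSUM record is applied: during recovery, the field still holds the value that was read from the data file until the whole batch of records for the page has been applied.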
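The ordering problem can be reproduced in a toy model. Everything below is hypothetical (`ToyPage`, `ToyRecord`, `toy_recover_page()`) and only mimics the control flow of the excerpt above: records are applied in a loop, and the LSN field is written once after the loop.

```cpp
#include <cstdint>
#include <vector>

// Toy stand-ins for a page and a redo log record; not InnoDB types.
struct ToyPage { uint64_t fil_page_lsn; };
struct ToyRecord {
  uint64_t lsn;                // LSN of this record
  bool is_checksum;            // true for a CHECKSUM record
  uint64_t expected_page_lsn;  // only meaningful if is_checksum
};

// Applies all records, then stamps the page LSN once, like the excerpt.
// Returns how many CHECKSUM records saw an unexpected LSN on the page.
inline unsigned toy_recover_page(ToyPage &page,
                                 const std::vector<ToyRecord> &log)
{
  unsigned mismatches = 0;
  uint64_t end_lsn = 0;
  for (const ToyRecord &rec : log) {
    // The page still carries the LSN it had when it was read from disk.
    if (rec.is_checksum && rec.expected_page_lsn != page.fil_page_lsn)
      mismatches++;
    end_lsn = rec.lsn;
  }
  if (end_lsn)
    page.fil_page_lsn = end_lsn;  // written only after the whole batch
  return mismatches;
}
```

For a page read with LSN 100 and two logged mini-transactions, the first CHECKSUM record (expecting 100) matches, but the second one (expecting the LSN of the first mini-transaction) reports a false mismatch even though the page is intact.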

Comment by Vladislav Lesin [ 2021-01-01 ]

Pushed the bb-10.5-MDEV-18976-redolog-crc branch. The FIL_PAGE_LSN check is not implemented, for the reason described above.

There is no clear description of what must happen if the CRC does not match, so I implemented a version where a warning is written to the error log.

I did not find any way to test it other than injecting debug code. My initial intention was to change a page's FIL_PAGE_LSN so that recovery would start with the OPTION CHECKSUM record, in which case the page CRC would not match, but I did not find a way to pass or compute the LSN of the CHECKSUM record in an mtr test.

Comment by Marko Mäkelä [ 2022-05-31 ]

The recursive page latches during page allocation and BLOB operations are making it challenging to implement this. I got an rr (https://rr-project.org) trace of a checksum mismatch. For the recovery of a mini-transaction that deletes a BLOB page by page during the ROLLBACK of an INSERT, there will be a checksum mismatch for an allocation bitmap page, because that page had already been modified in a ‘parent’ mini-transaction that did not write its log yet. The parent mini-transaction holds exclusive latches on the clustered index leaf page as well as the allocation bitmap page. I think that we must specially flag those sub-mini-transactions, so that checksum records will only be written in the parent mini-transaction.

There are also some challenges to implement this for ROW_FORMAT=COMPRESSED pages, because there are two copies of the page in the buffer pool.

Comment by Marko Mäkelä [ 2022-06-02 ]

There is an anomaly where OPT_PAGE_CHECKSUM records could be emitted after a FREE_PAGE record. In this case, recovery would fail, because a checksum would be computed on a page that was freed during recovery.

This can happen during DROP INDEX (or DROP TABLE or any table-rebuilding DDL in the system tablespace). To prevent the anomaly, we must revise the logic of MDEV-15528 and mark as freed any buffer pool pages that were modified by the mini-transaction before being freed in the mini-transaction. That in turn will depend on sux_lock_t::x_lock_upgraded() that was introduced in MDEV-24142. Furthermore, this will depend on some other changes that were made as part of MDEV-13542.
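The intended effect on checksum emission can be sketched with a toy mini-transaction that tracks which pages it modified and which it freed. `ToyMtr` and its members are hypothetical and ignore latching entirely; the only point is that a page freed within the mini-transaction must not get a checksum record at commit.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Toy model of a mini-transaction; not the real mtr_t.
struct ToyMtr {
  std::set<uint32_t> modified;  // pages modified by this mini-transaction
  std::set<uint32_t> freed;     // pages freed by this mini-transaction

  void modify(uint32_t page_no) { modified.insert(page_no); }
  void free_page(uint32_t page_no) { freed.insert(page_no); }

  // Pages for which a checksum record would be written at commit:
  // every modified page that was not freed afterwards.
  std::vector<uint32_t> checksum_pages() const
  {
    std::vector<uint32_t> pages;
    for (uint32_t page_no : modified)
      if (!freed.count(page_no))
        pages.push_back(page_no);
    return pages;
  }
};
```

For example, after `modify(3)`, `modify(5)` and `free_page(5)`, only page 3 remains eligible for a checksum record.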
