MariaDB Server / MDEV-18976

Implement a CHECKSUM redo log record for improved validation

Details

    Description

      The InnoDB redo log mostly uses physical addressing (byte offsets within a page). While MDEV-12353 introduced a physical log format, some operations, such as inserting a record (MDEV-21724), were optimized with special records that use some amount of logical addressing (depending on the existing contents of the data page).

      If a page that is read during redo log apply is older than what the redo log expects (if some page writes were missed for any reason), then most redo log apply operations would happily corrupt the page further. The corruption might sometimes be caught when an insert operation is being applied.

      We should introduce an option to generate new MDEV-12353 OPTION records (subtype CHECKSUM) at mtr_t::commit(). For every page that was modified by the mini-transaction, we would compute and write to the log a CRC-32C checksum of the uncompressed and unencrypted page contents, as well as the previous modification LSN of the page. The checksum would not necessarily be the same as the page checksum in data files. This checksum might not cover the FIL_PAGE_LSN field on the page, because that field would not be updated until page flushing takes place.
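A minimal sketch of the kind of checksum described above, assuming an 8-byte LSN field at byte offset 16 of the page. The names `crc32c`, `page_log_checksum`, `kPageLsnOffset` and `kPageLsnSize` are hypothetical and chosen only for this illustration, and the bitwise CRC-32C loop stands in for whatever optimized routine the server would use. A real implementation might have to exclude further fields, such as the stored page checksum or the LSN copy in the page trailer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78).
// Pass a previous result as `crc` to continue over a further byte range.
uint32_t crc32c(uint32_t crc, const uint8_t* data, size_t len) {
  crc = ~crc;
  for (size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; bit++)
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  }
  return ~crc;
}

// Assumption for this sketch: the page LSN occupies 8 bytes at offset 16.
constexpr size_t kPageLsnOffset = 16;
constexpr size_t kPageLsnSize = 8;

// Checksum of the page contents with the LSN field skipped, so that the
// result does not depend on a field that is only updated at flush time.
uint32_t page_log_checksum(const std::vector<uint8_t>& page) {
  assert(page.size() >= kPageLsnOffset + kPageLsnSize);
  const uint32_t head = crc32c(0, page.data(), kPageLsnOffset);
  const size_t tail = kPageLsnOffset + kPageLsnSize;
  return crc32c(head, page.data() + tail, page.size() - tail);
}
```

With this shape, rewriting the LSN field leaves the checksum unchanged, while a change to any other byte alters it.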

      When ‘applying’ a CHECKSUM record, recovery (or mariabackup --prepare) would compute the corresponding checksum of the page and compare it to the one that is written to the log record. It would also compare the FIL_PAGE_LSN on the page to the one in the CHECKSUM record.
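The comparison during recovery could look roughly like the sketch below. `ChecksumRecord`, `Verdict` and `check_page` are hypothetical names, the record layout is an assumption, and the LSN test is written as a plain equality check because the description does not specify anything more detailed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical in-memory form of a CHECKSUM record.
struct ChecksumRecord {
  uint32_t crc;       // CRC-32C of the page contents, as logged at commit
  uint64_t prev_lsn;  // previous modification LSN of the page
};

enum class Verdict { Ok, LsnMismatch, CrcMismatch };

// page_crc: checksum recomputed from the page during recovery.
// page_lsn: LSN currently recorded for the page.
Verdict check_page(uint32_t page_crc, uint64_t page_lsn,
                   const ChecksumRecord& rec) {
  // A differing LSN suggests that the page is not in the state that the
  // log expects, for example because a page write was missed.
  if (page_lsn != rec.prev_lsn) return Verdict::LsnMismatch;
  // Same LSN but different contents: the page does not match the log.
  if (page_crc != rec.crc) return Verdict::CrcMismatch;
  return Verdict::Ok;
}
```

Checking the LSN first lets the caller tell a stale page apart from one whose contents diverged at the expected LSN.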

          Activity

            FIL_PAGE_LSN is set in mtr_t::commit(), when ReleaseBlocks::operator()(mtr_memo_slot_t* slot) is called. In recv_recover_page(), however, it is set only after all hashed records have been applied:

            static void recv_recover_page(buf_block_t* block, mtr_t& mtr,                   
                        const recv_sys_t::map::iterator& p,                                 
                        fil_space_t* space = NULL,                                          
                        mlog_init_t::init* init = NULL)                                     
            {
            ...
              for (const log_rec_t* recv : p->second.log) {
            ...
                log_phys_t::apply_status a= l->apply(*block,                                
                             p->second.last_offset);
            ...
              }
            ...
              if (start_lsn) {                                                              
                ut_ad(end_lsn >= start_lsn);                                                
                mach_write_to_8(FIL_PAGE_LSN + frame, end_lsn);                             
                if (UNIV_LIKELY(frame == block->frame)) {                                   
                  mach_write_to_8(srv_page_size                                             
                      - FIL_PAGE_END_LSN_OLD_CHKSUM                                         
                      + frame, end_lsn);                                                    
                }
            ...                                          
                mysql_mutex_lock(&log_sys.flush_order_mutex);                               
                buf_flush_note_modification(block, start_lsn, end_lsn);                     
                mysql_mutex_unlock(&log_sys.flush_order_mutex);                             
              }
            ...
              mtr.commit(); 
            ...
            }
            

            So we currently can't compare FIL_PAGE_LSN at the time a CHECKSUM record is applied.
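A toy model of this ordering problem, not taken from the server code: `ToyRecord`, `ToyPage` and `apply_batch` are hypothetical, and the model assumes that each CHECKSUM record carries the page LSN from before its own mini-transaction. Because the header LSN is stamped once after the loop, every CHECKSUM record after the first one in a batch sees a stale header LSN.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct ToyRecord {
  uint64_t lsn;           // end LSN of the mini-transaction that wrote it
  bool is_checksum;       // true for a CHECKSUM record
  uint64_t expected_lsn;  // previous page LSN stored in a CHECKSUM record
};

struct ToyPage {
  uint64_t header_lsn;    // stands in for FIL_PAGE_LSN
};

// Applies all buffered records for one page and returns how many CHECKSUM
// records saw a header LSN different from the one they expected.
int apply_batch(ToyPage& page, const std::vector<ToyRecord>& log) {
  int mismatches = 0;
  uint64_t end_lsn = 0;
  for (const ToyRecord& rec : log) {
    if (rec.is_checksum && rec.expected_lsn != page.header_lsn)
      mismatches++;
    end_lsn = rec.lsn;
  }
  // As in the excerpt above, the header LSN is written only once,
  // after the whole batch has been applied.
  if (end_lsn) page.header_lsn = end_lsn;
  return mismatches;
}
```

In a batch with two mini-transactions, the second CHECKSUM record expects the end LSN of the first one, but the header still holds the LSN from before the batch.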

            vlad.lesin Vladislav Lesin added the comment above.
            vlad.lesin Vladislav Lesin added a comment (edited):

            Pushed the bb-10.5-MDEV-18976-redolog-crc branch. The FIL_PAGE_LSN check is not implemented, for the reason described above.

            There is no clear description of what must happen if the CRC does not match, so I implemented a version where a warning is written to the error log.

            I did not find a way to test it other than injecting debug code. My initial intention was to change a page's FIL_PAGE_LSN so that recovery would start with an OPTION CHECKSUM record, in which case the page CRC would not match, but I did not find a way to pass or compute the LSN of the CHECKSUM record in an mtr test.


            The recursive page latches during page allocation and BLOB operations are making it challenging to implement this. I got an https://rr-project.org trace of a checksum mismatch. For the recovery of a mini-transaction that deletes a BLOB page by page during the ROLLBACK of an INSERT, there will be a checksum mismatch for an allocation bitmap page, because that page had already been modified in a ‘parent’ mini-transaction that did not write its log yet. The parent mini-transaction holds exclusive latches on the clustered index leaf page as well as the allocation bitmap page. I think that we must specially flag those sub-mini-transactions, so that checksum records will only be written in the parent mini-transaction.

            There are also some challenges to implement this for ROW_FORMAT=COMPRESSED pages, because there are two copies of the page in the buffer pool.
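One way to picture the proposed flagging of sub-mini-transactions is sketched below. `ToyMtr`, its `parent` pointer and the hand-over of modified pages are assumptions made for illustration only; the comment above does not say how the flag would be represented.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

using PageId = uint64_t;

struct ToyMtr {
  ToyMtr* parent = nullptr;   // non-null marks a sub-mini-transaction
  std::set<PageId> modified;  // pages changed by this mini-transaction

  void modify(PageId id) { modified.insert(id); }

  // Returns the pages for which this commit would write CHECKSUM records.
  std::vector<PageId> commit() {
    std::vector<PageId> checksummed;
    if (parent) {
      // Flagged as a sub-mini-transaction: write no CHECKSUM records and
      // let the parent cover these pages when it commits.
      parent->modified.insert(modified.begin(), modified.end());
    } else {
      checksummed.assign(modified.begin(), modified.end());
    }
    modified.clear();
    return checksummed;
  }
};
```

In this model, the allocation bitmap page is checksummed once, when the parent commits and all of its own changes to that page have been logged.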

            marko Marko Mäkelä added the comment above.

            There is an anomaly where OPT_PAGE_CHECKSUM records could be emitted after a FREE_PAGE record. In this case, recovery would fail, because a checksum would be computed on a page that was freed during recovery.

            This can happen during DROP INDEX (or DROP TABLE or any table-rebuilding DDL in the system tablespace). To prevent the anomaly, we must revise the logic of MDEV-15528 and mark as freed any buffer pool pages that were modified by the mini-transaction before being freed in the mini-transaction. That in turn will depend on sux_lock_t::x_lock_upgraded() that was introduced in MDEV-24142. Furthermore, this will depend on some other changes that were made as part of MDEV-13542.
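The intended effect can be modelled as follows: pages that a mini-transaction both modified and freed are excluded when checksum records are generated at commit. `ToyMtrPages` and its members are hypothetical, and the sketch shows only the outcome, not the actual MDEV-15528 logic or the latching details mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <set>
#include <vector>

using PageId = uint64_t;

struct ToyMtrPages {
  std::set<PageId> modified;  // pages changed in this mini-transaction
  std::set<PageId> freed;     // pages freed in this mini-transaction

  void modify(PageId id) { modified.insert(id); }
  void free_page(PageId id) { freed.insert(id); }

  // Pages that would get a checksum record at commit: modified pages
  // minus those that were freed in the same mini-transaction.
  std::vector<PageId> pages_to_checksum() const {
    std::vector<PageId> out;
    std::set_difference(modified.begin(), modified.end(),
                        freed.begin(), freed.end(),
                        std::back_inserter(out));
    return out;
  }
};
```

With the freed pages filtered out, no checksum record would follow a FREE_PAGE record for the same page within the mini-transaction.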

            marko Marko Mäkelä added the comment above.

            People

              marko Marko Mäkelä
              marko Marko Mäkelä