[MDEV-28349] Provide "crash safe" options for CHECK TABLE and ALTER TABLE ... CHECK PARTITION ... Created: 2022-04-19  Updated: 2023-07-11

Status: Open
Project: MariaDB Server
Component/s: Data Definition - Alter Table, Storage Engine - InnoDB
Fix Version/s: None

Type: Task
Priority: Major
Reporter: Valerii Kravchuk
Assignee: Unassigned
Resolution: Unresolved
Votes: 1
Labels: check, corruption, innochecksum

Issue Links:
Blocks
is blocked by MDEV-21098 Crash in rec_get_offsets_func() due t... Closed
is blocked by MDEV-22388 Corrupted undo log record leads to se... Closed
is blocked by MDEV-24402 CHECK TABLE may miss some cases of in... Closed
is blocked by MDEV-28457 Crash in page_dir_find_owner_slot() Closed
is blocked by MDEV-29201 Crash in row_search_mvcc() Closed
is blocked by MDEV-29976 "InnoDB: Failing assertion" when usin... Closed
Relates
relates to MDEV-13542 Crashing on a corrupted page is unhel... Closed
relates to MDEV-27734 Set innodb_change_buffering=none by d... Closed

 Description   

We need a safe way to run table checks for InnoDB tables in production with statements like CHECK TABLE or ALTER TABLE ... CHECK PARTITION, one that will NOT cause any deliberate assertion failures.

Something like the innodb_corrupt_table_action option, which has been deprecated and removed since 10.3 (the name may be different), for these statements (or for all access), with values like "assert" (current behaviour), "warn" (add the details about the corruption found and continue if possible, or stop and state that the table/partition is corrupted), etc. This should apply NOT only to page checksums, but to all other kinds of assertions we may hit in InnoDB in the process.
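
For reference, a minimal sketch of the two statement forms this request is about. The table names, column definitions and partition layout (t1, t_part, p0, p1) are hypothetical and only serve as illustration; no new option is shown, because its name and values are still undecided.

```sql
-- Hypothetical tables, for illustration only.
CREATE TABLE t1 (id INT PRIMARY KEY, v VARCHAR(32)) ENGINE=InnoDB;

CREATE TABLE t_part (id INT PRIMARY KEY) ENGINE=InnoDB
  PARTITION BY RANGE (id) (
    PARTITION p0 VALUES LESS THAN (1000),
    PARTITION p1 VALUES LESS THAN MAXVALUE
  );

-- Check a whole table.
CHECK TABLE t1;

-- Check a single partition of a partitioned table.
ALTER TABLE t_part CHECK PARTITION p0;
```

The request is that, on corrupted data, both statements report the problem in their result set instead of hitting an assertion that takes the whole server down.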



 Comments   
Comment by Marko Mäkelä [ 2022-04-19 ]

I am afraid that crashes in CHECK TABLE due to change buffer corruption (see MDEV-27734) are very hard to avoid, because the change buffer merge occurs at a very low level of the code.

Comment by Marko Mäkelä [ 2022-05-23 ]

I believe that I implemented most of this in MDEV-13542 today, but I have no easy way to guarantee that.

Perhaps mleich could take fault injection to the next level and play a "crazy DBA" who would attempt to back up a running database with rsync instead of proper mariadb-backup or file system snapshots. If the corrupted backup does not cause InnoDB to crash, we should have a winner.

Comment by Marko Mäkelä [ 2022-06-07 ]

MDEV-13542 fixed a lot, but not everything. At least MDEV-21098, MDEV-22388, and MDEV-28457 could theoretically still cause a crash in CHECK TABLE.

Comment by Marko Mäkelä [ 2022-08-01 ]

I am afraid that avoiding all crashes in CHECK TABLE requires avoiding crashes in all of InnoDB. CHECK TABLE shares a lot of code with normal multi-versioned (MVCC) repeatable read.

valerii, the obvious sources of crashes (including many in CHECK TABLE) should be fixed in the upcoming 10.6.9 release. Can you (or anyone else) provide examples of remaining crashes on corrupted data?

Comment by Marko Mäkelä [ 2022-08-01 ]

MDEV-29201 is another case that I think could crash in CHECK TABLE equally well. It may already have been fixed by MDEV-13542.

Comment by Marko Mäkelä [ 2022-11-02 ]

The CHECK TABLE record-counting code was rewritten in MDEV-24402. It will share less code with DML statements, such as SELECT. Because the implementation is simpler, it could be even less prone to crashing.

Comment by Marko Mäkelä [ 2022-11-02 ]

valerii, have any crashes been observed with MariaDB Server 10.6.9 or later? (MDEV-24402 was implemented in 10.6.11.)

Comment by Valerii Kravchuk [ 2022-11-07 ]

Are we sure that there is a version (which one, 10.6.9?) where CHECK TABLE and ALTER TABLE ... CHECK PARTITION ... statements are entirely safe, in the sense that when the statement finds any corruption or problem, it reports it, maybe does something else, but lets the server (other threads) continue working? If so, the task can be closed IMHO. I doubt we are at this stage already, though.

Comment by Marko Mäkelä [ 2022-11-07 ]

valerii, I agree that it is better to keep this ticket open for a few more months, to find practical examples where CHECK TABLE would crash.

Just today, related to MDEV-28797, I became aware of MDEV-29976, which is a possible crash when a particular form of corruption is encountered in a ROW_FORMAT=COMPRESSED page.
The reported crash is on the "write" side, but there are some intentional crashes in code invoked by page_zip_decompress() as well. That code can be invoked by CHECK TABLE also after the MDEV-24402 rewrite.

Comment by Marko Mäkelä [ 2023-03-06 ]

valerii, a few months have passed. MDEV-29976 might be a duplicate of MDEV-28797.

Since CHECK TABLE shares quite a lot of code with the rest of InnoDB even after MDEV-24402, it is impossible to guarantee that no crashes are possible. Fixing any remaining crashes (such as MDEV-30787, affecting only ROW_FORMAT=REDUNDANT tables) is only possible if we can get copies of the corrupted pages.

Generated at Thu Feb 08 10:00:02 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.