[MDEV-24402] CHECK TABLE may miss some cases of index inconsistencies Created: 2020-12-14 Updated: 2023-02-24 Resolved: 2022-10-21 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.2.31 |
| Fix Version/s: | 10.6.11, 10.7.7, 10.8.6, 10.9.4, 10.10.2, 10.11.1 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Hartmut Holzgraefe | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | check, corruption | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
Looks as if some cases of InnoDB index inconsistencies – see e.g. InnoDB currently ignores any other CHECK TABLE attributes than QUICK (which will suppress the time-consuming checks of the upper levels of the index trees). Most importantly, InnoDB is ignoring the attribute EXTENDED, and is essentially only performing a MVCC COUNT(*) in each index. A verified case exists where the record counts in the indexes would match, yet the records do not match between the indexes. A true EXTENDED check should report any mismatch between the clustered index (primary key) and the secondary index records. Note: By the design of InnoDB MVCC (see MDEV-17598 for some discussion), secondary indexes may contain multiple copies of a record, while in the clustered index the multiple versions are formed via a singly-linked list that the undo log pages are part of. An extremely extended test would have to check that in any possible read view, the secondary indexes match the clustered index, instead of only checking the current read view. It may turn out that only a special ‘locking’ variant of CHECK TABLE would be able to detect truly any type of mismatch. |
| Comments |
| Comment by Marko Mäkelä [ 2021-04-26 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
CHECK TABLE…QUICK currently essentially performs SELECT COUNT(*) in each index of the table, using concurrent read, and complains if the counts differ. CHECK TABLE without any attributes or with any other attributes than QUICK would be the same, but additionally check all pointers in the non-leaf pages of the index trees. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-03-29 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Let us consider a table that contains a single row:
The secondary index (b,a) would contain 3 delete-marked records (1,1),(2,1),(3,1) and a record (4,1). All these will correspond to a version of the primary key record with a=1. The versions of the clustered index records are stored in a singly linked list of undo log records in undo pages, while the versions of secondary indexes are stored directly in the secondary index B-trees. An extended check that I have been discussed with thiru would work as follows. For every secondary index record, we would invoke something similar to row_vers_old_has_index_entry(), which is what the purge of history would invoke to determine if it is safe to remove a secondary index record. In that check, we would complain if a secondary index record is orphan, with no counterpart in the clustered index. There is a complication. As soon as a COMMIT is executed in the example transaction above, the purge_sys.view could be advanced to a point where only the latest version is visible, letting anyone instantly ‘think’ that all but the secondary index record (4,1) are orphan. Yet, the purge would not instantly remove any records from the secondary index. Possibly, we will have to accept that CHECK TABLE is nondeterministic and may find more orphan records after purge has completed. Or we must make some more of the purge state accessible to this proposed CHECK TABLE…EXTENDED. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Thirunarayanan Balathandayuthapani [ 2022-04-04 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Disabling the purge thread could avoid DB_MISSING_HISTORY error | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-09-15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The purge of committed transaction history would remove old secondary index record versions (which have to be delete-marked) by invoking row_purge_remove_sec_if_poss() on every affected secondary index. Only after that, it would do something to the clustered index record.
In each case, if CHECK TABLE is holding a latch on the index page of a valid secondary index record, that page latch will prevent purge from proceeding, and it should be safe to look up the clustered index record and construct earlier versions of it until a match is found. Assuming that there is no corruption, we can ignore the purge_sys.view limit that normally prevents us from accessing earlier versions. However, the purpose of CHECK TABLE is to check for corruption. When purge_sys.view does not allow us to access an earlier version that is identified by DB_ROLL_PTR, we must tread more carefully to detect a freed undo page. If the search for a matching clustered index record leads us to a definitely freed undo page, then the secondary index record would be an orphan one. It is a little unclear if we can reliably detect that purge has processed an undo page. At least in the case where trx_purge_remove_log_hdr() is invoked outside of trx_purge_free_segment(), it may be challenging to detect that the undo page has been freed by purge. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-09-15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
To investigate what happens during the purge, I ran the following test case:
This experiment shows that there was exactly one call to trx_purge_free_segment() to purge the entire log of the transaction. As expected, it did fit in a single undo log page. The parameter log=0x7fe00709af70 of trx_purge_remove_log_hdr() is the buffer block that was returned by trx_undo_assign_low() for each row operation of our transaction. Based on the watchpoint on FIL_PAGE_LSN, we can see that purge did not modify that undo log page. FIL_PAGE_LSN must be updated every time before an exclusive latch on a modified page is released. The undo page was last touched during COMMIT when its state was changed from active to committed. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-09-15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
In the case where trx_purge_remove_log_hdr() is invoked outside of trx_purge_free_segment(), it is unclear to me how the undo page would be reused by something else. The block would be detached from the TRX_RSEG_HISTORY list but not added to anything else. There is a field TRX_UNDO_TRX_ID that we might be able to compare to the DB_TRX_ID of the record version whose previous version we are trying to reconstruct. This field is available only in the first undo log page of a transaction (an undo log header page). | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-09-15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
thiru reminded me that trx_undo_set_state_at_finish() treats short undo logs (all undo records of a transaction fit in a single undo page) specially. The state TRX_UNDO_CACHED means that trx_undo_reuse_cached() can assign the unused tail of the undo log page to an unrelated transaction later. At the end of a purge batch, trx_purge_truncate_rseg_history() would only invoke trx_purge_free_segment() if it is processing the very last undo log in a page that is no longer being "cached for reuse" but is in the state TRX_UNDO_TO_PURGE. That is when the page would really be freed. Let me formulate the problem once more. We have a secondary index record and want to know it is orphan.
The purge_sys.end_view would be an addition to the current purge_sys.view, which is updated at the start of a purge batch, in purge_sys.clone_oldest_view() invoked by trx_purge(). The purge_sys.end_view would be copied from purge_sys.view at the end of a purge batch, right before trx_purge_truncate_rseg_history() starts freeing any qualifying TRX_UNDO_TO_PURGE logs. Not only would the purge_sys.end_view prevent the CHECK TABLE…EXTENDED from accessing potentially freed undo log pages, but it would also provide an accurate read view. Without this accurate read view, we might find a matching old version in something that has already been purged but happens to be available in a partially reused undo log page. That is, without having an accurate read view, we could fail to detect some orphan records. The record lookups will find any records that are present in secondary indexes but missing from the clustered index. To detect any records that are present in the clustered index but missing from secondary indexes, also CHECK TABLE…EXTENDED will have to count the records in each index, using the current read view of the transaction. In case of a count mismatch, no detailed diagnostics (such as the PRIMARY KEY of the record) can be provided. There are no real checks for SPATIAL INDEX or FULLTEXT INDEX, and neither will be improved by this. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-09-22 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
In Row_sel_get_clust_rec_for_mysql::operator() there is a comment that says that purgeable delete-marked secondary index records may become orphan if row_undo_mod_clust() removes a purgeable clustered index record during ROLLBACK. This means that before reporting orphan delete-marked secondary index records we must check that the PAGE_MAX_TRX_ID of the secondary index leaf page is visible in purge_sys.end_view. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-09-29 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The extended check found its first bug: | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-09-29 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
After
Also the following tests would fail the extended check:
Some of the partitioning tests occasionally pass, because there is no synchronization with the purge of committed transaction history. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-09-30 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
For innodb.innodb-index-online, the problem was that my EXTENDED check was using the wrong read view. It should have used purge_sys.end_view instead of purge_sys.view. The partitioning tests seem to show genuine trouble: some delete-marked secondary index records are not being purged in the purge coordinator batch that was supposed to do it. Those tests fail even when run with innodb_purge_threads=1. That would point towards | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-10-03 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The partition tests stop failing if I make trx_purge_attach_undo_recs() ignore the parameter innodb_purge_batch_size and copy all available undo log records to a purge batch. There is a comment about that also in trx_purge_truncate_history(). It looks like trx_purge() has to be refactored so that purge_sys.end_view will not be updated and undo logs will not be truncated before everything has been processed. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-10-05 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The following test (reconstructed and simplified from an rr replay trace) would occasionally fail in the current development branch:
If the wait for purge is enabled (the comment marks removed), then the test will not fail. The problem is in row_undo_mod_clust(): it must refer to the new purge_sys.end_view instead of purge_sys.view when deciding whether to discard the history for a clustered index record. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-10-06 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The extended check would still occasionally fail in the already mentioned partitioning tests. Even if I disable the innodb_purge_batch_size limit in trx_purge_attach_undo_recs(), tests will occasionally fail like this:
I can reproduce it under rr replay. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-10-07 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
On 2022-03-29 I wrote:
This is indeed the case. My attempted work-around in row_undo_mod_clust() would make matters significantly worse by causing ROLLBACK to leave behind genuinely orphan records in the clustered index. What seems to work well is to leave the ROLLBACK alone and only report orphan secondary index if all history up to the PAGE_MAX_TRX_ID should have been fully purged. That type of a check would still catch
In the server error log, each orphan record will be reported:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-10-11 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The following test would occasionally display warnings in CHECK TABLE…EXTENDED:
Another variant of this message (not observed for this test) is:
Even if I disable the innodb_purge_batch_size limit in trx_purge_attach_undo_recs(), the CHECK TABLE would report warnings every now and then. The CHECK TABLE t2 EXTENDED will stop issuing those warnings only if I add a slow shutdown after the start_mysqld.inc:
Due to the way how purge works across server restart, I do not think that it is feasible to suppress these warnings. I did try invoking end_view.clamp_low_limit_id() in trx_lists_init_at_db_start(), similar to how purge_sys_t::clone_end_view() does it. I also tried marking the initial purge_sys.end_view as ‘forged’ and checking this status before issuing these warnings, but it would only suppress the warnings until the first call of purge_sys.clone_end_view(), while we would want to suppress the warning until all history had been purged. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-10-13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The trouble that I previously observed with some partitioning tests is also observable with the following nondeterministic test that I extracted and simplified from an rr replay trace:
In the ROLLBACK, it is possible that purge_sys.view has been advanced to the end, but the committed delete-marked secondary index record (b,a)=(4,5) for the first INSERT has not been purged. In this case, the ROLLBACK would delete the records (5,5) in both indexes. The CHECK TABLE would report a failure because purge had not run yet. The secondary index record would eventually be purged. This can be verified by adding the following lines to the test before or after the ROLLBACK, to wait for the purge of history to finish:
This failure can be prevented by tweaking row_undo_mod_must_purge() so that it will refer to the new purge_sys.end_view instead of the purge_sys.view. In that way, no seemingly orphaned secondary index records will exist. This fix might also address the recovery related false alarms of my previous comment. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-10-13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I was too quick to report success. If I tweak the row_undo_mod_must_purge(), it will cause a genuine failure like this:
(This was with TINYINT UNSIGNED instead of INT.) The clustered index record is (a,DB_TRX_ID,DB_ROLL_PTR,b)=(5,0xcaa,update,4). What seems to work is a tweak to the new CHECK TABLE…EXTENDED check so that if not everything up to the PAGE_MAX_TRX_ID of the page of the secondary index record has been purged yet, we will not report the missing record. Previously this rule had been implemented for seemingly missing older versions of the clustered index record, but not for a completely missing clustered index record. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-10-13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I got the restart test from 2022-10-11 to work reliably (60×100 repetitions). It used to fail some 10% into the test run, even with the above mentioned fix. In trx_lists_init_at_db_start(), we must initialize purge_sys.end_view in a similar way to how purge_sys_t::clone_end_view() does it, to deal with partially completed purge batches. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-10-17 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I got an RQG grammar that would occasionally report orphan delete-marked secondary index records, similar to The error is not easily fixed. We might relax the check and ‘wave through’ cases where a matching clustered index record cannot be found at all. Other cases (where a record with a primary key is found for a newer version of the record, but not for one that matches the secondary index record) would still be flagged:
Alternatively, we could issue warnings for this but not flag the secondary index as corrupted. After all, this type of corruption should only hurt performance, not cause any correctness issues. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Matthias Leich [ 2022-10-17 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Matthias Leich [ 2022-10-18 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-10-18 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Thank you, mleich. The test case corresponds to the scenario that I was analyzing yesterday, now filed as MDEV-29823. It is a ‘memory leak’ (failure to remove a secondary index that is no longer needed), but not one that can be fixed easily. Therefore, I would report it as a warning but not flag the secondary index as corrupted. MVCC reads as well as row_vers_impl_x_locked_low() will silently ignore delete-marked secondary index records for which no counterpart exists in the clustered index. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Lesin [ 2022-10-21 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The logic of row_check_index() and the code around purge_sys.end_view looks correct to me. |