[MDEV-24402] CHECK TABLE may miss some cases of index inconsistencies Created: 2020-12-14  Updated: 2023-02-24  Resolved: 2022-10-21

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.2.31
Fix Version/s: 10.6.11, 10.7.7, 10.8.6, 10.9.4, 10.10.2, 10.11.1

Type: Bug Priority: Critical
Reporter: Hartmut Holzgraefe Assignee: Marko Mäkelä
Resolution: Fixed Votes: 2
Labels: check, corruption

Issue Links:
Blocks
blocks MDEV-28349 Provide "crash safe" options for CHEC... Open
Problem/Incident
causes MDEV-29886 Assertion `!index->table->is_temporar... Closed
causes MDEV-29978 Corruption errors upon CHECK on tempo... Closed
Relates
relates to MDEV-25459 MVCC read from index on CHAR or VARCH... Closed
relates to MDEV-29593 Purge misses a chance to free not-yet... Closed
relates to MDEV-29666 InnoDB fails to purge secondary index... Closed
relates to MDEV-29823 Secondary index records may never be ... Open
relates to MDEV-9663 InnoDB assertion failure: *cursor->in... Closed
relates to MDEV-11802 innodb.innodb_bug14676111 fails in bu... Closed
relates to MDEV-22373 Unable to find a record to delete-mar... Closed
relates to MDEV-29954 Unique hash key on column prefix is c... Closed
relates to MDEV-30129 MySQL 5.7 --> MariaDB 10.8 switch: my... Closed

 Description   

Looks as if some cases of InnoDB index inconsistencies – see e.g. MDEV-22373 – are missed by CHECK TABLE.

InnoDB currently ignores any other CHECK TABLE attributes than QUICK (which will suppress the time-consuming checks of the upper levels of the index trees).

Most importantly, InnoDB is ignoring the attribute EXTENDED, and is essentially only performing a MVCC COUNT(*) in each index. A verified case exists where the record counts in the indexes would match, yet the records do not match between the indexes. A true EXTENDED check should report any mismatch between the clustered index (primary key) and the secondary index records.

Note: By the design of InnoDB MVCC (see MDEV-17598 for some discussion), secondary indexes may contain multiple copies of a record, while in the clustered index the multiple versions are formed via a singly-linked list that the undo log pages are part of.

An extremely extended test would have to check that in any possible read view, the secondary indexes match the clustered index, instead of only checking the current read view.

It may turn out that only a special ‘locking’ variant of CHECK TABLE would be able to detect truly any type of mismatch.



 Comments   
Comment by Marko Mäkelä [ 2021-04-26 ]

CHECK TABLE…QUICK currently essentially performs SELECT COUNT(*) in each index of the table, using concurrent read, and complains if the counts differ. MDEV-25459 was a bug in the MVCC implementation of secondary indexes that would also cause false alarms in CHECK TABLE.

CHECK TABLE without any attributes or with any other attributes than QUICK would be the same, but additionally check all pointers in the non-leaf pages of the index trees.

Comment by Marko Mäkelä [ 2022-03-29 ]

Let us consider a table that contains a single row:

CREATE TABLE t1(a INT PRIMARY KEY, b INT UNIQUE) ENGINE=InnoDB;
BEGIN;
INSERT INTO t1 VALUES(1,1);
UPDATE t1 SET b=2;
UPDATE t1 SET b=3;
UPDATE t1 SET b=4;

The secondary index (b,a) would contain 3 delete-marked records (1,1),(2,1),(3,1) and a record (4,1). All these will correspond to a version of the primary key record with a=1. The versions of the clustered index records are stored in a singly linked list of undo log records in undo pages, while the versions of secondary indexes are stored directly in the secondary index B-trees.

An extended check that I have been discussed with thiru would work as follows. For every secondary index record, we would invoke something similar to row_vers_old_has_index_entry(), which is what the purge of history would invoke to determine if it is safe to remove a secondary index record. In that check, we would complain if a secondary index record is orphan, with no counterpart in the clustered index.

There is a complication. As soon as a COMMIT is executed in the example transaction above, the purge_sys.view could be advanced to a point where only the latest version is visible, letting anyone instantly ‘think’ that all but the secondary index record (4,1) are orphan. Yet, the purge would not instantly remove any records from the secondary index. Possibly, we will have to accept that CHECK TABLE is nondeterministic and may find more orphan records after purge has completed. Or we must make some more of the purge state accessible to this proposed CHECK TABLE…EXTENDED.

Comment by Thirunarayanan Balathandayuthapani [ 2022-04-04 ]

Disabling the purge thread could avoid DB_MISSING_HISTORY error

Comment by Marko Mäkelä [ 2022-09-15 ]

The purge of committed transaction history would remove old secondary index record versions (which have to be delete-marked) by invoking row_purge_remove_sec_if_poss() on every affected secondary index. Only after that, it would do something to the clustered index record.

  • For TRX_UNDO_DEL_MARK_REC, row_purge_del_mark() will delete the record in each index if possible.
    1. row_purge_remove_sec_if_poss(() consults row_purge_poss_sec() and will not delete a secondary index record if a clustered index record with the same PRIMARY KEY and secondary key value was inserted later.
    2. row_purge_remove_clust_if_poss() will not delete the record if its DB_ROLL_PTR differs from the undo record (the record was re-inserted later).
  • For most other undo log record types, row_purge_upd_exist_or_extern() will
    1. invoke row_purge_remove_sec_if_poss() on indexes that were affected by an UPDATE of an indexed column
    2. reset the DB_TRX_ID in the clustered index record

In each case, if CHECK TABLE is holding a latch on the index page of a valid secondary index record, that page latch will prevent purge from proceeding, and it should be safe to look up the clustered index record and construct earlier versions of it until a match is found. Assuming that there is no corruption, we can ignore the purge_sys.view limit that normally prevents us from accessing earlier versions.

However, the purpose of CHECK TABLE is to check for corruption. When purge_sys.view does not allow us to access an earlier version that is identified by DB_ROLL_PTR, we must tread more carefully to detect a freed undo page. If the search for a matching clustered index record leads us to a definitely freed undo page, then the secondary index record would be an orphan one.

It is a little unclear if we can reliably detect that purge has processed an undo page. At least in the case where trx_purge_remove_log_hdr() is invoked outside of trx_purge_free_segment(), it may be challenging to detect that the undo page has been freed by purge.

Comment by Marko Mäkelä [ 2022-09-15 ]

To investigate what happens during the purge, I ran the following test case:

--source include/have_innodb.inc
CREATE TABLE t1(a INT PRIMARY KEY, b INT UNIQUE) ENGINE=InnoDB
STATS_PERSISTENT=0;
SET GLOBAL innodb_max_purge_lag_wait=0;
BEGIN;
INSERT INTO t1 VALUES(1,1);
UPDATE t1 SET b=2;
UPDATE t1 SET b=3;
UPDATE t1 SET b=4;
COMMIT;
SET GLOBAL innodb_max_purge_lag_wait=0;
DROP TABLE t1;

./mtr --rr --mysqld=--innodb-purge-threads=1 name_of_above_test
rr replay var/log/mysqld.1.rr/latest-trace

break ha_innobase::update_row
ignore 1 2
continue
break trx_undo_assign_low
break trx_purge_remove_log_hdr
break ha_innobase::delete_table
continue

10.6 fd0bdd3180a7d5f4b9804d372d6a63b6a202818c

(rr) continue
Continuing.
 
Thread 2 hit Breakpoint 2, trx_undo_assign_low (trx=0x7fe0078feb80, rseg=0x555ef01c2b80 <trx_sys+17024>, undo=0x7fe0078ff4d0, err=0x7fdff575ef18, mtr=0x7fdff575ef60) at /mariadb/10.6/storage/innobase/trx/trx0undo.cc:1400
1400		ut_d(const bool	is_temp = rseg == trx->rsegs.m_noredo.rseg);
(rr) finish
Run till exit from #0  trx_undo_assign_low (trx=0x7fe0078feb80, 
    rseg=0x555ef01c2b80 <trx_sys+17024>, undo=0x7fe0078ff4d0, 
    err=0x7fdff575ef18, mtr=0x7fdff575ef60)
    at /mariadb/10.6/storage/innobase/trx/trx0undo.cc:1400
0x0000555eef51a87c in trx_undo_report_row_operation (thr=0x7fdfdc029bc8, index=0x7fdfdc19eee0, clust_entry=0x0, update=0x7fdfdc0299f8, cmpl_info=0, rec=0x7fe00756c07e "\200", offsets=0x7fdff575fda0, roll_ptr=0x7fdff575f510) at /mariadb/10.6/storage/innobase/trx/trx0rec.cc:2066
2066		buf_block_t*	undo_block = trx_undo_assign_low(trx, rseg, pundo,
Value returned is $1 = (buf_block_t *) 0x7fe00709af70
(rr) p $1.page.id_.m_id
$2 = 308
(rr) continue
Continuing.
[Switching to Thread 20557.20560]
 
Thread 5 hit Breakpoint 3, trx_purge_remove_log_hdr (rseg=0x7fe00709aea0, log=0x7fe00709af70, offset=581, mtr=0x7fe005d8cd80) at /mariadb/10.6/storage/innobase/trx/trx0purge.cc:359
359	                     uint16_t(offset + TRX_UNDO_HISTORY_NODE), mtr);
(rr) continue
Continuing.
[Switching to Thread 20557.20569]
 
Thread 2 hit Breakpoint 4, ha_innobase::delete_table (this=0x7fdfdc02ea88, name=0x7fdff5760720 "./test/t1") at /mariadb/10.6/storage/innobase/handler/ha_innodb.cc:13462
13462	{
(rr) watch -l $1.page.frame[16]@8
Hardware watchpoint 5: -location $1.page.frame[16]@8
(rr) reverse-continue
Continuing.
[Switching to Thread 20557.20560]
 
Thread 5 hit Breakpoint 3, trx_purge_remove_log_hdr (rseg=0x7fe00709aea0, log=0x7fe00709af70, offset=581, mtr=0x7fe005d8cd80) at /mariadb/10.6/storage/innobase/trx/trx0purge.cc:359
359	                     uint16_t(offset + TRX_UNDO_HISTORY_NODE), mtr);
(rr) reverse-continue
Continuing.
[Switching to Thread 20557.20569]
 
Thread 2 hit Hardware watchpoint 5: -location $1.page.frame[16]@8
 
Old value = "\000\000\000\000\000\000\344", <incomplete sequence \353>
New value = "\000\000\000\000\000\000\344w"
0x0000555eef393d8d in mach_write_to_4 (b=0x7fe007100014 "", n=58603) at /mariadb/10.6/storage/innobase/include/mach0data.inl:167
167		b[3] = (byte) n;
(rr) backtrace
#1  0x0000555eef393e22 in mach_write_to_8 (b=0x7fe007100010, n=58603)
    at /mariadb/10.5/storage/innobase/include/mach0data.inl:342
#2  0x0000555eef3aa173 in buf_flush_note_modification (b=0x7fe00709af70, 
    start=58543, end=58603)
    at /mariadb/10.5/storage/innobase/include/buf0flu.h:117
#6  0x0000555eef3c39b5 in mtr_t::commit (this=0x7fdff575fb50)
    at /mariadb/10.6/storage/innobase/mtr/mtr0mtr.cc:484
#7  0x0000555eef52eceb in trx_t::commit_low (this=0x7fe0078feb80, 
    mtr=0x7fdff575fb50) at /mariadb/10.6/storage/innobase/trx/trx0trx.cc:1450
#19 0x0000555eeeb30095 in mysql_parse (thd=0x7fdfdc0020c8, 
    rawbuf=0x7fdfdc02e0a0 "COMMIT", length=6, parser_state=0x7fdff5761370)
    at /mariadb/10.6/sql/sql_parse.cc:8030

This experiment shows that there was exactly one call to trx_purge_free_segment() to purge the entire log of the transaction. As expected, it did fit in a single undo log page.

The parameter log=0x7fe00709af70 of trx_purge_remove_log_hdr() is the buffer block that was returned by trx_undo_assign_low() for each row operation of our transaction. Based on the watchpoint on FIL_PAGE_LSN, we can see that purge did not modify that undo log page. FIL_PAGE_LSN must be updated every time before an exclusive latch on a modified page is released. The undo page was last touched during COMMIT when its state was changed from active to committed.

Comment by Marko Mäkelä [ 2022-09-15 ]

In the case where trx_purge_remove_log_hdr() is invoked outside of trx_purge_free_segment(), it is unclear to me how the undo page would be reused by something else. The block would be detached from the TRX_RSEG_HISTORY list but not added to anything else.

There is a field TRX_UNDO_TRX_ID that we might be able to compare to the DB_TRX_ID of the record version whose previous version we are trying to reconstruct. This field is available only in the first undo log page of a transaction (an undo log header page).

Comment by Marko Mäkelä [ 2022-09-15 ]

thiru reminded me that trx_undo_set_state_at_finish() treats short undo logs (all undo records of a transaction fit in a single undo page) specially. The state TRX_UNDO_CACHED means that trx_undo_reuse_cached() can assign the unused tail of the undo log page to an unrelated transaction later. At the end of a purge batch, trx_purge_truncate_rseg_history() would only invoke trx_purge_free_segment() if it is processing the very last undo log in a page that is no longer being "cached for reuse" but is in the state TRX_UNDO_TO_PURGE. That is when the page would really be freed.

Let me formulate the problem once more. We have a secondary index record and want to know it is orphan.

  1. If the clustered index record key that the record points to is not found, the record is orphan. This is even more severe if the record is not delete-marked.
  2. Else, if any version not older than purge_sys.end_view matches the secondary index record, we are fine.
  3. Else, the secondary index record is orphan.

The purge_sys.end_view would be an addition to the current purge_sys.view, which is updated at the start of a purge batch, in purge_sys.clone_oldest_view() invoked by trx_purge(). The purge_sys.end_view would be copied from purge_sys.view at the end of a purge batch, right before trx_purge_truncate_rseg_history() starts freeing any qualifying TRX_UNDO_TO_PURGE logs.

Not only would the purge_sys.end_view prevent the CHECK TABLE…EXTENDED from accessing potentially freed undo log pages, but it would also provide an accurate read view. Without this accurate read view, we might find a matching old version in something that has already been purged but happens to be available in a partially reused undo log page. That is, without having an accurate read view, we could fail to detect some orphan records.

The record lookups will find any records that are present in secondary indexes but missing from the clustered index.

To detect any records that are present in the clustered index but missing from secondary indexes, also CHECK TABLE…EXTENDED will have to count the records in each index, using the current read view of the transaction. In case of a count mismatch, no detailed diagnostics (such as the PRIMARY KEY of the record) can be provided.

There are no real checks for SPATIAL INDEX or FULLTEXT INDEX, and neither will be improved by this.

Comment by Marko Mäkelä [ 2022-09-22 ]

In Row_sel_get_clust_rec_for_mysql::operator() there is a comment that says that purgeable delete-marked secondary index records may become orphan if row_undo_mod_clust() removes a purgeable clustered index record during ROLLBACK. This means that before reporting orphan delete-marked secondary index records we must check that the PAGE_MAX_TRX_ID of the secondary index leaf page is visible in purge_sys.end_view.

Comment by Marko Mäkelä [ 2022-09-29 ]

The extended check found its first bug: MDEV-29666.

Comment by Marko Mäkelä [ 2022-09-29 ]

After MDEV-29666 is worked around, gcol.innodb_virtual_purge would pass the extended check, but a number of other tests would still report some orphan records. Most notably, the following (simplified from innodb_zip.bug56680) is causing trouble, because the EXTENDED check fails to count the secondary index record. In the secondary index, we only have one record, because 'X'='x' in the collation.

-- source include/innodb_page_size_small.inc
 
CREATE TABLE bug56680(a INT PRIMARY KEY, b CHAR(1), c INT, INDEX(b))
ENGINE=InnoDB STATS_PERSISTENT=0;
 
INSERT INTO bug56680 VALUES(1,'x',1);
 
connect (con1,localhost,root,,);
BEGIN;
UPDATE bug56680 SET b='X';
 
connection default;
CHECK TABLE bug56680 EXTENDED;
 
connection con1;
ROLLBACK;
SELECT b FROM bug56680 LIMIT 2;
disconnect con1;
 
connection default;
 
DROP TABLE bug56680;

Also the following tests would fail the extended check:

innodb.innodb-index-online
parts.partition_alter1_1_2_innodb
parts.partition_alter1_1_innodb
parts.partition_alter2_1_1_innodb
parts.partition_alter2_2_1_innodb
parts.partition_alter2_2_2_innodb
parts.partition_basic_innodb

Some of the partitioning tests occasionally pass, because there is no synchronization with the purge of committed transaction history.

Comment by Marko Mäkelä [ 2022-09-30 ]

For innodb.innodb-index-online, the problem was that my EXTENDED check was using the wrong read view. It should have used purge_sys.end_view instead of purge_sys.view.

The partitioning tests seem to show genuine trouble: some delete-marked secondary index records are not being purged in the purge coordinator batch that was supposed to do it. Those tests fail even when run with innodb_purge_threads=1. That would point towards MDEV-29666 or a similar bug.

Comment by Marko Mäkelä [ 2022-10-03 ]

The partition tests stop failing if I make trx_purge_attach_undo_recs() ignore the parameter innodb_purge_batch_size and copy all available undo log records to a purge batch.

There is a comment about that also in trx_purge_truncate_history(). It looks like trx_purge() has to be refactored so that purge_sys.end_view will not be updated and undo logs will not be truncated before everything has been processed.

Comment by Marko Mäkelä [ 2022-10-05 ]

The following test (reconstructed and simplified from an rr replay trace) would occasionally fail in the current development branch:

--source include/have_innodb.inc
 
CREATE TABLE t1
(a TINYINT UNSIGNED PRIMARY KEY, b TINYINT UNSIGNED NOT NULL, KEY(b))
ENGINE=InnoDB;
 
connect (stop_purge,localhost,root);
START TRANSACTION WITH CONSISTENT SNAPSHOT;
 
connection default;
INSERT INTO t1 VALUES (5,6);
DELETE FROM t1;
INSERT INTO t1 VALUES (5,5);
 
START TRANSACTION WITH CONSISTENT SNAPSHOT;
 
disconnect stop_purge;
 
DELETE FROM t1 WHERE a=5;
ROLLBACK;
 
#SET @saved_frequency = @@GLOBAL.innodb_purge_rseg_truncate_frequency;
#SET GLOBAL innodb_purge_rseg_truncate_frequency = 1;
#SET GLOBAL innodb_max_purge_lag_wait=0;
#SET GLOBAL innodb_purge_rseg_truncate_frequency = @saved_frequency;
 
CHECK TABLE t1 EXTENDED;
SELECT * FROM t1;
DROP TABLE t1;

If the wait for purge is enabled (the comment marks removed), then the test will not fail. The problem is in row_undo_mod_clust(): it must refer to the new purge_sys.end_view instead of purge_sys.view when deciding whether to discard the history for a clustered index record.

Comment by Marko Mäkelä [ 2022-10-06 ]

The extended check would still occasionally fail in the already mentioned partitioning tests. Even if I disable the innodb_purge_batch_size limit in trx_purge_attach_undo_recs(), tests will occasionally fail like this:

bb-10.6-MDEV-24402 db0233c61cd51515e5dddc4f6ea86dcf1bf3394b

CURRENT_TEST: parts.partition_basic_innodb
--- /mariadb/10.6/mysql-test/suite/parts/r/partition_basic_innodb.result	2022-10-03 15:21:16.059907120 +0300
+++ /mariadb/10.6/mysql-test/suite/parts/r/partition_basic_innodb.reject	2022-10-06 15:46:20.987564284 +0300
@@ -15767,6 +15767,7 @@
 test.t1	analyze	status	OK
 CHECK    TABLE t1 EXTENDED;
 Table	Op	Msg_type	Msg_text
+test.t1	check	Warning	InnoDB: Unpurged clustered index record
 test.t1	check	status	OK
 CHECKSUM TABLE t1 EXTENDED;
 Table	Checksum

I can reproduce it under rr replay.

Comment by Marko Mäkelä [ 2022-10-07 ]

On 2022-03-29 I wrote:

Possibly, we will have to accept that CHECK TABLE is nondeterministic and may find more orphan records after purge has completed.

This is indeed the case. My attempted work-around in row_undo_mod_clust() would make matters significantly worse by causing ROLLBACK to leave behind genuinely orphan records in the clustered index.

What seems to work well is to leave the ROLLBACK alone and only report orphan secondary index if all history up to the PAGE_MAX_TRX_ID should have been fully purged. That type of a check would still catch MDEV-29666 if I revert the fix:

diff --git a/storage/innobase/row/row0purge.cc b/storage/innobase/row/row0purge.cc
index 434ea7cfd20..1944b469d4e 100644
--- a/storage/innobase/row/row0purge.cc
+++ b/storage/innobase/row/row0purge.cc
@@ -991,12 +991,14 @@ static byte *row_purge_get_partial(const byte *ptr, const dict_index_t &index,
          REC_ANTELOPE_MAX_INDEX_COL_LEN + BTR_EXTERN_FIELD_REF_SIZE);
   }
 
+#if 0
   for (ulint i= 0; i < index.n_uniq; i++)
   {
     dfield_t &field= node->row->fields[index.fields[i].col->ind];
     if (field.type.mtype == DATA_MISSING)
       field= node->ref->fields[i];
   }
+#endif
 
   return const_cast<byte*>(ptr);
 }

CURRENT_TEST: gcol.innodb_virtual_purge
mysqltest: At line 127: query 'SELECT b1 FROM t1' failed: ER_INDEX_CORRUPT (1712): Index t1 is corrupted

In the server error log, each orphan record will be reported:

2022-10-07 16:00:14 4 [ERROR] InnoDB: Clustered index record not found for index `b1` of table `test`.`t1`: COMPACT RECORD(info_bits=32, 2 fields): {[4]    (0x80000000),[6]      (0x000000000200)}
2022-10-07 16:00:14 4 [ERROR] InnoDB: Clustered index record not found for index `b1` of table `test`.`t1`: COMPACT RECORD(info_bits=32, 2 fields): {[4]    (0x80000009),[6]      (0x000000000203)}
2022-10-07 16:00:14 4 [ERROR] InnoDB: Flagged corruption of `b1` in table `test`.`t1` in CHECK TABLE-check index

Comment by Marko Mäkelä [ 2022-10-11 ]

The following test would occasionally display warnings in CHECK TABLE…EXTENDED:

--source include/have_innodb.inc
--source include/not_embedded.inc
 
CREATE TABLE t1(a INT PRIMARY KEY) ENGINE=InnoDB;
INSERT INTO t1 SET a=0;
--disable_query_log
SET GLOBAL innodb_flush_log_at_trx_commit=0;
let $N=768;
while ($N) {
UPDATE t1 SET a=a+1;
dec $N;
}
SET GLOBAL innodb_flush_log_at_trx_commit=1;
--enable_query_log
CREATE TABLE t2(a VARCHAR(8) PRIMARY KEY, b INT NOT NULL) ENGINE=InnoDB;
INSERT INTO t2 VALUES ("juliet",1),("lima",2),("november",3),("zulu",4);
 
INSERT INTO t2 VALUES ("infimum",0);
 
--source include/kill_mysqld.inc
--source include/start_mysqld.inc
 
CHECK TABLE t2 EXTENDED;
DROP TABLE t1,t2;

2022-10-11 11:55:56 3 [Warning] InnoDB: Clustered index record with stale history in table `test`.`t2`: …

Another variant of this message (not observed for this test) is:

2022-10-11 11:55:56 3 [Warning] InnoDB: Unpurged clustered index record in table …

Even if I disable the innodb_purge_batch_size limit in trx_purge_attach_undo_recs(), the CHECK TABLE would report warnings every now and then.

The CHECK TABLE t2 EXTENDED will stop issuing those warnings only if I add a slow shutdown after the start_mysqld.inc:

SET GLOBAL innodb_fast_shutdown=0;
--source include/restart_mysqld.inc

The servers were restarted 0 times
Spent 16897.207 of 313 seconds executing testcases
 
Completed: All 6000 tests were successful.

Due to the way how purge works across server restart, I do not think that it is feasible to suppress these warnings. I did try invoking end_view.clamp_low_limit_id() in trx_lists_init_at_db_start(), similar to how purge_sys_t::clone_end_view() does it. I also tried marking the initial purge_sys.end_view as ‘forged’ and checking this status before issuing these warnings, but it would only suppress the warnings until the first call of purge_sys.clone_end_view(), while we would want to suppress the warning until all history had been purged.

Comment by Marko Mäkelä [ 2022-10-13 ]

The trouble that I previously observed with some partitioning tests is also observable with the following nondeterministic test that I extracted and simplified from an rr replay trace:

--source include/have_innodb.inc
 
CREATE TABLE t1 (a INT PRIMARY KEY, b INT NOT NULL, INDEX(b)) ENGINE=InnoDB;
BEGIN;
INSERT INTO t1 VALUES(5,4);
DELETE FROM t1 WHERE a=5;
COMMIT;
BEGIN;
INSERT INTO t1 VALUES(5,5);
ROLLBACK;
 
CHECK TABLE t1 EXTENDED;
SELECT COUNT(*) FROM t1;
DROP TABLE t1;

In the ROLLBACK, it is possible that purge_sys.view has been advanced to the end, but the committed delete-marked secondary index record (b,a)=(4,5) for the first INSERT has not been purged. In this case, the ROLLBACK would delete the records (5,5) in both indexes. The CHECK TABLE would report a failure because purge had not run yet. The secondary index record would eventually be purged. This can be verified by adding the following lines to the test before or after the ROLLBACK, to wait for the purge of history to finish:

SET @save_freq=@@GLOBAL.innodb_purge_rseg_truncate_frequency;
SET GLOBAL innodb_purge_rseg_truncate_frequency=1;
SET GLOBAL innodb_max_purge_lag_wait=1;
SET GLOBAL innodb_purge_rseg_truncate_frequency=@save_freq;

This failure can be prevented by tweaking row_undo_mod_must_purge() so that it will refer to the new purge_sys.end_view instead of the purge_sys.view. In that way, no seemingly orphaned secondary index records will exist. This fix might also address the recovery related false alarms of my previous comment.

Comment by Marko Mäkelä [ 2022-10-13 ]

I was too quick to report success. If I tweak the row_undo_mod_must_purge(), it will cause a genuine failure like this:

2022-10-13 16:05:56 1004 [Warning] InnoDB: Unpurged clustered index record in table `test`.`t1`: COMPACT RECORD(info_bits=32, 4 fields): {[1] (0x05),[6]      (0x000000000CAA),[7]K   t( (0x4B0000017428BE),[1] (0x04)}

(This was with TINYINT UNSIGNED instead of INT.) The clustered index record is (a,DB_TRX_ID,DB_ROLL_PTR,b)=(5,0xcaa,update,4).

What seems to work is a tweak to the new CHECK TABLE…EXTENDED check so that if not everything up to the PAGE_MAX_TRX_ID of the page of the secondary index record has been purged yet, we will not report the missing record. Previously this rule had been implemented for seemingly missing older versions of the clustered index record, but not for a completely missing clustered index record.

Comment by Marko Mäkelä [ 2022-10-13 ]

I got the restart test from 2022-10-11 to work reliably (60×100 repetitions). It used to fail some 10% into the test run, even with the above mentioned fix. In trx_lists_init_at_db_start(), we must initialize purge_sys.end_view in a similar way to how purge_sys_t::clone_end_view() does it, to deal with partially completed purge batches.

Comment by Marko Mäkelä [ 2022-10-17 ]

I got an RQG grammar that would occasionally report orphan delete-marked secondary index records, similar to MDEV-29666. I analyzed an rr replay trace of such a failure. The cause is a race condition between rollback and purge. Rollback would not remove a secondary index record that was inserted on top of a committed delete-marked record, because purge had not advanced that far yet. After rollback, trx_t::commit() of the ‘empty’ transaction would invoke trx_purge_add_undo_to_history(), which unnecessarily adds the empty undo log to the purge queue. I tried disabling trx_undo_try_truncate() so that the already rolled-back undo log records would be added to the purge queue, but even with that work-around (not a proper fix), the grammar would fail. With that tweak present, several tests that involve BLOBs would fail, presumably because some BLOBs would be prematurely freed.

The error is not easily fixed. We might relax the check and ‘wave through’ cases where a matching clustered index record cannot be found at all. Other cases (where a record with a primary key is found for a newer version of the record, but not for one that matches the secondary index record) would still be flagged:

diff --git a/storage/innobase/row/row0sel.cc b/storage/innobase/row/row0sel.cc
index e1e7ed0e12a..4baa1e66376 100644
--- a/storage/innobase/row/row0sel.cc
+++ b/storage/innobase/row/row0sel.cc
@@ -6399,16 +6399,12 @@ dberr_t row_check_index(row_prebuilt_t *prebuilt, ulint *n_rows)
         if (!rec_deleted)
         {
         not_found:
-          ib::error() << "Clustered index record not found for index "
+          ib::fatal() << "Clustered index record not found for index "
                       << index->name << " of table " << index->table->name
                       << ": " << rec_offsets_print(rec, offsets);
           if (prebuilt->autoinc_error == DB_SUCCESS)
             prebuilt->autoinc_error= DB_CORRUPTION;
         }
-        else if (&view == &check_table_extended_view)
-        extended_not_found:
-          if (view.changes_visible(page_trx_id))
-            goto not_found;
       did_not_find:
         mtr.rollback_to_savepoint(savepoint);
         goto next_rec;
@@ -6587,8 +6583,9 @@ dberr_t row_check_index(row_prebuilt_t *prebuilt, ulint *n_rows)
             clust_rec version or when trx_undo_prev_version_build()
             encounters a BLOB that may have been freed according to
             purge_sys.view (not purge_sys.end_view). */
-            if (&view == &check_table_extended_view && !got_extended_match)
-              goto extended_not_found;
+            if (&view == &check_table_extended_view && !got_extended_match &&
+                view.changes_visible(page_trx_id))
+              goto not_found;
             goto did_not_find;
           }
 

Alternatively, we could issue warnings for this but not flag the secondary index as corrupted. After all, this type of corruption should only hurt performance, not cause any correctness issues.

Comment by Matthias Leich [ 2022-10-17 ]

RQG testing on some older version of origin/bb-10.6-MDEV-24402 and using the following simplified RQG grammars 
TBR-757-micro.yy
---------------------------
create_table:
    CREATE TABLE IF NOT EXISTS t1 ( non_generated_cols ) ENGINE = InnoDB ROW_FORMAT = Dynamic ;
 
insert_part:
    INSERT INTO t1 (col1,col2, col_int, col_text ) VALUES ;
 
my_int:
    { $my_int= 3 } |
    { $my_int= 5 } |
    { $my_int= 7 } ;
 
non_generated_cols:
    col1 INT PRIMARY KEY, col2 INT, col_int INTEGER , col_text TEXT ;
 
query:
    ALTER TABLE t1 ADD UNIQUE ( col_text(9) , col_int ) |
    ALTER TABLE t1 ADD UNIQUE ( col_int , col_int ) |
    CHECK TABLE t1 EXTENDED |
    DELETE FROM t1 WHERE col2 = my_int; COMMIT |
    DELETE FROM t1 WHERE col2 = my_int; COMMIT |
    insert_part ( my_int , $my_int, $my_int, NULL ); COMMIT |
    insert_part ( my_int , $my_int, $my_int, NULL ); ROLLBACK ;
 
thread1:
    query ;
 
thread1_connect:
    SET AUTOCOMMIT = 0 ;
 
thread1_init:
    create_table ;
 
thread_init:
    { sleep 30; exit 0 };
 
TBR-36-min_EXTENDED.yy
-----------------------------------------
insert_part:
    INSERT INTO t1 (col1,col2, col_text) VALUES ( 13 , 13, NULL ) ;
 
query:
    ALTER TABLE t1 ADD INDEX ( col_text(9) ) , ADD INDEX ( col1 ) |
    CHECK TABLE t1 EXTENDED |
    DELETE FROM t1 WHERE col2 = 13 OR col2 IS NULL ;
 
thread1:
    query ;
 
thread1_connect:
    ;
 
thread1_init:
    CREATE TABLE t1 ( col1 INT PRIMARY KEY, col2 INT, col_text TEXT ) ENGINE = InnoDB ROW_FORMAT = Dynamic ;
 
thread2:
    insert_part ;
 
thread2_connect:
    ;
 
thread2_init:
    ;
 
thread3:
    { sleep 30 ; exit 0 } ;
 
thread3_connect:
    ;
 
thread3_init:
    ;
 
thread_init:
    { sleep 30 ; exit 0 } ;
 
showed non rare as result of CHECK TABLE
- [ 'TBR-757', 'ERROR: ->.{1,100}check Warning InnoDB: The B-tree of index .{1,30} is corrupted.<-.' ],
- [ 'TBR-36', 'InnoDB: Index .{1,150} contains .{1,10} entries, should be .{1,10}.' ], 
  [ 'TBR-36', 'failed: 1712 , errstr: Index .{1,100} is corrupted' ],
  [ 'TBR-36', 'failed: 1712 Index .* is corrupted' ],
 
Attempts to replay these problems on
origin/bb-10.6-MDEV-24402 435729fba417b5afcf759f05f3d6a1c62d3de058 2022-10-13T16:50:59+03:00 + the patch relax_check_extended.patch.diff
were given up after a few thousand tests "passed".
 
Nevertheless the bad effect 
     [ 'TBR-757', 'ERROR: ->.{1,100}check Warning InnoDB: The B-tree of index .{1,30} is corrupted.<-.'
occurs again if using the full (== non simplified) grammar conf/mariadb/table_stress_innodb_nocopy1.yy.
The automatic RQG test simplifier is in the moment running.

Comment by Matthias Leich [ 2022-10-18 ]

The following simplified test applied to origin/bb-10.6-MDEV-24402 (see above)
TBR-36--TBR-757_2-micro.yy
------------------------------------------
__clone__1:
    { $my_int= 2 } |
    { $my_int= 6 } |
    { $my_int= 8 } ;
__clone__2:
    { $my_int= 2 } |
    { $my_int= 6 } |
    { $my_int= 8 } ;
insert_part:
    INSERT INTO t3 (col1,col2, col_int, col_string, col_text ) VALUES ;
non_generated_cols:
    col1 INT PRIMARY KEY, col2 INT, col_int INTEGER, col_string INTEGER, col_text TEXT ;
query:
    ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
    ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
    ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
    ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
    ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
    ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
    ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
    ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
    ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
    ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
    ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
    ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
    ALTER TABLE t3 ADD UNIQUE KEY ( col_int ) , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
    ALTER TABLE t3 ADD UNIQUE KEY ( col_int ) , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
    ALTER TABLE t3 ADD UNIQUE KEY ( col_int ) , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
    CHECK TABLE t3 EXTENDED |
    CHECK TABLE t3 EXTENDED |
    CHECK TABLE t3 EXTENDED |
    UPDATE t3 SET col_int = __clone__2 ORDER BY col1 DESC LIMIT 2 ; ROLLBACK |
    UPDATE t3 SET col_int = __clone__1 ORDER BY col1 DESC LIMIT 2 ; ROLLBACK |
    insert_part ( __clone__2 , $my_int, $my_int, '', '' ); ROLLBACK |
    insert_part ( __clone__1 , $my_int, $my_int, '', '' ); ROLLBACK ;
thread1:
    query ;
thread1_connect:
    ;
thread1_init:
    CREATE TABLE IF NOT EXISTS t3 ( non_generated_cols ) ENGINE = InnoDB ROW_FORMAT = Compact ;
thread_init:
    { sleep 30 ; exit 0 };
 
harvested during CHECK TABLE    check Warning InnoDB: The B-tree of index .{1,30} is corrupted.
In case of origin/bb-10.6-MDEV-24402 87e4f0dde4877090b6948c2734fbd37c860e216b 2022-10-18T10:20:02+03:00 + a patch which disables MDEV-371
the CHECK TABLE harvests now a warning like 'InnoDB: Clustered index record not found for index col_int of table test.t3 ...'.
Per Marko: This is no error.

Comment by Marko Mäkelä [ 2022-10-18 ]

Thank you, mleich. The test case corresponds to the scenario that I was analyzing yesterday, now filed as MDEV-29823. It is a ‘memory leak’ (failure to remove a secondary index that is no longer needed), but not one that can be fixed easily. Therefore, I would report it as a warning but not flag the secondary index as corrupted. MVCC reads as well as row_vers_impl_x_locked_low() will silently ignore delete-marked secondary index records for which no counterpart exists in the clustered index.

Comment by Vladislav Lesin [ 2022-10-21 ]

The logic of row_check_index() and the code around purge_sys.end_view looks correct to me.

Generated at Thu Feb 08 09:29:43 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.