Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-24402

CHECK TABLE may miss some cases of index inconsistencies

Details

    Description

      Looks as if some cases of InnoDB index inconsistencies – see e.g. MDEV-22373 – are missed by CHECK TABLE.

      InnoDB currently ignores any other CHECK TABLE attributes than QUICK (which will suppress the time-consuming checks of the upper levels of the index trees).

      Most importantly, InnoDB is ignoring the attribute EXTENDED, and is essentially only performing a MVCC COUNT(*) in each index. A verified case exists where the record counts in the indexes would match, yet the records do not match between the indexes. A true EXTENDED check should report any mismatch between the clustered index (primary key) and the secondary index records.

      Note: By the design of InnoDB MVCC (see MDEV-17598 for some discussion), secondary indexes may contain multiple copies of a record, while in the clustered index the multiple versions are formed via a singly-linked list that the undo log pages are part of.

      An extremely extended test would have to check that in any possible read view, the secondary indexes match the clustered index, instead of only checking the current read view.

      It may turn out that only a special ‘locking’ variant of CHECK TABLE would be able to detect truly any type of mismatch.

      Attachments

        Issue Links

          Activity

            marko Marko Mäkelä added a comment - - edited

            CHECK TABLE…QUICK currently essentially performs SELECT COUNT(*) in each index of the table, using concurrent read, and complains if the counts differ. MDEV-25459 was a bug in the MVCC implementation of secondary indexes that would also cause false alarms in CHECK TABLE.

            CHECK TABLE without any attributes or with any other attributes than QUICK would be the same, but additionally check all pointers in the non-leaf pages of the index trees.

            marko Marko Mäkelä added a comment - - edited CHECK TABLE…QUICK currently essentially performs SELECT COUNT(*) in each index of the table, using concurrent read, and complains if the counts differ. MDEV-25459 was a bug in the MVCC implementation of secondary indexes that would also cause false alarms in CHECK TABLE . CHECK TABLE without any attributes or with any other attributes than QUICK would be the same, but additionally check all pointers in the non-leaf pages of the index trees.

            Let us consider a table that contains a single row:

            CREATE TABLE t1(a INT PRIMARY KEY, b INT UNIQUE) ENGINE=InnoDB;
            BEGIN;
            INSERT INTO t1 VALUES(1,1);
            UPDATE t1 SET b=2;
            UPDATE t1 SET b=3;
            UPDATE t1 SET b=4;
            

            The secondary index (b,a) would contain 3 delete-marked records (1,1),(2,1),(3,1) and a record (4,1). All these will correspond to a version of the primary key record with a=1. The versions of the clustered index records are stored in a singly linked list of undo log records in undo pages, while the versions of secondary indexes are stored directly in the secondary index B-trees.

            An extended check that I have been discussed with thiru would work as follows. For every secondary index record, we would invoke something similar to row_vers_old_has_index_entry(), which is what the purge of history would invoke to determine if it is safe to remove a secondary index record. In that check, we would complain if a secondary index record is orphan, with no counterpart in the clustered index.

            There is a complication. As soon as a COMMIT is executed in the example transaction above, the purge_sys.view could be advanced to a point where only the latest version is visible, letting anyone instantly ‘think’ that all but the secondary index record (4,1) are orphan. Yet, the purge would not instantly remove any records from the secondary index. Possibly, we will have to accept that CHECK TABLE is nondeterministic and may find more orphan records after purge has completed. Or we must make some more of the purge state accessible to this proposed CHECK TABLE…EXTENDED.

            marko Marko Mäkelä added a comment - Let us consider a table that contains a single row: CREATE TABLE t1(a INT PRIMARY KEY , b INT UNIQUE ) ENGINE=InnoDB; BEGIN ; INSERT INTO t1 VALUES (1,1); UPDATE t1 SET b=2; UPDATE t1 SET b=3; UPDATE t1 SET b=4; The secondary index (b,a) would contain 3 delete-marked records (1,1),(2,1),(3,1) and a record (4,1). All these will correspond to a version of the primary key record with a=1 . The versions of the clustered index records are stored in a singly linked list of undo log records in undo pages, while the versions of secondary indexes are stored directly in the secondary index B-trees. An extended check that I have been discussed with thiru would work as follows. For every secondary index record, we would invoke something similar to row_vers_old_has_index_entry() , which is what the purge of history would invoke to determine if it is safe to remove a secondary index record. In that check, we would complain if a secondary index record is orphan, with no counterpart in the clustered index. There is a complication. As soon as a COMMIT is executed in the example transaction above, the purge_sys.view could be advanced to a point where only the latest version is visible, letting anyone instantly ‘think’ that all but the secondary index record (4,1) are orphan. Yet, the purge would not instantly remove any records from the secondary index. Possibly, we will have to accept that CHECK TABLE is nondeterministic and may find more orphan records after purge has completed. Or we must make some more of the purge state accessible to this proposed CHECK TABLE…EXTENDED .

            Disabling the purge thread could avoid DB_MISSING_HISTORY error

            thiru Thirunarayanan Balathandayuthapani added a comment - Disabling the purge thread could avoid DB_MISSING_HISTORY error

            The purge of committed transaction history would remove old secondary index record versions (which have to be delete-marked) by invoking row_purge_remove_sec_if_poss() on every affected secondary index. Only after that, it would do something to the clustered index record.

            • For TRX_UNDO_DEL_MARK_REC, row_purge_del_mark() will delete the record in each index if possible.
              1. row_purge_remove_sec_if_poss(() consults row_purge_poss_sec() and will not delete a secondary index record if a clustered index record with the same PRIMARY KEY and secondary key value was inserted later.
              2. row_purge_remove_clust_if_poss() will not delete the record if its DB_ROLL_PTR differs from the undo record (the record was re-inserted later).
            • For most other undo log record types, row_purge_upd_exist_or_extern() will
              1. invoke row_purge_remove_sec_if_poss() on indexes that were affected by an UPDATE of an indexed column
              2. reset the DB_TRX_ID in the clustered index record

            In each case, if CHECK TABLE is holding a latch on the index page of a valid secondary index record, that page latch will prevent purge from proceeding, and it should be safe to look up the clustered index record and construct earlier versions of it until a match is found. Assuming that there is no corruption, we can ignore the purge_sys.view limit that normally prevents us from accessing earlier versions.

            However, the purpose of CHECK TABLE is to check for corruption. When purge_sys.view does not allow us to access an earlier version that is identified by DB_ROLL_PTR, we must tread more carefully to detect a freed undo page. If the search for a matching clustered index record leads us to a definitely freed undo page, then the secondary index record would be an orphan one.

            It is a little unclear if we can reliably detect that purge has processed an undo page. At least in the case where trx_purge_remove_log_hdr() is invoked outside of trx_purge_free_segment(), it may be challenging to detect that the undo page has been freed by purge.

            marko Marko Mäkelä added a comment - The purge of committed transaction history would remove old secondary index record versions (which have to be delete-marked) by invoking row_purge_remove_sec_if_poss() on every affected secondary index. Only after that, it would do something to the clustered index record. For TRX_UNDO_DEL_MARK_REC , row_purge_del_mark() will delete the record in each index if possible. row_purge_remove_sec_if_poss(() consults row_purge_poss_sec() and will not delete a secondary index record if a clustered index record with the same PRIMARY KEY and secondary key value was inserted later. row_purge_remove_clust_if_poss() will not delete the record if its DB_ROLL_PTR differs from the undo record (the record was re-inserted later). For most other undo log record types, row_purge_upd_exist_or_extern() will invoke row_purge_remove_sec_if_poss() on indexes that were affected by an UPDATE of an indexed column reset the DB_TRX_ID in the clustered index record In each case, if CHECK TABLE is holding a latch on the index page of a valid secondary index record, that page latch will prevent purge from proceeding, and it should be safe to look up the clustered index record and construct earlier versions of it until a match is found. Assuming that there is no corruption, we can ignore the purge_sys.view limit that normally prevents us from accessing earlier versions. However, the purpose of CHECK TABLE is to check for corruption. When purge_sys.view does not allow us to access an earlier version that is identified by DB_ROLL_PTR , we must tread more carefully to detect a freed undo page. If the search for a matching clustered index record leads us to a definitely freed undo page, then the secondary index record would be an orphan one. It is a little unclear if we can reliably detect that purge has processed an undo page. At least in the case where trx_purge_remove_log_hdr() is invoked outside of trx_purge_free_segment() , it may be challenging to detect that the undo page has been freed by purge.

            To investigate what happens during the purge, I ran the following test case:

            --source include/have_innodb.inc
            CREATE TABLE t1(a INT PRIMARY KEY, b INT UNIQUE) ENGINE=InnoDB
            STATS_PERSISTENT=0;
            SET GLOBAL innodb_max_purge_lag_wait=0;
            BEGIN;
            INSERT INTO t1 VALUES(1,1);
            UPDATE t1 SET b=2;
            UPDATE t1 SET b=3;
            UPDATE t1 SET b=4;
            COMMIT;
            SET GLOBAL innodb_max_purge_lag_wait=0;
            DROP TABLE t1;
            

            ./mtr --rr --mysqld=--innodb-purge-threads=1 name_of_above_test
            rr replay var/log/mysqld.1.rr/latest-trace
            

            break ha_innobase::update_row
            ignore 1 2
            continue
            break trx_undo_assign_low
            break trx_purge_remove_log_hdr
            break ha_innobase::delete_table
            continue

            10.6 fd0bdd3180a7d5f4b9804d372d6a63b6a202818c

            (rr) continue
            Continuing.
             
            Thread 2 hit Breakpoint 2, trx_undo_assign_low (trx=0x7fe0078feb80, rseg=0x555ef01c2b80 <trx_sys+17024>, undo=0x7fe0078ff4d0, err=0x7fdff575ef18, mtr=0x7fdff575ef60) at /mariadb/10.6/storage/innobase/trx/trx0undo.cc:1400
            1400		ut_d(const bool	is_temp = rseg == trx->rsegs.m_noredo.rseg);
            (rr) finish
            Run till exit from #0  trx_undo_assign_low (trx=0x7fe0078feb80, 
                rseg=0x555ef01c2b80 <trx_sys+17024>, undo=0x7fe0078ff4d0, 
                err=0x7fdff575ef18, mtr=0x7fdff575ef60)
                at /mariadb/10.6/storage/innobase/trx/trx0undo.cc:1400
            0x0000555eef51a87c in trx_undo_report_row_operation (thr=0x7fdfdc029bc8, index=0x7fdfdc19eee0, clust_entry=0x0, update=0x7fdfdc0299f8, cmpl_info=0, rec=0x7fe00756c07e "\200", offsets=0x7fdff575fda0, roll_ptr=0x7fdff575f510) at /mariadb/10.6/storage/innobase/trx/trx0rec.cc:2066
            2066		buf_block_t*	undo_block = trx_undo_assign_low(trx, rseg, pundo,
            Value returned is $1 = (buf_block_t *) 0x7fe00709af70
            (rr) p $1.page.id_.m_id
            $2 = 308
            (rr) continue
            Continuing.
            [Switching to Thread 20557.20560]
             
            Thread 5 hit Breakpoint 3, trx_purge_remove_log_hdr (rseg=0x7fe00709aea0, log=0x7fe00709af70, offset=581, mtr=0x7fe005d8cd80) at /mariadb/10.6/storage/innobase/trx/trx0purge.cc:359
            359	                     uint16_t(offset + TRX_UNDO_HISTORY_NODE), mtr);
            (rr) continue
            Continuing.
            [Switching to Thread 20557.20569]
             
            Thread 2 hit Breakpoint 4, ha_innobase::delete_table (this=0x7fdfdc02ea88, name=0x7fdff5760720 "./test/t1") at /mariadb/10.6/storage/innobase/handler/ha_innodb.cc:13462
            13462	{
            (rr) watch -l $1.page.frame[16]@8
            Hardware watchpoint 5: -location $1.page.frame[16]@8
            (rr) reverse-continue
            Continuing.
            [Switching to Thread 20557.20560]
             
            Thread 5 hit Breakpoint 3, trx_purge_remove_log_hdr (rseg=0x7fe00709aea0, log=0x7fe00709af70, offset=581, mtr=0x7fe005d8cd80) at /mariadb/10.6/storage/innobase/trx/trx0purge.cc:359
            359	                     uint16_t(offset + TRX_UNDO_HISTORY_NODE), mtr);
            (rr) reverse-continue
            Continuing.
            [Switching to Thread 20557.20569]
             
            Thread 2 hit Hardware watchpoint 5: -location $1.page.frame[16]@8
             
            Old value = "\000\000\000\000\000\000\344", <incomplete sequence \353>
            New value = "\000\000\000\000\000\000\344w"
            0x0000555eef393d8d in mach_write_to_4 (b=0x7fe007100014 "", n=58603) at /mariadb/10.6/storage/innobase/include/mach0data.inl:167
            167		b[3] = (byte) n;
            (rr) backtrace
            #1  0x0000555eef393e22 in mach_write_to_8 (b=0x7fe007100010, n=58603)
                at /mariadb/10.5/storage/innobase/include/mach0data.inl:342
            #2  0x0000555eef3aa173 in buf_flush_note_modification (b=0x7fe00709af70, 
                start=58543, end=58603)
                at /mariadb/10.5/storage/innobase/include/buf0flu.h:117
            #6  0x0000555eef3c39b5 in mtr_t::commit (this=0x7fdff575fb50)
                at /mariadb/10.6/storage/innobase/mtr/mtr0mtr.cc:484
            #7  0x0000555eef52eceb in trx_t::commit_low (this=0x7fe0078feb80, 
                mtr=0x7fdff575fb50) at /mariadb/10.6/storage/innobase/trx/trx0trx.cc:1450
            #19 0x0000555eeeb30095 in mysql_parse (thd=0x7fdfdc0020c8, 
                rawbuf=0x7fdfdc02e0a0 "COMMIT", length=6, parser_state=0x7fdff5761370)
                at /mariadb/10.6/sql/sql_parse.cc:8030
            

            This experiment shows that there was exactly one call to trx_purge_free_segment() to purge the entire log of the transaction. As expected, it did fit in a single undo log page.

            The parameter log=0x7fe00709af70 of trx_purge_remove_log_hdr() is the buffer block that was returned by trx_undo_assign_low() for each row operation of our transaction. Based on the watchpoint on FIL_PAGE_LSN, we can see that purge did not modify that undo log page. FIL_PAGE_LSN must be updated every time before an exclusive latch on a modified page is released. The undo page was last touched during COMMIT when its state was changed from active to committed.

            marko Marko Mäkelä added a comment - To investigate what happens during the purge, I ran the following test case: --source include/have_innodb.inc CREATE TABLE t1(a INT PRIMARY KEY , b INT UNIQUE ) ENGINE=InnoDB STATS_PERSISTENT=0; SET GLOBAL innodb_max_purge_lag_wait=0; BEGIN ; INSERT INTO t1 VALUES (1,1); UPDATE t1 SET b=2; UPDATE t1 SET b=3; UPDATE t1 SET b=4; COMMIT ; SET GLOBAL innodb_max_purge_lag_wait=0; DROP TABLE t1; ./mtr --rr --mysqld=--innodb-purge-threads=1 name_of_above_test rr replay var/log/mysqld.1.rr/latest-trace break ha_innobase::update_row ignore 1 2 continue break trx_undo_assign_low break trx_purge_remove_log_hdr break ha_innobase::delete_table continue … 10.6 fd0bdd3180a7d5f4b9804d372d6a63b6a202818c (rr) continue Continuing.   Thread 2 hit Breakpoint 2, trx_undo_assign_low (trx=0x7fe0078feb80, rseg=0x555ef01c2b80 <trx_sys+17024>, undo=0x7fe0078ff4d0, err=0x7fdff575ef18, mtr=0x7fdff575ef60) at /mariadb/10.6/storage/innobase/trx/trx0undo.cc:1400 1400 ut_d(const bool is_temp = rseg == trx->rsegs.m_noredo.rseg); (rr) finish Run till exit from #0 trx_undo_assign_low (trx=0x7fe0078feb80, rseg=0x555ef01c2b80 <trx_sys+17024>, undo=0x7fe0078ff4d0, err=0x7fdff575ef18, mtr=0x7fdff575ef60) at /mariadb/10.6/storage/innobase/trx/trx0undo.cc:1400 0x0000555eef51a87c in trx_undo_report_row_operation (thr=0x7fdfdc029bc8, index=0x7fdfdc19eee0, clust_entry=0x0, update=0x7fdfdc0299f8, cmpl_info=0, rec=0x7fe00756c07e "\200", offsets=0x7fdff575fda0, roll_ptr=0x7fdff575f510) at /mariadb/10.6/storage/innobase/trx/trx0rec.cc:2066 2066 buf_block_t* undo_block = trx_undo_assign_low(trx, rseg, pundo, Value returned is $1 = (buf_block_t *) 0x7fe00709af70 (rr) p $1.page.id_.m_id $2 = 308 (rr) continue Continuing. [Switching to Thread 20557.20560]   Thread 5 hit Breakpoint 3, trx_purge_remove_log_hdr (rseg=0x7fe00709aea0, log=0x7fe00709af70, offset=581, mtr=0x7fe005d8cd80) at /mariadb/10.6/storage/innobase/trx/trx0purge.cc:359 359 uint16_t(offset + TRX_UNDO_HISTORY_NODE), mtr); (rr) continue Continuing. [Switching to Thread 20557.20569]   Thread 2 hit Breakpoint 4, ha_innobase::delete_table (this=0x7fdfdc02ea88, name=0x7fdff5760720 "./test/t1") at /mariadb/10.6/storage/innobase/handler/ha_innodb.cc:13462 13462 { (rr) watch -l $1.page.frame[16]@8 Hardware watchpoint 5: -location $1.page.frame[16]@8 (rr) reverse-continue Continuing. [Switching to Thread 20557.20560]   Thread 5 hit Breakpoint 3, trx_purge_remove_log_hdr (rseg=0x7fe00709aea0, log=0x7fe00709af70, offset=581, mtr=0x7fe005d8cd80) at /mariadb/10.6/storage/innobase/trx/trx0purge.cc:359 359 uint16_t(offset + TRX_UNDO_HISTORY_NODE), mtr); (rr) reverse-continue Continuing. [Switching to Thread 20557.20569]   Thread 2 hit Hardware watchpoint 5: -location $1.page.frame[16]@8   Old value = "\000\000\000\000\000\000\344", <incomplete sequence \353> New value = "\000\000\000\000\000\000\344w" 0x0000555eef393d8d in mach_write_to_4 (b=0x7fe007100014 "", n=58603) at /mariadb/10.6/storage/innobase/include/mach0data.inl:167 167 b[3] = (byte) n; (rr) backtrace … #1 0x0000555eef393e22 in mach_write_to_8 (b=0x7fe007100010, n=58603) at /mariadb/10.5/storage/innobase/include/mach0data.inl:342 #2 0x0000555eef3aa173 in buf_flush_note_modification (b=0x7fe00709af70, start=58543, end=58603) at /mariadb/10.5/storage/innobase/include/buf0flu.h:117 … #6 0x0000555eef3c39b5 in mtr_t::commit (this=0x7fdff575fb50) at /mariadb/10.6/storage/innobase/mtr/mtr0mtr.cc:484 #7 0x0000555eef52eceb in trx_t::commit_low (this=0x7fe0078feb80, mtr=0x7fdff575fb50) at /mariadb/10.6/storage/innobase/trx/trx0trx.cc:1450 … #19 0x0000555eeeb30095 in mysql_parse (thd=0x7fdfdc0020c8, rawbuf=0x7fdfdc02e0a0 "COMMIT", length=6, parser_state=0x7fdff5761370) at /mariadb/10.6/sql/sql_parse.cc:8030 This experiment shows that there was exactly one call to trx_purge_free_segment() to purge the entire log of the transaction. As expected, it did fit in a single undo log page. The parameter log=0x7fe00709af70 of trx_purge_remove_log_hdr() is the buffer block that was returned by trx_undo_assign_low() for each row operation of our transaction. Based on the watchpoint on FIL_PAGE_LSN , we can see that purge did not modify that undo log page. FIL_PAGE_LSN must be updated every time before an exclusive latch on a modified page is released. The undo page was last touched during COMMIT when its state was changed from active to committed.

            In the case where trx_purge_remove_log_hdr() is invoked outside of trx_purge_free_segment(), it is unclear to me how the undo page would be reused by something else. The block would be detached from the TRX_RSEG_HISTORY list but not added to anything else.

            There is a field TRX_UNDO_TRX_ID that we might be able to compare to the DB_TRX_ID of the record version whose previous version we are trying to reconstruct. This field is available only in the first undo log page of a transaction (an undo log header page).

            marko Marko Mäkelä added a comment - In the case where trx_purge_remove_log_hdr() is invoked outside of trx_purge_free_segment() , it is unclear to me how the undo page would be reused by something else. The block would be detached from the TRX_RSEG_HISTORY list but not added to anything else. There is a field TRX_UNDO_TRX_ID that we might be able to compare to the DB_TRX_ID of the record version whose previous version we are trying to reconstruct. This field is available only in the first undo log page of a transaction (an undo log header page).

            thiru reminded me that trx_undo_set_state_at_finish() treats short undo logs (all undo records of a transaction fit in a single undo page) specially. The state TRX_UNDO_CACHED means that trx_undo_reuse_cached() can assign the unused tail of the undo log page to an unrelated transaction later. At the end of a purge batch, trx_purge_truncate_rseg_history() would only invoke trx_purge_free_segment() if it is processing the very last undo log in a page that is no longer being "cached for reuse" but is in the state TRX_UNDO_TO_PURGE. That is when the page would really be freed.

            Let me formulate the problem once more. We have a secondary index record and want to know it is orphan.

            1. If the clustered index record key that the record points to is not found, the record is orphan. This is even more severe if the record is not delete-marked.
            2. Else, if any version not older than purge_sys.end_view matches the secondary index record, we are fine.
            3. Else, the secondary index record is orphan.

            The purge_sys.end_view would be an addition to the current purge_sys.view, which is updated at the start of a purge batch, in purge_sys.clone_oldest_view() invoked by trx_purge(). The purge_sys.end_view would be copied from purge_sys.view at the end of a purge batch, right before trx_purge_truncate_rseg_history() starts freeing any qualifying TRX_UNDO_TO_PURGE logs.

            Not only would the purge_sys.end_view prevent the CHECK TABLE…EXTENDED from accessing potentially freed undo log pages, but it would also provide an accurate read view. Without this accurate read view, we might find a matching old version in something that has already been purged but happens to be available in a partially reused undo log page. That is, without having an accurate read view, we could fail to detect some orphan records.

            The record lookups will find any records that are present in secondary indexes but missing from the clustered index.

            To detect any records that are present in the clustered index but missing from secondary indexes, also CHECK TABLE…EXTENDED will have to count the records in each index, using the current read view of the transaction. In case of a count mismatch, no detailed diagnostics (such as the PRIMARY KEY of the record) can be provided.

            There are no real checks for SPATIAL INDEX or FULLTEXT INDEX, and neither will be improved by this.

            marko Marko Mäkelä added a comment - thiru reminded me that trx_undo_set_state_at_finish() treats short undo logs (all undo records of a transaction fit in a single undo page) specially. The state TRX_UNDO_CACHED means that trx_undo_reuse_cached() can assign the unused tail of the undo log page to an unrelated transaction later. At the end of a purge batch, trx_purge_truncate_rseg_history() would only invoke trx_purge_free_segment() if it is processing the very last undo log in a page that is no longer being "cached for reuse" but is in the state TRX_UNDO_TO_PURGE . That is when the page would really be freed. Let me formulate the problem once more. We have a secondary index record and want to know it is orphan. If the clustered index record key that the record points to is not found, the record is orphan. This is even more severe if the record is not delete-marked. Else, if any version not older than purge_sys.end_view matches the secondary index record, we are fine. Else, the secondary index record is orphan. The purge_sys.end_view would be an addition to the current purge_sys.view , which is updated at the start of a purge batch, in purge_sys.clone_oldest_view() invoked by trx_purge() . The purge_sys.end_view would be copied from purge_sys.view at the end of a purge batch, right before trx_purge_truncate_rseg_history() starts freeing any qualifying TRX_UNDO_TO_PURGE logs. Not only would the purge_sys.end_view prevent the CHECK TABLE…EXTENDED from accessing potentially freed undo log pages, but it would also provide an accurate read view. Without this accurate read view, we might find a matching old version in something that has already been purged but happens to be available in a partially reused undo log page. That is, without having an accurate read view, we could fail to detect some orphan records. The record lookups will find any records that are present in secondary indexes but missing from the clustered index. To detect any records that are present in the clustered index but missing from secondary indexes, also CHECK TABLE…EXTENDED will have to count the records in each index, using the current read view of the transaction. In case of a count mismatch, no detailed diagnostics (such as the PRIMARY KEY of the record) can be provided. There are no real checks for SPATIAL INDEX or FULLTEXT INDEX , and neither will be improved by this.

            In Row_sel_get_clust_rec_for_mysql::operator() there is a comment that says that purgeable delete-marked secondary index records may become orphan if row_undo_mod_clust() removes a purgeable clustered index record during ROLLBACK. This means that before reporting orphan delete-marked secondary index records we must check that the PAGE_MAX_TRX_ID of the secondary index leaf page is visible in purge_sys.end_view.

            marko Marko Mäkelä added a comment - In Row_sel_get_clust_rec_for_mysql::operator() there is a comment that says that purgeable delete-marked secondary index records may become orphan if row_undo_mod_clust() removes a purgeable clustered index record during ROLLBACK . This means that before reporting orphan delete-marked secondary index records we must check that the PAGE_MAX_TRX_ID of the secondary index leaf page is visible in purge_sys.end_view .

            The extended check found its first bug: MDEV-29666.

            marko Marko Mäkelä added a comment - The extended check found its first bug: MDEV-29666 .

            After MDEV-29666 is worked around, gcol.innodb_virtual_purge would pass the extended check, but a number of other tests would still report some orphan records. Most notably, the following (simplified from innodb_zip.bug56680) is causing trouble, because the EXTENDED check fails to count the secondary index record. In the secondary index, we only have one record, because 'X'='x' in the collation.

            -- source include/innodb_page_size_small.inc
             
            CREATE TABLE bug56680(a INT PRIMARY KEY, b CHAR(1), c INT, INDEX(b))
            ENGINE=InnoDB STATS_PERSISTENT=0;
             
            INSERT INTO bug56680 VALUES(1,'x',1);
             
            connect (con1,localhost,root,,);
            BEGIN;
            UPDATE bug56680 SET b='X';
             
            connection default;
            CHECK TABLE bug56680 EXTENDED;
             
            connection con1;
            ROLLBACK;
            SELECT b FROM bug56680 LIMIT 2;
            disconnect con1;
             
            connection default;
             
            DROP TABLE bug56680;
            

            Also the following tests would fail the extended check:

            innodb.innodb-index-online
            parts.partition_alter1_1_2_innodb
            parts.partition_alter1_1_innodb
            parts.partition_alter2_1_1_innodb
            parts.partition_alter2_2_1_innodb
            parts.partition_alter2_2_2_innodb
            parts.partition_basic_innodb
            

            Some of the partitioning tests occasionally pass, because there is no synchronization with the purge of committed transaction history.

            marko Marko Mäkelä added a comment - After MDEV-29666 is worked around, gcol.innodb_virtual_purge would pass the extended check, but a number of other tests would still report some orphan records. Most notably, the following (simplified from innodb_zip.bug56680 ) is causing trouble, because the EXTENDED check fails to count the secondary index record. In the secondary index, we only have one record, because 'X'='x' in the collation. -- source include/innodb_page_size_small.inc   CREATE TABLE bug56680(a INT PRIMARY KEY , b CHAR (1), c INT , INDEX (b)) ENGINE=InnoDB STATS_PERSISTENT=0;   INSERT INTO bug56680 VALUES (1, 'x' ,1);   connect (con1,localhost,root,,); BEGIN ; UPDATE bug56680 SET b= 'X' ;   connection default ; CHECK TABLE bug56680 EXTENDED;   connection con1; ROLLBACK ; SELECT b FROM bug56680 LIMIT 2; disconnect con1;   connection default ;   DROP TABLE bug56680; Also the following tests would fail the extended check: innodb.innodb-index-online parts.partition_alter1_1_2_innodb parts.partition_alter1_1_innodb parts.partition_alter2_1_1_innodb parts.partition_alter2_2_1_innodb parts.partition_alter2_2_2_innodb parts.partition_basic_innodb Some of the partitioning tests occasionally pass, because there is no synchronization with the purge of committed transaction history.

            For innodb.innodb-index-online, the problem was that my EXTENDED check was using the wrong read view. It should have used purge_sys.end_view instead of purge_sys.view.

            The partitioning tests seem to show genuine trouble: some delete-marked secondary index records are not being purged in the purge coordinator batch that was supposed to do it. Those tests fail even when run with innodb_purge_threads=1. That would point towards MDEV-29666 or a similar bug.

            marko Marko Mäkelä added a comment - For innodb.innodb-index-online , the problem was that my EXTENDED check was using the wrong read view. It should have used purge_sys.end_view instead of purge_sys.view . The partitioning tests seem to show genuine trouble: some delete-marked secondary index records are not being purged in the purge coordinator batch that was supposed to do it. Those tests fail even when run with innodb_purge_threads=1 . That would point towards MDEV-29666 or a similar bug.

            The partition tests stop failing if I make trx_purge_attach_undo_recs() ignore the parameter innodb_purge_batch_size and copy all available undo log records to a purge batch.

            There is a comment about that also in trx_purge_truncate_history(). It looks like trx_purge() has to be refactored so that purge_sys.end_view will not be updated and undo logs will not be truncated before everything has been processed.

            marko Marko Mäkelä added a comment - The partition tests stop failing if I make trx_purge_attach_undo_recs() ignore the parameter innodb_purge_batch_size and copy all available undo log records to a purge batch. There is a comment about that also in trx_purge_truncate_history() . It looks like trx_purge() has to be refactored so that purge_sys.end_view will not be updated and undo logs will not be truncated before everything has been processed.

            The following test (reconstructed and simplified from an rr replay trace) would occasionally fail in the current development branch:

            --source include/have_innodb.inc
             
            CREATE TABLE t1
            (a TINYINT UNSIGNED PRIMARY KEY, b TINYINT UNSIGNED NOT NULL, KEY(b))
            ENGINE=InnoDB;
             
            connect (stop_purge,localhost,root);
            START TRANSACTION WITH CONSISTENT SNAPSHOT;
             
            connection default;
            INSERT INTO t1 VALUES (5,6);
            DELETE FROM t1;
            INSERT INTO t1 VALUES (5,5);
             
            START TRANSACTION WITH CONSISTENT SNAPSHOT;
             
            disconnect stop_purge;
             
            DELETE FROM t1 WHERE a=5;
            ROLLBACK;
             
            #SET @saved_frequency = @@GLOBAL.innodb_purge_rseg_truncate_frequency;
            #SET GLOBAL innodb_purge_rseg_truncate_frequency = 1;
            #SET GLOBAL innodb_max_purge_lag_wait=0;
            #SET GLOBAL innodb_purge_rseg_truncate_frequency = @saved_frequency;
             
            CHECK TABLE t1 EXTENDED;
            SELECT * FROM t1;
            DROP TABLE t1;
            

            If the wait for purge is enabled (the comment marks removed), then the test will not fail. The problem is in row_undo_mod_clust(): it must refer to the new purge_sys.end_view instead of purge_sys.view when deciding whether to discard the history for a clustered index record.

            marko Marko Mäkelä added a comment - The following test (reconstructed and simplified from an rr replay trace) would occasionally fail in the current development branch: --source include/have_innodb.inc   CREATE TABLE t1 (a TINYINT UNSIGNED PRIMARY KEY , b TINYINT UNSIGNED NOT NULL , KEY (b)) ENGINE=InnoDB;   connect (stop_purge,localhost,root); START TRANSACTION WITH CONSISTENT SNAPSHOT;   connection default ; INSERT INTO t1 VALUES (5,6); DELETE FROM t1; INSERT INTO t1 VALUES (5,5);   START TRANSACTION WITH CONSISTENT SNAPSHOT;   disconnect stop_purge;   DELETE FROM t1 WHERE a=5; ROLLBACK ;   # SET @saved_frequency = @@ GLOBAL .innodb_purge_rseg_truncate_frequency; # SET GLOBAL innodb_purge_rseg_truncate_frequency = 1; # SET GLOBAL innodb_max_purge_lag_wait=0; # SET GLOBAL innodb_purge_rseg_truncate_frequency = @saved_frequency;   CHECK TABLE t1 EXTENDED; SELECT * FROM t1; DROP TABLE t1; If the wait for purge is enabled (the comment marks removed), then the test will not fail. The problem is in row_undo_mod_clust() : it must refer to the new purge_sys.end_view instead of purge_sys.view when deciding whether to discard the history for a clustered index record.

            The extended check would still occasionally fail in the already mentioned partitioning tests. Even if I disable the innodb_purge_batch_size limit in trx_purge_attach_undo_recs(), tests will occasionally fail like this:

            bb-10.6-MDEV-24402 db0233c61cd51515e5dddc4f6ea86dcf1bf3394b

            CURRENT_TEST: parts.partition_basic_innodb
            --- /mariadb/10.6/mysql-test/suite/parts/r/partition_basic_innodb.result	2022-10-03 15:21:16.059907120 +0300
            +++ /mariadb/10.6/mysql-test/suite/parts/r/partition_basic_innodb.reject	2022-10-06 15:46:20.987564284 +0300
            @@ -15767,6 +15767,7 @@
             test.t1	analyze	status	OK
             CHECK    TABLE t1 EXTENDED;
             Table	Op	Msg_type	Msg_text
            +test.t1	check	Warning	InnoDB: Unpurged clustered index record
             test.t1	check	status	OK
             CHECKSUM TABLE t1 EXTENDED;
             Table	Checksum
            

            I can reproduce it under rr replay.

            marko Marko Mäkelä added a comment - The extended check would still occasionally fail in the already mentioned partitioning tests. Even if I disable the innodb_purge_batch_size limit in trx_purge_attach_undo_recs() , tests will occasionally fail like this: bb-10.6-MDEV-24402 db0233c61cd51515e5dddc4f6ea86dcf1bf3394b CURRENT_TEST: parts.partition_basic_innodb --- /mariadb/10.6/mysql-test/suite/parts/r/partition_basic_innodb.result 2022-10-03 15:21:16.059907120 +0300 +++ /mariadb/10.6/mysql-test/suite/parts/r/partition_basic_innodb.reject 2022-10-06 15:46:20.987564284 +0300 @@ -15767,6 +15767,7 @@ test.t1 analyze status OK CHECK TABLE t1 EXTENDED; Table Op Msg_type Msg_text +test.t1 check Warning InnoDB: Unpurged clustered index record test.t1 check status OK CHECKSUM TABLE t1 EXTENDED; Table Checksum I can reproduce it under rr replay .

            On 2022-03-29 I wrote:

            Possibly, we will have to accept that CHECK TABLE is nondeterministic and may find more orphan records after purge has completed.

            This is indeed the case. My attempted work-around in row_undo_mod_clust() would make matters significantly worse by causing ROLLBACK to leave behind genuinely orphan records in the clustered index.

            What seems to work well is to leave the ROLLBACK alone and only report orphan secondary index if all history up to the PAGE_MAX_TRX_ID should have been fully purged. That type of a check would still catch MDEV-29666 if I revert the fix:

            diff --git a/storage/innobase/row/row0purge.cc b/storage/innobase/row/row0purge.cc
            index 434ea7cfd20..1944b469d4e 100644
            --- a/storage/innobase/row/row0purge.cc
            +++ b/storage/innobase/row/row0purge.cc
            @@ -991,12 +991,14 @@ static byte *row_purge_get_partial(const byte *ptr, const dict_index_t &index,
                      REC_ANTELOPE_MAX_INDEX_COL_LEN + BTR_EXTERN_FIELD_REF_SIZE);
               }
             
            +#if 0
               for (ulint i= 0; i < index.n_uniq; i++)
               {
                 dfield_t &field= node->row->fields[index.fields[i].col->ind];
                 if (field.type.mtype == DATA_MISSING)
                   field= node->ref->fields[i];
               }
            +#endif
             
               return const_cast<byte*>(ptr);
             }
            

            CURRENT_TEST: gcol.innodb_virtual_purge
            mysqltest: At line 127: query 'SELECT b1 FROM t1' failed: ER_INDEX_CORRUPT (1712): Index t1 is corrupted
            

            In the server error log, each orphan record will be reported:

            2022-10-07 16:00:14 4 [ERROR] InnoDB: Clustered index record not found for index `b1` of table `test`.`t1`: COMPACT RECORD(info_bits=32, 2 fields): {[4]    (0x80000000),[6]      (0x000000000200)}
            2022-10-07 16:00:14 4 [ERROR] InnoDB: Clustered index record not found for index `b1` of table `test`.`t1`: COMPACT RECORD(info_bits=32, 2 fields): {[4]    (0x80000009),[6]      (0x000000000203)}
            2022-10-07 16:00:14 4 [ERROR] InnoDB: Flagged corruption of `b1` in table `test`.`t1` in CHECK TABLE-check index
            

            marko Marko Mäkelä added a comment - On 2022-03-29 I wrote: Possibly, we will have to accept that CHECK TABLE is nondeterministic and may find more orphan records after purge has completed. This is indeed the case. My attempted work-around in row_undo_mod_clust() would make matters significantly worse by causing ROLLBACK to leave behind genuinely orphan records in the clustered index. What seems to work well is to leave the ROLLBACK alone and only report orphan secondary index if all history up to the PAGE_MAX_TRX_ID should have been fully purged. That type of a check would still catch MDEV-29666 if I revert the fix: diff --git a/storage/innobase/row/row0purge.cc b/storage/innobase/row/row0purge.cc index 434ea7cfd20..1944b469d4e 100644 --- a/storage/innobase/row/row0purge.cc +++ b/storage/innobase/row/row0purge.cc @@ -991,12 +991,14 @@ static byte *row_purge_get_partial(const byte *ptr, const dict_index_t &index, REC_ANTELOPE_MAX_INDEX_COL_LEN + BTR_EXTERN_FIELD_REF_SIZE); } +#if 0 for (ulint i= 0; i < index.n_uniq; i++) { dfield_t &field= node->row->fields[index.fields[i].col->ind]; if (field.type.mtype == DATA_MISSING) field= node->ref->fields[i]; } +#endif return const_cast<byte*>(ptr); } CURRENT_TEST: gcol.innodb_virtual_purge mysqltest: At line 127: query 'SELECT b1 FROM t1' failed: ER_INDEX_CORRUPT (1712): Index t1 is corrupted In the server error log, each orphan record will be reported: 2022-10-07 16:00:14 4 [ERROR] InnoDB: Clustered index record not found for index `b1` of table `test`.`t1`: COMPACT RECORD(info_bits=32, 2 fields): {[4] (0x80000000),[6] (0x000000000200)} … 2022-10-07 16:00:14 4 [ERROR] InnoDB: Clustered index record not found for index `b1` of table `test`.`t1`: COMPACT RECORD(info_bits=32, 2 fields): {[4] (0x80000009),[6] (0x000000000203)} 2022-10-07 16:00:14 4 [ERROR] InnoDB: Flagged corruption of `b1` in table `test`.`t1` in CHECK TABLE-check index

            The following test would occasionally display warnings in CHECK TABLE…EXTENDED:

            --source include/have_innodb.inc
            --source include/not_embedded.inc
             
            CREATE TABLE t1(a INT PRIMARY KEY) ENGINE=InnoDB;
            INSERT INTO t1 SET a=0;
            --disable_query_log
            SET GLOBAL innodb_flush_log_at_trx_commit=0;
            let $N=768;
            while ($N) {
            UPDATE t1 SET a=a+1;
            dec $N;
            }
            SET GLOBAL innodb_flush_log_at_trx_commit=1;
            --enable_query_log
            CREATE TABLE t2(a VARCHAR(8) PRIMARY KEY, b INT NOT NULL) ENGINE=InnoDB;
            INSERT INTO t2 VALUES ("juliet",1),("lima",2),("november",3),("zulu",4);
             
            INSERT INTO t2 VALUES ("infimum",0);
             
            --source include/kill_mysqld.inc
            --source include/start_mysqld.inc
             
            CHECK TABLE t2 EXTENDED;
            DROP TABLE t1,t2;
            

            2022-10-11 11:55:56 3 [Warning] InnoDB: Clustered index record with stale history in table `test`.`t2`: …
            

            Another variant of this message (not observed for this test) is:

            2022-10-11 11:55:56 3 [Warning] InnoDB: Unpurged clustered index record in table …
            

            Even if I disable the innodb_purge_batch_size limit in trx_purge_attach_undo_recs(), the CHECK TABLE would report warnings every now and then.

            The CHECK TABLE t2 EXTENDED will stop issuing those warnings only if I add a slow shutdown after the start_mysqld.inc:

            SET GLOBAL innodb_fast_shutdown=0;
            --source include/restart_mysqld.inc
            

            The servers were restarted 0 times
            Spent 16897.207 of 313 seconds executing testcases
             
            Completed: All 6000 tests were successful.
            

            Due to the way how purge works across server restart, I do not think that it is feasible to suppress these warnings. I did try invoking end_view.clamp_low_limit_id() in trx_lists_init_at_db_start(), similar to how purge_sys_t::clone_end_view() does it. I also tried marking the initial purge_sys.end_view as ‘forged’ and checking this status before issuing these warnings, but it would only suppress the warnings until the first call of purge_sys.clone_end_view(), while we would want to suppress the warning until all history had been purged.

            marko Marko Mäkelä added a comment - The following test would occasionally display warnings in CHECK TABLE…EXTENDED : --source include/have_innodb.inc --source include/not_embedded.inc   CREATE TABLE t1(a INT PRIMARY KEY ) ENGINE=InnoDB; INSERT INTO t1 SET a=0; --disable_query_log SET GLOBAL innodb_flush_log_at_trx_commit=0; let $N=768; while ($N) { UPDATE t1 SET a=a+1; dec $N; } SET GLOBAL innodb_flush_log_at_trx_commit=1; --enable_query_log CREATE TABLE t2(a VARCHAR (8) PRIMARY KEY , b INT NOT NULL ) ENGINE=InnoDB; INSERT INTO t2 VALUES ( "juliet" ,1),( "lima" ,2),( "november" ,3),( "zulu" ,4);   INSERT INTO t2 VALUES ( "infimum" ,0);   --source include/kill_mysqld.inc --source include/start_mysqld.inc   CHECK TABLE t2 EXTENDED; DROP TABLE t1,t2; 2022-10-11 11:55:56 3 [Warning] InnoDB: Clustered index record with stale history in table `test`.`t2`: … Another variant of this message (not observed for this test) is: 2022-10-11 11:55:56 3 [Warning] InnoDB: Unpurged clustered index record in table … Even if I disable the innodb_purge_batch_size limit in trx_purge_attach_undo_recs() , the CHECK TABLE would report warnings every now and then. The CHECK TABLE t2 EXTENDED will stop issuing those warnings only if I add a slow shutdown after the start_mysqld.inc : SET GLOBAL innodb_fast_shutdown=0; --source include/restart_mysqld.inc The servers were restarted 0 times Spent 16897.207 of 313 seconds executing testcases   Completed: All 6000 tests were successful. Due to the way how purge works across server restart, I do not think that it is feasible to suppress these warnings. I did try invoking end_view.clamp_low_limit_id() in trx_lists_init_at_db_start() , similar to how purge_sys_t::clone_end_view() does it. I also tried marking the initial purge_sys.end_view as ‘forged’ and checking this status before issuing these warnings, but it would only suppress the warnings until the first call of purge_sys.clone_end_view() , while we would want to suppress the warning until all history had been purged.

            The trouble that I previously observed with some partitioning tests is also observable with the following nondeterministic test that I extracted and simplified from an rr replay trace:

            --source include/have_innodb.inc
             
            CREATE TABLE t1 (a INT PRIMARY KEY, b INT NOT NULL, INDEX(b)) ENGINE=InnoDB;
            BEGIN;
            INSERT INTO t1 VALUES(5,4);
            DELETE FROM t1 WHERE a=5;
            COMMIT;
            BEGIN;
            INSERT INTO t1 VALUES(5,5);
            ROLLBACK;
             
            CHECK TABLE t1 EXTENDED;
            SELECT COUNT(*) FROM t1;
            DROP TABLE t1;
            

            In the ROLLBACK, it is possible that purge_sys.view has been advanced to the end, but the committed delete-marked secondary index record (b,a)=(4,5) for the first INSERT has not been purged. In this case, the ROLLBACK would delete the records (5,5) in both indexes. The CHECK TABLE would report a failure because purge had not run yet. The secondary index record would eventually be purged. This can be verified by adding the following lines to the test before or after the ROLLBACK, to wait for the purge of history to finish:

            SET @save_freq=@@GLOBAL.innodb_purge_rseg_truncate_frequency;
            SET GLOBAL innodb_purge_rseg_truncate_frequency=1;
            SET GLOBAL innodb_max_purge_lag_wait=1;
            SET GLOBAL innodb_purge_rseg_truncate_frequency=@save_freq;
            

            This failure can be prevented by tweaking row_undo_mod_must_purge() so that it will refer to the new purge_sys.end_view instead of the purge_sys.view. In that way, no seemingly orphaned secondary index records will exist. This fix might also address the recovery related false alarms of my previous comment.

            marko Marko Mäkelä added a comment - The trouble that I previously observed with some partitioning tests is also observable with the following nondeterministic test that I extracted and simplified from an rr replay trace: --source include/have_innodb.inc   CREATE TABLE t1 (a INT PRIMARY KEY , b INT NOT NULL , INDEX (b)) ENGINE=InnoDB; BEGIN ; INSERT INTO t1 VALUES (5,4); DELETE FROM t1 WHERE a=5; COMMIT ; BEGIN ; INSERT INTO t1 VALUES (5,5); ROLLBACK ;   CHECK TABLE t1 EXTENDED; SELECT COUNT (*) FROM t1; DROP TABLE t1; In the ROLLBACK , it is possible that purge_sys.view has been advanced to the end, but the committed delete-marked secondary index record (b,a)=(4,5) for the first INSERT has not been purged. In this case, the ROLLBACK would delete the records (5,5) in both indexes. The CHECK TABLE would report a failure because purge had not run yet. The secondary index record would eventually be purged. This can be verified by adding the following lines to the test before or after the ROLLBACK , to wait for the purge of history to finish: SET @save_freq=@@ GLOBAL .innodb_purge_rseg_truncate_frequency; SET GLOBAL innodb_purge_rseg_truncate_frequency=1; SET GLOBAL innodb_max_purge_lag_wait=1; SET GLOBAL innodb_purge_rseg_truncate_frequency=@save_freq; This failure can be prevented by tweaking row_undo_mod_must_purge() so that it will refer to the new purge_sys.end_view instead of the purge_sys.view . In that way, no seemingly orphaned secondary index records will exist. This fix might also address the recovery related false alarms of my previous comment.

            I was too quick to report success. If I tweak the row_undo_mod_must_purge(), it will cause a genuine failure like this:

            2022-10-13 16:05:56 1004 [Warning] InnoDB: Unpurged clustered index record in table `test`.`t1`: COMPACT RECORD(info_bits=32, 4 fields): {[1] (0x05),[6]      (0x000000000CAA),[7]K   t( (0x4B0000017428BE),[1] (0x04)}
            

            (This was with TINYINT UNSIGNED instead of INT.) The clustered index record is (a,DB_TRX_ID,DB_ROLL_PTR,b)=(5,0xcaa,update,4).

            What seems to work is a tweak to the new CHECK TABLE…EXTENDED check so that if not everything up to the PAGE_MAX_TRX_ID of the page of the secondary index record has been purged yet, we will not report the missing record. Previously this rule had been implemented for seemingly missing older versions of the clustered index record, but not for a completely missing clustered index record.

            marko Marko Mäkelä added a comment - I was too quick to report success. If I tweak the row_undo_mod_must_purge() , it will cause a genuine failure like this: 2022-10-13 16:05:56 1004 [Warning] InnoDB: Unpurged clustered index record in table `test`.`t1`: COMPACT RECORD(info_bits=32, 4 fields): {[1] (0x05),[6] (0x000000000CAA),[7]K t( (0x4B0000017428BE),[1] (0x04)} (This was with TINYINT UNSIGNED instead of INT .) The clustered index record is (a,DB_TRX_ID,DB_ROLL_PTR,b)=(5,0xcaa,update,4). What seems to work is a tweak to the new CHECK TABLE…EXTENDED check so that if not everything up to the PAGE_MAX_TRX_ID of the page of the secondary index record has been purged yet, we will not report the missing record. Previously this rule had been implemented for seemingly missing older versions of the clustered index record, but not for a completely missing clustered index record.

            I got the restart test from 2022-10-11 to work reliably (60×100 repetitions). It used to fail some 10% into the test run, even with the above mentioned fix. In trx_lists_init_at_db_start(), we must initialize purge_sys.end_view in a similar way to how purge_sys_t::clone_end_view() does it, to deal with partially completed purge batches.

            marko Marko Mäkelä added a comment - I got the restart test from 2022-10-11 to work reliably (60×100 repetitions). It used to fail some 10% into the test run, even with the above mentioned fix. In trx_lists_init_at_db_start() , we must initialize purge_sys.end_view in a similar way to how purge_sys_t::clone_end_view() does it, to deal with partially completed purge batches.

            I got an RQG grammar that would occasionally report orphan delete-marked secondary index records, similar to MDEV-29666. I analyzed an rr replay trace of such a failure. The cause is a race condition between rollback and purge. Rollback would not remove a secondary index record that was inserted on top of a committed delete-marked record, because purge had not advanced that far yet. After rollback, trx_t::commit() of the ‘empty’ transaction would invoke trx_purge_add_undo_to_history(), which unnecessarily adds the empty undo log to the purge queue. I tried disabling trx_undo_try_truncate() so that the already rolled-back undo log records would be added to the purge queue, but even with that work-around (not a proper fix), the grammar would fail. With that tweak present, several tests that involve BLOBs would fail, presumably because some BLOBs would be prematurely freed.

            The error is not easily fixed. We might relax the check and ‘wave through’ cases where a matching clustered index record cannot be found at all. Other cases (where a record with a primary key is found for a newer version of the record, but not for one that matches the secondary index record) would still be flagged:

            diff --git a/storage/innobase/row/row0sel.cc b/storage/innobase/row/row0sel.cc
            index e1e7ed0e12a..4baa1e66376 100644
            --- a/storage/innobase/row/row0sel.cc
            +++ b/storage/innobase/row/row0sel.cc
            @@ -6399,16 +6399,12 @@ dberr_t row_check_index(row_prebuilt_t *prebuilt, ulint *n_rows)
                     if (!rec_deleted)
                     {
                     not_found:
            -          ib::error() << "Clustered index record not found for index "
            +          ib::fatal() << "Clustered index record not found for index "
                                   << index->name << " of table " << index->table->name
                                   << ": " << rec_offsets_print(rec, offsets);
                       if (prebuilt->autoinc_error == DB_SUCCESS)
                         prebuilt->autoinc_error= DB_CORRUPTION;
                     }
            -        else if (&view == &check_table_extended_view)
            -        extended_not_found:
            -          if (view.changes_visible(page_trx_id))
            -            goto not_found;
                   did_not_find:
                     mtr.rollback_to_savepoint(savepoint);
                     goto next_rec;
            @@ -6587,8 +6583,9 @@ dberr_t row_check_index(row_prebuilt_t *prebuilt, ulint *n_rows)
                         clust_rec version or when trx_undo_prev_version_build()
                         encounters a BLOB that may have been freed according to
                         purge_sys.view (not purge_sys.end_view). */
            -            if (&view == &check_table_extended_view && !got_extended_match)
            -              goto extended_not_found;
            +            if (&view == &check_table_extended_view && !got_extended_match &&
            +                view.changes_visible(page_trx_id))
            +              goto not_found;
                         goto did_not_find;
                       }
             
            

            Alternatively, we could issue warnings for this but not flag the secondary index as corrupted. After all, this type of corruption should only hurt performance, not cause any correctness issues.

            marko Marko Mäkelä added a comment - I got an RQG grammar that would occasionally report orphan delete-marked secondary index records, similar to MDEV-29666 . I analyzed an rr replay trace of such a failure. The cause is a race condition between rollback and purge. Rollback would not remove a secondary index record that was inserted on top of a committed delete-marked record, because purge had not advanced that far yet. After rollback, trx_t::commit() of the ‘empty’ transaction would invoke trx_purge_add_undo_to_history() , which unnecessarily adds the empty undo log to the purge queue. I tried disabling trx_undo_try_truncate() so that the already rolled-back undo log records would be added to the purge queue, but even with that work-around (not a proper fix), the grammar would fail. With that tweak present, several tests that involve BLOBs would fail, presumably because some BLOBs would be prematurely freed. The error is not easily fixed. We might relax the check and ‘wave through’ cases where a matching clustered index record cannot be found at all. Other cases (where a record with a primary key is found for a newer version of the record, but not for one that matches the secondary index record) would still be flagged: diff --git a/storage/innobase/row/row0sel.cc b/storage/innobase/row/row0sel.cc index e1e7ed0e12a..4baa1e66376 100644 --- a/storage/innobase/row/row0sel.cc +++ b/storage/innobase/row/row0sel.cc @@ -6399,16 +6399,12 @@ dberr_t row_check_index(row_prebuilt_t *prebuilt, ulint *n_rows) if (!rec_deleted) { not_found: - ib::error() << "Clustered index record not found for index " + ib::fatal() << "Clustered index record not found for index " << index->name << " of table " << index->table->name << ": " << rec_offsets_print(rec, offsets); if (prebuilt->autoinc_error == DB_SUCCESS) prebuilt->autoinc_error= DB_CORRUPTION; } - else if (&view == &check_table_extended_view) - extended_not_found: - if (view.changes_visible(page_trx_id)) - goto not_found; did_not_find: mtr.rollback_to_savepoint(savepoint); goto next_rec; @@ -6587,8 +6583,9 @@ dberr_t row_check_index(row_prebuilt_t *prebuilt, ulint *n_rows) clust_rec version or when trx_undo_prev_version_build() encounters a BLOB that may have been freed according to purge_sys.view (not purge_sys.end_view). */ - if (&view == &check_table_extended_view && !got_extended_match) - goto extended_not_found; + if (&view == &check_table_extended_view && !got_extended_match && + view.changes_visible(page_trx_id)) + goto not_found; goto did_not_find; } Alternatively, we could issue warnings for this but not flag the secondary index as corrupted. After all, this type of corruption should only hurt performance, not cause any correctness issues.
            mleich Matthias Leich added a comment - - edited

            RQG testing on some older version of origin/bb-10.6-MDEV-24402 and using the following simplified RQG grammars 
            TBR-757-micro.yy
            ---------------------------
            create_table:
                CREATE TABLE IF NOT EXISTS t1 ( non_generated_cols ) ENGINE = InnoDB ROW_FORMAT = Dynamic ;
             
            insert_part:
                INSERT INTO t1 (col1,col2, col_int, col_text ) VALUES ;
             
            my_int:
                { $my_int= 3 } |
                { $my_int= 5 } |
                { $my_int= 7 } ;
             
            non_generated_cols:
                col1 INT PRIMARY KEY, col2 INT, col_int INTEGER , col_text TEXT ;
             
            query:
                ALTER TABLE t1 ADD UNIQUE ( col_text(9) , col_int ) |
                ALTER TABLE t1 ADD UNIQUE ( col_int , col_int ) |
                CHECK TABLE t1 EXTENDED |
                DELETE FROM t1 WHERE col2 = my_int; COMMIT |
                DELETE FROM t1 WHERE col2 = my_int; COMMIT |
                insert_part ( my_int , $my_int, $my_int, NULL ); COMMIT |
                insert_part ( my_int , $my_int, $my_int, NULL ); ROLLBACK ;
             
            thread1:
                query ;
             
            thread1_connect:
                SET AUTOCOMMIT = 0 ;
             
            thread1_init:
                create_table ;
             
            thread_init:
                { sleep 30; exit 0 };
             
            TBR-36-min_EXTENDED.yy
            -----------------------------------------
            insert_part:
                INSERT INTO t1 (col1,col2, col_text) VALUES ( 13 , 13, NULL ) ;
             
            query:
                ALTER TABLE t1 ADD INDEX ( col_text(9) ) , ADD INDEX ( col1 ) |
                CHECK TABLE t1 EXTENDED |
                DELETE FROM t1 WHERE col2 = 13 OR col2 IS NULL ;
             
            thread1:
                query ;
             
            thread1_connect:
                ;
             
            thread1_init:
                CREATE TABLE t1 ( col1 INT PRIMARY KEY, col2 INT, col_text TEXT ) ENGINE = InnoDB ROW_FORMAT = Dynamic ;
             
            thread2:
                insert_part ;
             
            thread2_connect:
                ;
             
            thread2_init:
                ;
             
            thread3:
                { sleep 30 ; exit 0 } ;
             
            thread3_connect:
                ;
             
            thread3_init:
                ;
             
            thread_init:
                { sleep 30 ; exit 0 } ;
             
            showed non rare as result of CHECK TABLE
            - [ 'TBR-757', 'ERROR: ->.{1,100}check Warning InnoDB: The B-tree of index .{1,30} is corrupted.<-.' ],
            - [ 'TBR-36', 'InnoDB: Index .{1,150} contains .{1,10} entries, should be .{1,10}.' ], 
              [ 'TBR-36', 'failed: 1712 , errstr: Index .{1,100} is corrupted' ],
              [ 'TBR-36', 'failed: 1712 Index .* is corrupted' ],
             
            Attempts to replay these problems on
            origin/bb-10.6-MDEV-24402 435729fba417b5afcf759f05f3d6a1c62d3de058 2022-10-13T16:50:59+03:00 + the patch relax_check_extended.patch.diff
            were given up after a few thousand tests "passed".
             
            Nevertheless the bad effect 
                 [ 'TBR-757', 'ERROR: ->.{1,100}check Warning InnoDB: The B-tree of index .{1,30} is corrupted.<-.'
            occurs again if using the full (== non simplified) grammar conf/mariadb/table_stress_innodb_nocopy1.yy.
            The automatic RQG test simplifier is in the moment running.
            
            

            mleich Matthias Leich added a comment - - edited RQG testing on some older version of origin/bb-10.6-MDEV-24402 and using the following simplified RQG grammars TBR-757-micro.yy --------------------------- create_table: CREATE TABLE IF NOT EXISTS t1 ( non_generated_cols ) ENGINE = InnoDB ROW_FORMAT = Dynamic ;   insert_part: INSERT INTO t1 (col1,col2, col_int, col_text ) VALUES ;   my_int: { $my_int= 3 } | { $my_int= 5 } | { $my_int= 7 } ;   non_generated_cols: col1 INT PRIMARY KEY, col2 INT, col_int INTEGER , col_text TEXT ;   query: ALTER TABLE t1 ADD UNIQUE ( col_text(9) , col_int ) | ALTER TABLE t1 ADD UNIQUE ( col_int , col_int ) | CHECK TABLE t1 EXTENDED | DELETE FROM t1 WHERE col2 = my_int; COMMIT | DELETE FROM t1 WHERE col2 = my_int; COMMIT | insert_part ( my_int , $my_int, $my_int, NULL ); COMMIT | insert_part ( my_int , $my_int, $my_int, NULL ); ROLLBACK ;   thread1: query ;   thread1_connect: SET AUTOCOMMIT = 0 ;   thread1_init: create_table ;   thread_init: { sleep 30; exit 0 };   TBR-36-min_EXTENDED.yy ----------------------------------------- insert_part: INSERT INTO t1 (col1,col2, col_text) VALUES ( 13 , 13, NULL ) ;   query: ALTER TABLE t1 ADD INDEX ( col_text(9) ) , ADD INDEX ( col1 ) | CHECK TABLE t1 EXTENDED | DELETE FROM t1 WHERE col2 = 13 OR col2 IS NULL ;   thread1: query ;   thread1_connect: ;   thread1_init: CREATE TABLE t1 ( col1 INT PRIMARY KEY, col2 INT, col_text TEXT ) ENGINE = InnoDB ROW_FORMAT = Dynamic ;   thread2: insert_part ;   thread2_connect: ;   thread2_init: ;   thread3: { sleep 30 ; exit 0 } ;   thread3_connect: ;   thread3_init: ;   thread_init: { sleep 30 ; exit 0 } ;   showed non rare as result of CHECK TABLE - [ 'TBR-757', 'ERROR: ->.{1,100}check Warning InnoDB: The B-tree of index .{1,30} is corrupted.<-.' ], - [ 'TBR-36', 'InnoDB: Index .{1,150} contains .{1,10} entries, should be .{1,10}.' ], [ 'TBR-36', 'failed: 1712 , errstr: Index .{1,100} is corrupted' ], [ 'TBR-36', 'failed: 1712 Index .* is corrupted' ],   Attempts to replay these problems on origin/bb-10.6-MDEV-24402 435729fba417b5afcf759f05f3d6a1c62d3de058 2022-10-13T16:50:59+03:00 + the patch relax_check_extended.patch.diff were given up after a few thousand tests "passed".   Nevertheless the bad effect [ 'TBR-757', 'ERROR: ->.{1,100}check Warning InnoDB: The B-tree of index .{1,30} is corrupted.<-.' occurs again if using the full (== non simplified) grammar conf/mariadb/table_stress_innodb_nocopy1.yy. The automatic RQG test simplifier is in the moment running.

            The following simplified test applied to origin/bb-10.6-MDEV-24402 (see above)
            TBR-36--TBR-757_2-micro.yy
            ------------------------------------------
            __clone__1:
                { $my_int= 2 } |
                { $my_int= 6 } |
                { $my_int= 8 } ;
            __clone__2:
                { $my_int= 2 } |
                { $my_int= 6 } |
                { $my_int= 8 } ;
            insert_part:
                INSERT INTO t3 (col1,col2, col_int, col_string, col_text ) VALUES ;
            non_generated_cols:
                col1 INT PRIMARY KEY, col2 INT, col_int INTEGER, col_string INTEGER, col_text TEXT ;
            query:
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 ADD UNIQUE KEY ( col_int ) , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 ADD UNIQUE KEY ( col_int ) , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 ADD UNIQUE KEY ( col_int ) , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                CHECK TABLE t3 EXTENDED |
                CHECK TABLE t3 EXTENDED |
                CHECK TABLE t3 EXTENDED |
                UPDATE t3 SET col_int = __clone__2 ORDER BY col1 DESC LIMIT 2 ; ROLLBACK |
                UPDATE t3 SET col_int = __clone__1 ORDER BY col1 DESC LIMIT 2 ; ROLLBACK |
                insert_part ( __clone__2 , $my_int, $my_int, '', '' ); ROLLBACK |
                insert_part ( __clone__1 , $my_int, $my_int, '', '' ); ROLLBACK ;
            thread1:
                query ;
            thread1_connect:
                ;
            thread1_init:
                CREATE TABLE IF NOT EXISTS t3 ( non_generated_cols ) ENGINE = InnoDB ROW_FORMAT = Compact ;
            thread_init:
                { sleep 30 ; exit 0 };
             
            harvested during CHECK TABLE    check Warning InnoDB: The B-tree of index .{1,30} is corrupted.
            In case of origin/bb-10.6-MDEV-24402 87e4f0dde4877090b6948c2734fbd37c860e216b 2022-10-18T10:20:02+03:00 + a patch which disables MDEV-371
            the CHECK TABLE harvests now a warning like 'InnoDB: Clustered index record not found for index col_int of table test.t3 ...'.
            Per Marko: This is no error.
            

            mleich Matthias Leich added a comment - The following simplified test applied to origin/bb-10.6-MDEV-24402 (see above) TBR-36--TBR-757_2-micro.yy ------------------------------------------ __clone__1: { $my_int= 2 } | { $my_int= 6 } | { $my_int= 8 } ; __clone__2: { $my_int= 2 } | { $my_int= 6 } | { $my_int= 8 } ; insert_part: INSERT INTO t3 (col1,col2, col_int, col_string, col_text ) VALUES ; non_generated_cols: col1 INT PRIMARY KEY, col2 INT, col_int INTEGER, col_string INTEGER, col_text TEXT ; query: ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 ADD UNIQUE KEY ( col_int ) , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 ADD UNIQUE KEY ( col_int ) , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 ADD UNIQUE KEY ( col_int ) , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | CHECK TABLE t3 EXTENDED | CHECK TABLE t3 EXTENDED | CHECK TABLE t3 EXTENDED | UPDATE t3 SET col_int = __clone__2 ORDER BY col1 DESC LIMIT 2 ; ROLLBACK | UPDATE t3 SET col_int = __clone__1 ORDER BY col1 DESC LIMIT 2 ; ROLLBACK | insert_part ( __clone__2 , $my_int, $my_int, '', '' ); ROLLBACK | insert_part ( __clone__1 , $my_int, $my_int, '', '' ); ROLLBACK ; thread1: query ; thread1_connect: ; thread1_init: CREATE TABLE IF NOT EXISTS t3 ( non_generated_cols ) ENGINE = InnoDB ROW_FORMAT = Compact ; thread_init: { sleep 30 ; exit 0 };   harvested during CHECK TABLE check Warning InnoDB: The B-tree of index .{1,30} is corrupted. In case of origin/bb-10.6-MDEV-24402 87e4f0dde4877090b6948c2734fbd37c860e216b 2022-10-18T10:20:02+03:00 + a patch which disables MDEV-371 the CHECK TABLE harvests now a warning like 'InnoDB: Clustered index record not found for index col_int of table test.t3 ...'. Per Marko: This is no error.
            marko Marko Mäkelä added a comment - - edited

            Thank you, mleich. The test case corresponds to the scenario that I was analyzing yesterday, now filed as MDEV-29823. It is a ‘memory leak’ (failure to remove a secondary index that is no longer needed), but not one that can be fixed easily. Therefore, I would report it as a warning but not flag the secondary index as corrupted. MVCC reads as well as row_vers_impl_x_locked_low() will silently ignore delete-marked secondary index records for which no counterpart exists in the clustered index.

            marko Marko Mäkelä added a comment - - edited Thank you, mleich . The test case corresponds to the scenario that I was analyzing yesterday, now filed as MDEV-29823 . It is a ‘memory leak’ (failure to remove a secondary index that is no longer needed), but not one that can be fixed easily. Therefore, I would report it as a warning but not flag the secondary index as corrupted. MVCC reads as well as row_vers_impl_x_locked_low() will silently ignore delete-marked secondary index records for which no counterpart exists in the clustered index.

            The logic of row_check_index() and the code around purge_sys.end_view looks correct to me.

            vlad.lesin Vladislav Lesin added a comment - The logic of row_check_index() and the code around purge_sys.end_view looks correct to me.

            People

              marko Marko Mäkelä
              hholzgra Hartmut Holzgraefe
              Votes:
              2 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.