Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-24402

CHECK TABLE may miss some cases of index inconsistencies

Details

    Description

      Looks as if some cases of InnoDB index inconsistencies – see e.g. MDEV-22373 – are missed by CHECK TABLE.

      InnoDB currently ignores any other CHECK TABLE attributes than QUICK (which will suppress the time-consuming checks of the upper levels of the index trees).

      Most importantly, InnoDB is ignoring the attribute EXTENDED, and is essentially only performing a MVCC COUNT(*) in each index. A verified case exists where the record counts in the indexes would match, yet the records do not match between the indexes. A true EXTENDED check should report any mismatch between the clustered index (primary key) and the secondary index records.

      Note: By the design of InnoDB MVCC (see MDEV-17598 for some discussion), secondary indexes may contain multiple copies of a record, while in the clustered index the multiple versions are formed via a singly-linked list that the undo log pages are part of.

      An extremely extended test would have to check that in any possible read view, the secondary indexes match the clustered index, instead of only checking the current read view.

      It may turn out that only a special ‘locking’ variant of CHECK TABLE would be able to detect truly any type of mismatch.

      Attachments

        Issue Links

          Activity

            I got an RQG grammar that would occasionally report orphan delete-marked secondary index records, similar to MDEV-29666. I analyzed an rr replay trace of such a failure. The cause is a race condition between rollback and purge. Rollback would not remove a secondary index record that was inserted on top of a committed delete-marked record, because purge had not advanced that far yet. After rollback, trx_t::commit() of the ‘empty’ transaction would invoke trx_purge_add_undo_to_history(), which unnecessarily adds the empty undo log to the purge queue. I tried disabling trx_undo_try_truncate() so that the already rolled-back undo log records would be added to the purge queue, but even with that work-around (not a proper fix), the grammar would fail. With that tweak present, several tests that involve BLOBs would fail, presumably because some BLOBs would be prematurely freed.

            The error is not easily fixed. We might relax the check and ‘wave through’ cases where a matching clustered index record cannot be found at all. Other cases (where a record with a primary key is found for a newer version of the record, but not for one that matches the secondary index record) would still be flagged:

            diff --git a/storage/innobase/row/row0sel.cc b/storage/innobase/row/row0sel.cc
            index e1e7ed0e12a..4baa1e66376 100644
            --- a/storage/innobase/row/row0sel.cc
            +++ b/storage/innobase/row/row0sel.cc
            @@ -6399,16 +6399,12 @@ dberr_t row_check_index(row_prebuilt_t *prebuilt, ulint *n_rows)
                     if (!rec_deleted)
                     {
                     not_found:
            -          ib::error() << "Clustered index record not found for index "
            +          ib::fatal() << "Clustered index record not found for index "
                                   << index->name << " of table " << index->table->name
                                   << ": " << rec_offsets_print(rec, offsets);
                       if (prebuilt->autoinc_error == DB_SUCCESS)
                         prebuilt->autoinc_error= DB_CORRUPTION;
                     }
            -        else if (&view == &check_table_extended_view)
            -        extended_not_found:
            -          if (view.changes_visible(page_trx_id))
            -            goto not_found;
                   did_not_find:
                     mtr.rollback_to_savepoint(savepoint);
                     goto next_rec;
            @@ -6587,8 +6583,9 @@ dberr_t row_check_index(row_prebuilt_t *prebuilt, ulint *n_rows)
                         clust_rec version or when trx_undo_prev_version_build()
                         encounters a BLOB that may have been freed according to
                         purge_sys.view (not purge_sys.end_view). */
            -            if (&view == &check_table_extended_view && !got_extended_match)
            -              goto extended_not_found;
            +            if (&view == &check_table_extended_view && !got_extended_match &&
            +                view.changes_visible(page_trx_id))
            +              goto not_found;
                         goto did_not_find;
                       }
             
            

            Alternatively, we could issue warnings for this but not flag the secondary index as corrupted. After all, this type of corruption should only hurt performance, not cause any correctness issues.

            marko Marko Mäkelä added a comment - I got an RQG grammar that would occasionally report orphan delete-marked secondary index records, similar to MDEV-29666 . I analyzed an rr replay trace of such a failure. The cause is a race condition between rollback and purge. Rollback would not remove a secondary index record that was inserted on top of a committed delete-marked record, because purge had not advanced that far yet. After rollback, trx_t::commit() of the ‘empty’ transaction would invoke trx_purge_add_undo_to_history() , which unnecessarily adds the empty undo log to the purge queue. I tried disabling trx_undo_try_truncate() so that the already rolled-back undo log records would be added to the purge queue, but even with that work-around (not a proper fix), the grammar would fail. With that tweak present, several tests that involve BLOBs would fail, presumably because some BLOBs would be prematurely freed. The error is not easily fixed. We might relax the check and ‘wave through’ cases where a matching clustered index record cannot be found at all. Other cases (where a record with a primary key is found for a newer version of the record, but not for one that matches the secondary index record) would still be flagged: diff --git a/storage/innobase/row/row0sel.cc b/storage/innobase/row/row0sel.cc index e1e7ed0e12a..4baa1e66376 100644 --- a/storage/innobase/row/row0sel.cc +++ b/storage/innobase/row/row0sel.cc @@ -6399,16 +6399,12 @@ dberr_t row_check_index(row_prebuilt_t *prebuilt, ulint *n_rows) if (!rec_deleted) { not_found: - ib::error() << "Clustered index record not found for index " + ib::fatal() << "Clustered index record not found for index " << index->name << " of table " << index->table->name << ": " << rec_offsets_print(rec, offsets); if (prebuilt->autoinc_error == DB_SUCCESS) prebuilt->autoinc_error= DB_CORRUPTION; } - else if (&view == &check_table_extended_view) - extended_not_found: - if (view.changes_visible(page_trx_id)) - goto not_found; did_not_find: mtr.rollback_to_savepoint(savepoint); goto next_rec; @@ -6587,8 +6583,9 @@ dberr_t row_check_index(row_prebuilt_t *prebuilt, ulint *n_rows) clust_rec version or when trx_undo_prev_version_build() encounters a BLOB that may have been freed according to purge_sys.view (not purge_sys.end_view). */ - if (&view == &check_table_extended_view && !got_extended_match) - goto extended_not_found; + if (&view == &check_table_extended_view && !got_extended_match && + view.changes_visible(page_trx_id)) + goto not_found; goto did_not_find; } Alternatively, we could issue warnings for this but not flag the secondary index as corrupted. After all, this type of corruption should only hurt performance, not cause any correctness issues.
            mleich Matthias Leich added a comment - - edited

            RQG testing on some older version of origin/bb-10.6-MDEV-24402 and using the following simplified RQG grammars 
            TBR-757-micro.yy
            ---------------------------
            create_table:
                CREATE TABLE IF NOT EXISTS t1 ( non_generated_cols ) ENGINE = InnoDB ROW_FORMAT = Dynamic ;
             
            insert_part:
                INSERT INTO t1 (col1,col2, col_int, col_text ) VALUES ;
             
            my_int:
                { $my_int= 3 } |
                { $my_int= 5 } |
                { $my_int= 7 } ;
             
            non_generated_cols:
                col1 INT PRIMARY KEY, col2 INT, col_int INTEGER , col_text TEXT ;
             
            query:
                ALTER TABLE t1 ADD UNIQUE ( col_text(9) , col_int ) |
                ALTER TABLE t1 ADD UNIQUE ( col_int , col_int ) |
                CHECK TABLE t1 EXTENDED |
                DELETE FROM t1 WHERE col2 = my_int; COMMIT |
                DELETE FROM t1 WHERE col2 = my_int; COMMIT |
                insert_part ( my_int , $my_int, $my_int, NULL ); COMMIT |
                insert_part ( my_int , $my_int, $my_int, NULL ); ROLLBACK ;
             
            thread1:
                query ;
             
            thread1_connect:
                SET AUTOCOMMIT = 0 ;
             
            thread1_init:
                create_table ;
             
            thread_init:
                { sleep 30; exit 0 };
             
            TBR-36-min_EXTENDED.yy
            -----------------------------------------
            insert_part:
                INSERT INTO t1 (col1,col2, col_text) VALUES ( 13 , 13, NULL ) ;
             
            query:
                ALTER TABLE t1 ADD INDEX ( col_text(9) ) , ADD INDEX ( col1 ) |
                CHECK TABLE t1 EXTENDED |
                DELETE FROM t1 WHERE col2 = 13 OR col2 IS NULL ;
             
            thread1:
                query ;
             
            thread1_connect:
                ;
             
            thread1_init:
                CREATE TABLE t1 ( col1 INT PRIMARY KEY, col2 INT, col_text TEXT ) ENGINE = InnoDB ROW_FORMAT = Dynamic ;
             
            thread2:
                insert_part ;
             
            thread2_connect:
                ;
             
            thread2_init:
                ;
             
            thread3:
                { sleep 30 ; exit 0 } ;
             
            thread3_connect:
                ;
             
            thread3_init:
                ;
             
            thread_init:
                { sleep 30 ; exit 0 } ;
             
            showed non rare as result of CHECK TABLE
            - [ 'TBR-757', 'ERROR: ->.{1,100}check Warning InnoDB: The B-tree of index .{1,30} is corrupted.<-.' ],
            - [ 'TBR-36', 'InnoDB: Index .{1,150} contains .{1,10} entries, should be .{1,10}.' ], 
              [ 'TBR-36', 'failed: 1712 , errstr: Index .{1,100} is corrupted' ],
              [ 'TBR-36', 'failed: 1712 Index .* is corrupted' ],
             
            Attempts to replay these problems on
            origin/bb-10.6-MDEV-24402 435729fba417b5afcf759f05f3d6a1c62d3de058 2022-10-13T16:50:59+03:00 + the patch relax_check_extended.patch.diff
            were given up after a few thousand tests "passed".
             
            Nevertheless the bad effect 
                 [ 'TBR-757', 'ERROR: ->.{1,100}check Warning InnoDB: The B-tree of index .{1,30} is corrupted.<-.'
            occurs again if using the full (== non simplified) grammar conf/mariadb/table_stress_innodb_nocopy1.yy.
            The automatic RQG test simplifier is in the moment running.
            
            

            mleich Matthias Leich added a comment - - edited RQG testing on some older version of origin/bb-10.6-MDEV-24402 and using the following simplified RQG grammars TBR-757-micro.yy --------------------------- create_table: CREATE TABLE IF NOT EXISTS t1 ( non_generated_cols ) ENGINE = InnoDB ROW_FORMAT = Dynamic ;   insert_part: INSERT INTO t1 (col1,col2, col_int, col_text ) VALUES ;   my_int: { $my_int= 3 } | { $my_int= 5 } | { $my_int= 7 } ;   non_generated_cols: col1 INT PRIMARY KEY, col2 INT, col_int INTEGER , col_text TEXT ;   query: ALTER TABLE t1 ADD UNIQUE ( col_text(9) , col_int ) | ALTER TABLE t1 ADD UNIQUE ( col_int , col_int ) | CHECK TABLE t1 EXTENDED | DELETE FROM t1 WHERE col2 = my_int; COMMIT | DELETE FROM t1 WHERE col2 = my_int; COMMIT | insert_part ( my_int , $my_int, $my_int, NULL ); COMMIT | insert_part ( my_int , $my_int, $my_int, NULL ); ROLLBACK ;   thread1: query ;   thread1_connect: SET AUTOCOMMIT = 0 ;   thread1_init: create_table ;   thread_init: { sleep 30; exit 0 };   TBR-36-min_EXTENDED.yy ----------------------------------------- insert_part: INSERT INTO t1 (col1,col2, col_text) VALUES ( 13 , 13, NULL ) ;   query: ALTER TABLE t1 ADD INDEX ( col_text(9) ) , ADD INDEX ( col1 ) | CHECK TABLE t1 EXTENDED | DELETE FROM t1 WHERE col2 = 13 OR col2 IS NULL ;   thread1: query ;   thread1_connect: ;   thread1_init: CREATE TABLE t1 ( col1 INT PRIMARY KEY, col2 INT, col_text TEXT ) ENGINE = InnoDB ROW_FORMAT = Dynamic ;   thread2: insert_part ;   thread2_connect: ;   thread2_init: ;   thread3: { sleep 30 ; exit 0 } ;   thread3_connect: ;   thread3_init: ;   thread_init: { sleep 30 ; exit 0 } ;   showed non rare as result of CHECK TABLE - [ 'TBR-757', 'ERROR: ->.{1,100}check Warning InnoDB: The B-tree of index .{1,30} is corrupted.<-.' ], - [ 'TBR-36', 'InnoDB: Index .{1,150} contains .{1,10} entries, should be .{1,10}.' ], [ 'TBR-36', 'failed: 1712 , errstr: Index .{1,100} is corrupted' ], [ 'TBR-36', 'failed: 1712 Index .* is corrupted' ],   Attempts to replay these problems on origin/bb-10.6-MDEV-24402 435729fba417b5afcf759f05f3d6a1c62d3de058 2022-10-13T16:50:59+03:00 + the patch relax_check_extended.patch.diff were given up after a few thousand tests "passed".   Nevertheless the bad effect [ 'TBR-757', 'ERROR: ->.{1,100}check Warning InnoDB: The B-tree of index .{1,30} is corrupted.<-.' occurs again if using the full (== non simplified) grammar conf/mariadb/table_stress_innodb_nocopy1.yy. The automatic RQG test simplifier is in the moment running.

            The following simplified test applied to origin/bb-10.6-MDEV-24402 (see above)
            TBR-36--TBR-757_2-micro.yy
            ------------------------------------------
            __clone__1:
                { $my_int= 2 } |
                { $my_int= 6 } |
                { $my_int= 8 } ;
            __clone__2:
                { $my_int= 2 } |
                { $my_int= 6 } |
                { $my_int= 8 } ;
            insert_part:
                INSERT INTO t3 (col1,col2, col_int, col_string, col_text ) VALUES ;
            non_generated_cols:
                col1 INT PRIMARY KEY, col2 INT, col_int INTEGER, col_string INTEGER, col_text TEXT ;
            query:
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 ADD UNIQUE KEY ( col_int ) , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 ADD UNIQUE KEY ( col_int ) , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                ALTER TABLE t3 ADD UNIQUE KEY ( col_int ) , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE |
                CHECK TABLE t3 EXTENDED |
                CHECK TABLE t3 EXTENDED |
                CHECK TABLE t3 EXTENDED |
                UPDATE t3 SET col_int = __clone__2 ORDER BY col1 DESC LIMIT 2 ; ROLLBACK |
                UPDATE t3 SET col_int = __clone__1 ORDER BY col1 DESC LIMIT 2 ; ROLLBACK |
                insert_part ( __clone__2 , $my_int, $my_int, '', '' ); ROLLBACK |
                insert_part ( __clone__1 , $my_int, $my_int, '', '' ); ROLLBACK ;
            thread1:
                query ;
            thread1_connect:
                ;
            thread1_init:
                CREATE TABLE IF NOT EXISTS t3 ( non_generated_cols ) ENGINE = InnoDB ROW_FORMAT = Compact ;
            thread_init:
                { sleep 30 ; exit 0 };
             
            harvested during CHECK TABLE    check Warning InnoDB: The B-tree of index .{1,30} is corrupted.
            In case of origin/bb-10.6-MDEV-24402 87e4f0dde4877090b6948c2734fbd37c860e216b 2022-10-18T10:20:02+03:00 + a patch which disables MDEV-371
            the CHECK TABLE harvests now a warning like 'InnoDB: Clustered index record not found for index col_int of table test.t3 ...'.
            Per Marko: This is no error.
            

            mleich Matthias Leich added a comment - The following simplified test applied to origin/bb-10.6-MDEV-24402 (see above) TBR-36--TBR-757_2-micro.yy ------------------------------------------ __clone__1: { $my_int= 2 } | { $my_int= 6 } | { $my_int= 8 } ; __clone__2: { $my_int= 2 } | { $my_int= 6 } | { $my_int= 8 } ; insert_part: INSERT INTO t3 (col1,col2, col_int, col_string, col_text ) VALUES ; non_generated_cols: col1 INT PRIMARY KEY, col2 INT, col_int INTEGER, col_string INTEGER, col_text TEXT ; query: ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 ADD UNIQUE KEY ( col_int ) , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 ADD UNIQUE KEY ( col_int ) , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | ALTER TABLE t3 ADD UNIQUE KEY ( col_int ) , ALGORITHM = NOCOPY , LOCK = EXCLUSIVE | CHECK TABLE t3 EXTENDED | CHECK TABLE t3 EXTENDED | CHECK TABLE t3 EXTENDED | UPDATE t3 SET col_int = __clone__2 ORDER BY col1 DESC LIMIT 2 ; ROLLBACK | UPDATE t3 SET col_int = __clone__1 ORDER BY col1 DESC LIMIT 2 ; ROLLBACK | insert_part ( __clone__2 , $my_int, $my_int, '', '' ); ROLLBACK | insert_part ( __clone__1 , $my_int, $my_int, '', '' ); ROLLBACK ; thread1: query ; thread1_connect: ; thread1_init: CREATE TABLE IF NOT EXISTS t3 ( non_generated_cols ) ENGINE = InnoDB ROW_FORMAT = Compact ; thread_init: { sleep 30 ; exit 0 };   harvested during CHECK TABLE check Warning InnoDB: The B-tree of index .{1,30} is corrupted. In case of origin/bb-10.6-MDEV-24402 87e4f0dde4877090b6948c2734fbd37c860e216b 2022-10-18T10:20:02+03:00 + a patch which disables MDEV-371 the CHECK TABLE harvests now a warning like 'InnoDB: Clustered index record not found for index col_int of table test.t3 ...'. Per Marko: This is no error.
            marko Marko Mäkelä added a comment - - edited

            Thank you, mleich. The test case corresponds to the scenario that I was analyzing yesterday, now filed as MDEV-29823. It is a ‘memory leak’ (failure to remove a secondary index that is no longer needed), but not one that can be fixed easily. Therefore, I would report it as a warning but not flag the secondary index as corrupted. MVCC reads as well as row_vers_impl_x_locked_low() will silently ignore delete-marked secondary index records for which no counterpart exists in the clustered index.

            marko Marko Mäkelä added a comment - - edited Thank you, mleich . The test case corresponds to the scenario that I was analyzing yesterday, now filed as MDEV-29823 . It is a ‘memory leak’ (failure to remove a secondary index that is no longer needed), but not one that can be fixed easily. Therefore, I would report it as a warning but not flag the secondary index as corrupted. MVCC reads as well as row_vers_impl_x_locked_low() will silently ignore delete-marked secondary index records for which no counterpart exists in the clustered index.

            The logic of row_check_index() and the code around purge_sys.end_view looks correct to me.

            vlad.lesin Vladislav Lesin added a comment - The logic of row_check_index() and the code around purge_sys.end_view looks correct to me.

            People

              marko Marko Mäkelä
              hholzgra Hartmut Holzgraefe
              Votes:
              2 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.