[MDEV-27746] Wrong comparision of BLOB's empty preffix with non-preffixed BLOB causes rows count mismatch for clustered and secondary indexes during non-locking read Created: 2022-02-04  Updated: 2022-02-15  Resolved: 2022-02-11

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6.0, 10.6.1, 10.6.2, 10.6.3, 10.6.4, 10.6.5, 10.6.6, 10.7.2, 10.6, 10.8
Fix Version/s: 10.2.44, 10.3.35, 10.4.25, 10.5.16, 10.6.8, 10.7.4, 10.8.3

Type: Bug Priority: Critical
Reporter: Matthias Leich Assignee: Vladislav Lesin
Resolution: Fixed Votes: 1
Labels: BLOB, corruption, purge, race, regression-10.6, rr-profile-analyzed

Attachments: File Corruption.test     File MDEV-27746.test    
Issue Links:
Relates
relates to MDEV-27788 avoid BLOB fetch in row_sel_sec_rec_i... Open
relates to MDEV-25029 Reduce dict_sys mutex contention for ... Closed

 Description   

origin/10.7 commit 3351dfaab0599268eaf25f9d6995ef128910a8b9
 
SET GLOBAL default_storage_engine = 'InnoDB' ;
SET SESSION sql_mode = 'NO_ENGINE_SUBSTITUTION' ;
CREATE TABLE IF NOT EXISTS t1 ( col_text TEXT NOT NULL, KEY ( col_text(9) ) ) ENGINE = InnoDB ROW_FORMAT = Dynamic ;
SET @fill_amount = (@@innodb_page_size / 2 ) + 1 ;
INSERT INTO t1 (col_text) VALUES ( REPEAT(SUBSTR(CAST( 24 - 1 AS CHAR),1,1), @fill_amount) ) ;
UPDATE t1 SET col_text = REPEAT(SUBSTR(CAST( NULL AS CHAR),1,1), @fill_amount) ;
Warnings:
Warning 1048    Column 'col_text' cannot be null
UPDATE t1 SET col_text = REPEAT(SUBSTR(CAST( NULL AS CHAR),1,1), @fill_amount) ;
Warnings:
Warning 1048    Column 'col_text' cannot be null
UPDATE t1 SET col_text = REPEAT(SUBSTR(CAST( 18 AS CHAR),1,1), @fill_amount) ;
CHECK TABLE t1 ;
Table   Op      Msg_type        Msg_text
test.t1 check   Warning InnoDB: Index 'col_text' contains 2 entries, should be 1.
test.t1 check   error   Corrupt
 
I cannot exclude that the uploaded MTR test contains more SQL than required for showing the problem.
Warning:
The layout of the table with one column only comes from the attempt to get the smallest test possible.
In case the table contains additional columns of type INT and/or CHAR/VARCHAR I get the same bad effect.



 Comments   
Comment by Marko Mäkelä [ 2022-02-04 ]

I can repeat this on a 10.8 based branch. Here is a little simpler test case:

--source include/have_innodb.inc
SET sql_mode='';
CREATE TABLE t1 (col_text TEXT NOT NULL, KEY (col_text(9))) ENGINE=InnoDB;
SET @fill_amount = (@@innodb_page_size / 2 ) + 1 ;
INSERT INTO t1 (col_text) VALUES (REPEAT('x', @fill_amount));
UPDATE t1 SET col_text=NULL;
UPDATE t1 SET col_text = REPEAT('y', @fill_amount);
CHECK TABLE t1 ;
DROP TABLE t1;

We need a non-default (non-strict) sql_mode that will allow a NULL value to be inserted into a NOT NULL column.

If I make the REPEAT arguments shorter, so that the record will fit in the clustered index page, then CHECK TABLE will not fail.

Comment by Marko Mäkelä [ 2022-02-04 ]

The failure is not repeatable if I disable the purge of history (innodb_force_recovery=2), or if I add wait_all_purged.inc at any point between INSERT and CHECK TABLE, or if I add a server restart before the CHECK TABLE.

The test does fail if I replace the invalid NULL with the empty string ''. Here is a simpler test that I think removes any doubt of the problem being outside InnoDB:

--source include/have_innodb.inc
CREATE TABLE t1 (col_text TEXT NOT NULL, KEY (col_text(9))) ENGINE=InnoDB;
SET @fill_amount = (@@innodb_page_size / 2 ) + 1 ;
INSERT INTO t1 (col_text) VALUES (REPEAT('x', @fill_amount));
UPDATE t1 SET col_text='';
UPDATE t1 SET col_text=REPEAT('y', @fill_amount);
CHECK TABLE t1;
DROP TABLE t1;

mleich, can you please try to generate an rr replay trace of this? My attempts failed so far.

I expect that 10.6 and likely also older versions are affected.

Comment by Matthias Leich [ 2022-02-04 ]

sdp:/dev/shm/vardir/86/log/mysqld.1.rr

Comment by Marko Mäkelä [ 2022-02-04 ]

I created a core dump:

diff --git a/storage/innobase/dict/dict0dict.cc b/storage/innobase/dict/dict0dict.cc
index 3b4b280d1da..121ff13fba0 100644
--- a/storage/innobase/dict/dict0dict.cc
+++ b/storage/innobase/dict/dict0dict.cc
@@ -4138,6 +4138,7 @@ void dict_set_corrupted(dict_index_t *index, const char *ctx, bool dict_locked)
 		dict_sys.lock(SRW_LOCK_CALL);
 	}
 
+	abort();
 	ut_ad(dict_sys.locked());
 	ut_ad(!dict_table_is_comp(dict_sys.sys_tables));
 	ut_ad(!dict_table_is_comp(dict_sys.sys_indexes));

In the buffer pool of the core dump, I find the following in the clustered index:
(DB_ROW_ID=0x200,DB_TRX_ID=0x1b,DB_ROLL_PTR=0x07000001380580,col_text=(pointer to 'y…' in page 6))
In the secondary index, we have the following:
(col_text,DB_ROW_ID)=delete_marked('',0x200),delete_marked('xxxxxxxxx',0x200),('yyyyyyyyy',0x200)

So far, it looks good. Let us dereference the DB_ROLL_PTR and fetch the 2 previous versions.
We need to fetch the undo page 0x138, byte position 0x580:

00000580: 05a8 0c00 1200 0000 0000 18c5 0000 0137  ...............7
00000590: 02a2 0600 0000 0002 0001 0300 000a 0006  ................
000005a0: 0000 0000 0200 0580 0000 0000 0000 0000  ................

We can see a next-record pointer to 0x5a8, and at 0x5a6 the back-pointer to the start of this record.
The undo record type is 0x0c (TRX_UNDO_UPD_EXIST_REC=12).
It is followed by the undo number (0, it was the only row modified by this transaction 0x1b), the table ID 0x12, the record info_bits (0), the previous (DB_TRX_ID,DB_ROLL_PTR) in encoded format (0x18 and something pointing to undo page 0x137, offset 0x2a2). That is then followed by the primary key {{DB_ROW_ID=0x200) and some data. We already know that the previous version of this record was the empty string, so I will not decode that.

Next, let us dereference the DB_ROLL_PTR of the previous version. On page 0x137 at byte offset 0x2a2 we can find the following:

000002a0: 0000 02ee 8c00 1200 0000 0000 16e0 8400  ................
000002b0: 0001 3602 7d06 0000 0000 0200 0103 f0ff  ..6.}...........
000002c0: ffbf ff14 1d78 7878 7878 7878 7878 0000  .....xxxxxxxxx..
000002d0: 0005 0000 0005 0000 0026 0000 0000 0000  .........&......
000002e0: 2001 000a 0006 0000 0000 0200 02a2 0000   ...............

The undo log record type is 0x8c (TRX_UNDO_UPD_EXIST_REC | TRX_UNDO_UPD_EXTERN), the undo number is again 0, and this is again table 0x12 and the info_bits are 0.
The previous (DB_TRX_ID,DB_ROLL_PTR) are 0x16 and something at page 0x136, offset 0x27d. After that we can see the column prefix for the secondary index and BLOB pointer (page 5 contains 'x…' as expected).

Again, let us dereference the DB_ROLL_PTR of the previous version. On page 0x136 at byte offset 0x27d we can find the following:

00000270: 0000 0000 0000 0000 0000 0000 0002 8b0b  ................
00000280: 0012 0600 0000 0002 0002 7d00 0000 0000  ..........}.....

This is the first version of the record, as expected, of the type TRX_UNDO_INSERT_REC=11.
We will only find the DB_ROW_ID=0x200 here, but not any BLOB pointer.
We would insert the record with a zero BLOB pointer, and then update the BLOB afterwards in a new mini-transaction.

I think that an analysis of an rr replay trace may be needed to find out which of the 2 delete-marked records in the secondary index we counted as existing in the current read view, and why.

Comment by Marko Mäkelä [ 2022-02-04 ]

To check if MDEV-20301 could have caused this (and the MDEV-22924 fix would have been insufficient), I tried the following patch to disable the cache:

diff --git a/storage/innobase/row/row0sel.cc b/storage/innobase/row/row0sel.cc
index ab4da856d1d..a6a41879400 100644
--- a/storage/innobase/row/row0sel.cc
+++ b/storage/innobase/row/row0sel.cc
@@ -3493,7 +3493,7 @@ Row_sel_get_clust_rec_for_mysql::operator()(
 			const lsn_t lsn = mach_read_from_8(
 				page_align(clust_rec) + FIL_PAGE_LSN);
 
-			if (lsn != cached_lsn
+			if (true || lsn != cached_lsn
 			    || bpage.id() != cached_page_id
 			    || clust_rec != cached_clust_rec) {
 				/* The following call returns 'offsets' associated with

The CHECK TABLE would occasionally fail also with this patch, again with 2 rows counted in the secondary index instead of 1.

Comment by Matthias Leich [ 2022-02-04 ]

Replay on
origin/10.6 ba5ef63ae1da74a90dcdd798bf12c4ca7c9c720f 2022-01-12T12:34:07+02:00  10.6.6
 
No replay on
origin/10.5 2776635cb98d35867447d375fdc04a44ef11a697 2021-12-16T10:53:32+11:00  10.5.14
within 300 attempts.

Comment by Marko Mäkelä [ 2022-02-04 ]

The test does not fail for me on 10.2 or 10.5, but it fails on 10.6.

In the rr replay trace that I analyzed, the delete-marked record that contains the empty string was wrongly counted. There were exactly 3 invocations of Row_sel_get_clust_rec_for_mysql::operator(), and the very first one was for the delete-marked old record version containing the empty string.

Comment by Marko Mäkelä [ 2022-02-04 ]

The regression was caused somewhere between mariadb-10.6.4 and mariadb-10.6.5.

mariadb-10.6.4

innodb.mdev-27746 'innodb'               w19 [ 1000 pass ]      8
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 208.757 of 344 seconds executing testcases
 
Completed: All 23000 tests were successful.

I will try to find the commit that made this error reproducible.

Comment by Marko Mäkelä [ 2022-02-04 ]

It turns out that this error was previously masked by MDEV-4750, which I essentially reverted in 10.6.5. If I add STATS_PERSISTENT=1 to the table definition, the test will fail in mariadb-10.6.4 just fine:

CURRENT_TEST: innodb.mdev-27746
--- /mariadb/10.6/mysql-test/suite/innodb/r/mdev-27746.result	2022-02-04 15:44:10.800580936 +0200
+++ /mariadb/10.6/mysql-test/suite/innodb/r/mdev-27746.reject	2022-02-04 15:47:24.807168863 +0200
@@ -6,5 +6,6 @@
 UPDATE t1 SET col_text = REPEAT('y', @fill_amount);
 CHECK TABLE t1 ;
 Table	Op	Msg_type	Msg_text
-test.t1	check	status	OK
+test.t1	check	Warning	InnoDB: Index 'col_text' contains 2 entries, should be 1.
+test.t1	check	error	Corrupt
 DROP TABLE t1;
 
mysqltest: Result length mismatch

It still does not fail in 10.5 for me.

Comment by Marko Mäkelä [ 2022-02-04 ]

This fails already in MariaDB 10.6.0 (not every time, but with enough attempts).

--source include/have_innodb.inc
CREATE TABLE t1 (col_text TEXT NOT NULL, KEY (col_text(9)))
STATS_PERSISTENT=1 ENGINE=InnoDB;
SET @fill_amount = (@@innodb_page_size / 2 ) + 1 ;
INSERT INTO t1 (col_text) VALUES (REPEAT('x', @fill_amount));
UPDATE t1 SET col_text='';
UPDATE t1 SET col_text = REPEAT('y', @fill_amount);
CHECK TABLE t1 ;
DROP TABLE t1;

If the purge of history is disabled (innodb_force_recovery=2), it will not fail.

Comment by Marko Mäkelä [ 2022-02-04 ]

The test started to fail with a fairly high-level change that reduced the use of dict_sys.mutex.

I will not claim that MDEV-25029 caused this bug; I think that it only revealed a race condition in the existing secondary index MVCC logic, or made this race condition easier to reproduce.

Only after the rr replay trace has been analyzed deeper, we can say why exactly we wrongly decided that the delete-marked secondary index record '' exists in the current read view.

I am assigning this to vlad.lesin for final analysis, because he already spent a few days analyzing a very similar failure that popped up when his change MDEV-20605 was being tested.

Comment by Matthias Leich [ 2022-02-04 ]

Useful error patterns for RQG
#   [ 'CT_Crash', 'Some pointers may be invalid..{1,150}Query .{1,100}C.{1,20}H.{1,20}E.{1,20}C.{1,20}K.{1,3000}Connection ID' ],
#   [ 'TBR-1348', '\[ERROR\] InnoDB: Flagged corruption of .{1,100} in table .{1,150} in CHECK TABLE; Wrong count.+\[ERROR\] InnoDB: Flagged corruption of .{1,100} in table .{1,150} in CHECK TABLE; Wrong count.+RESULT: The RQG run ended with status STATUS_SERVER_CRASHED' ],
#   [ 'TBR-1347', '\[ERROR\] InnoDB: Flagged corruption of .{1,100} in table .{1,150} in CHECK TABLE; Wrong count.+\[Warning\] InnoDB: CHECK TABLE on index .{1,100} of table .{1,150} returned Data structure corruption.+RESULT: The RQG run ended with status STATUS_SERVER_CRASHED' ],
#   [ 'TBR-215', 'InnoDB: Index .{1,32} is marked as corrupted' ],
#   [ 'TBR-36', 'InnoDB: Index .{1,30} contains .{1,10} entries, should be .{1,10}.' ],
#   [ 'TBR-36', 'failed: 1712 Index .* is corrupted' ],
#   [ 'TBR-36', 'failed: 1712 , errstr: Index .{1,100} is corrupted' ],

Comment by Vladislav Lesin [ 2022-02-05 ]

I repeated and recorded the issue locally on 10.6 with Marko's test. I debugged row_search_mvcc() for CHECK TABLE's index scanning. Clustered index scanning looks good, the index contains one record with correct link to BLOB and trx_id.

(DB_ROW_ID=0x000000000205, DB_TRX_ID=0x000000000066, DB_ROLL_PTR=0x2f0000015d0110, col_text={(space_id=0x00000a, page=0x00000006, offset=0x00000026,...), points to: {len= 0x00002001, 0xff, 0xff, 0xff, 0xff, 0x79('y') <repeats 192 times>}})

The first found record during secondary index scanning is:

(col_text='', DB_ROW_ID=0x000000000205) - delete-marked
p/x rec[-6]@12
$29 = {0x0, 0x20, 0x0, 0x18, 0xff, 0xeb, 0x0, 0x0, 0x0, 0x0, 0x2, 0x5}

As CHECK TABLE's readview can see the changes of transaction with id equal to the secondary index page transaction id, clustered index record is read. As the readview can see clustered index record, and the secondary index record is delete-marked, the BLOB fields are compared.

As we compare the empty field of secondary index(with length equal to 0) with non-empty BLOB field of clustered index, row_sel_sec_rec_is_for_clust_rec() must return DB_SUCCESS with the following calls trace:

▾ row_sel_sec_rec_is_for_clust_rec                                                                                                
  ▾ Row_sel_get_clust_rec_for_mysql::operator()                                 
    ▸ row_search_mvcc

but it returns DB_SUCCESS_LOCKED_REC because of the following code:

static                                                                          
dberr_t                                                                         
row_sel_sec_rec_is_for_clust_rec(...)                                           
{                                                                               
...                                                                             
                                                                                
                        if (ulint prefix_len = ifield->prefix_len) {            
                                len = dtype_get_at_most_n_mbchars(              
                                        col->prtype, col->mbminlen,             
                                        col->mbmaxlen, prefix_len, len,         
                                        reinterpret_cast<const char*>(          
                                                clust_field));                  
                                if (len < sec_len) {                                                                              
                                        goto check_for_blob;                    
                                }                                               
                        } else {                                                
check_for_blob:                                                                 
                                if (rec_offs_nth_extern(clust_offs,             
                                                        clust_pos)) {           
                                        if (!row_sel_sec_rec_is_for_blob(       
                                                    col->mtype, col->prtype,    
                                                    col->mbminlen,              
                                                    col->mbmaxlen,              
                                                    clust_field, clust_len,     
                                                    sec_field, sec_len,         
                                                    prefix_len,                                                                   
                                                    clust_index->table)) {      
                                                return DB_SUCCESS;              
                                        }                                                                                         
                                                                                                                                  
                                        continue;                                                                                 
                                }                                               
                        }                                                       
                }                                                               
                                                                                
                if (0 != cmp_data_data(col->mtype, col->prtype,                 
                                       clust_field, len,                        
                                       sec_field, sec_len)) {
                        return DB_SUCCESS;                                      
                }                                                               
...                                                                             
}

I suppose that for BLOB comparison row_sel_sec_rec_is_for_blob() must be invoked. But in the above code "if (ulint prefix_len = ifield->prefix_len)" branch is executed. Then dtype_get_at_most_n_mbchars(...) returns 0 and then this 0 is compared with sec_len, which is 0 too. So instead of row_sel_sec_rec_is_for_blob() cmp_data_data() is executed, which, in turns, returns 0 as both len and sec_len equal to 0. The secondary index record length is counted from offsets array, which is initialized in rec_init_offsets_comp_ordinary(), and the length of the BLOB prefix field is read from rec[-6] byte, which is 0. According to the above, empty BLOB prefix field from secondary index equals to non-empty BLOB field in clustered record. Commit 25ed665a is under suspicion.

Why dtype_get_at_most_n_mbchars(...) returns 0, i.e. why BLOB's field len is 0? That's because of the following code in row_sel_sec_rec_is_for_clust_rec():

                        len = clust_len;                                        
                        if (rec_offs_nth_extern(clust_offs, clust_pos)) {       
                                len -= BTR_EXTERN_FIELD_REF_SIZE;               

For our case clust_len equals to BTR_EXTERN_FIELD_REF_SIZE, so len = 0, and it's passed as data_len argument in dtype_get_at_most_n_mbchars(...), that is why it returns 0.

As row_sel_sec_rec_is_for_clust_rec() treats empty BLOB prefix field in secondary index equal to any external BLOB field in clustered index and returns DB_SUCCESS_LOCKED_REC, Row_sel_get_clust_rec_for_mysql::operator() doesn't zerro out clustered record pointer in row_search_mvcc(), and row_search_mvcc() thinks that delete-marked secondary index record has visible for CHECK TABLES read view old-versioned clustered index record, and row_scan_index_for_mysql() counts it as a row.

Comment by Marko Mäkelä [ 2022-02-08 ]

If a BLOB pointer is present, then the record should not match a zero-length column prefix. This is because dtuple_convert_big_rec() should never create a BLOB for a column that is shorter than BTR_EXTERN_FIELD_REF_SIZE (20) bytes.

Furthermore, if the BTR_EXTERN_LEN is not at least as long as the column prefix in the secondary index record, a match should be impossible, and we can avoid fetching the BLOB.

Comment by Vladislav Lesin [ 2022-02-09 ]

I understood why the mtr test case is unstable. The difference between failed and passed tests is that CHECK TABLE's readview sees secondary index record with empty BLOB for passed test and doesn't see it for failed test(that it why it requests clustered index and compares BLOB fields). For the failed test there was active rw-transaction when CHECK TABLE's readview was created - dict statistics was being updated. That is why STATS_PERSISTENT=1 influenced test result. To make it stable is enough to have active rw-transaction when CHECK TABLE is executed. Here is stable test case: MDEV-27746.test.

Comment by Vladislav Lesin [ 2022-02-09 ]

Pushed bb-10.6-MDEV-27746-emty-blob.

Comment by Matthias Leich [ 2022-02-10 ]

origin/bb-10.6-MDEV-27746-emty-blob 04589516851124d33746ef89c9ed0c47f074bb62 2022-02-10T11:49:06+03:00
behaved well in RQG testing.

Generated at Thu Feb 08 09:55:16 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.