[MDEV-27746] Wrong comparision of BLOB's empty preffix with non-preffixed BLOB causes rows count mismatch for clustered and secondary indexes during non-locking read Created: 2022-02-04 Updated: 2022-02-15 Resolved: 2022-02-11 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.6.0, 10.6.1, 10.6.2, 10.6.3, 10.6.4, 10.6.5, 10.6.6, 10.7.2, 10.6, 10.8 |
| Fix Version/s: | 10.2.44, 10.3.35, 10.4.25, 10.5.16, 10.6.8, 10.7.4, 10.8.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Matthias Leich | Assignee: | Vladislav Lesin |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | BLOB, corruption, purge, race, regression-10.6, rr-profile-analyzed | ||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Description |
|
|
| Comments |
| Comment by Marko Mäkelä [ 2022-02-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I can repeat this on a 10.8 based branch. Here is a little simpler test case:
We need a non-default (non-strict) sql_mode that will allow a NULL value to be inserted into a NOT NULL column. If I make the REPEAT arguments shorter, so that the record will fit in the clustered index page, then CHECK TABLE will not fail. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-02-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The failure is not repeatable if I disable the purge of history (innodb_force_recovery=2), or if I add wait_all_purged.inc at any point between INSERT and CHECK TABLE, or if I add a server restart before the CHECK TABLE. The test does fail if I replace the invalid NULL with the empty string ''. Here is a simpler test that I think removes any doubt of the problem being outside InnoDB:
mleich, can you please try to generate an rr replay trace of this? My attempts failed so far. I expect that 10.6 and likely also older versions are affected. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Matthias Leich [ 2022-02-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
sdp:/dev/shm/vardir/86/log/mysqld.1.rr | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-02-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I created a core dump:
In the buffer pool of the core dump, I find the following in the clustered index: So far, it looks good. Let us dereference the DB_ROLL_PTR and fetch the 2 previous versions.
We can see a next-record pointer to 0x5a8, and at 0x5a6 the back-pointer to the start of this record. Next, let us dereference the DB_ROLL_PTR of the previous version. On page 0x137 at byte offset 0x2a2 we can find the following:
The undo log record type is 0x8c (TRX_UNDO_UPD_EXIST_REC | TRX_UNDO_UPD_EXTERN), the undo number is again 0, and this is again table 0x12 and the info_bits are 0. Again, let us dereference the DB_ROLL_PTR of the previous version. On page 0x136 at byte offset 0x27d we can find the following:
This is the first version of the record, as expected, of the type TRX_UNDO_INSERT_REC=11. I think that an analysis of an rr replay trace may be needed to find out which of the 2 delete-marked records in the secondary index we counted as existing in the current read view, and why. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-02-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
To check if
The CHECK TABLE would occasionally fail also with this patch, again with 2 rows counted in the secondary index instead of 1. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Matthias Leich [ 2022-02-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-02-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The test does not fail for me on 10.2 or 10.5, but it fails on 10.6. In the rr replay trace that I analyzed, the delete-marked record that contains the empty string was wrongly counted. There were exactly 3 invocations of Row_sel_get_clust_rec_for_mysql::operator(), and the very first one was for the delete-marked old record version containing the empty string. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-02-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The regression was caused somewhere between mariadb-10.6.4 and mariadb-10.6.5.
I will try to find the commit that made this error reproducible. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-02-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
It turns out that this error was previously masked by
It still does not fail in 10.5 for me. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-02-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
This fails already in MariaDB 10.6.0 (not every time, but with enough attempts).
If the purge of history is disabled (innodb_force_recovery=2), it will not fail. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-02-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The test started to fail with a fairly high-level change that reduced the use of dict_sys.mutex. I will not claim that Only after the rr replay trace has been analyzed deeper, we can say why exactly we wrongly decided that the delete-marked secondary index record '' exists in the current read view. I am assigning this to vlad.lesin for final analysis, because he already spent a few days analyzing a very similar failure that popped up when his change | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Matthias Leich [ 2022-02-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Lesin [ 2022-02-05 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I repeated and recorded the issue locally on 10.6 with Marko's test. I debugged row_search_mvcc() for CHECK TABLE's index scanning. Clustered index scanning looks good, the index contains one record with correct link to BLOB and trx_id.
The first found record during secondary index scanning is:
As CHECK TABLE's readview can see the changes of transaction with id equal to the secondary index page transaction id, clustered index record is read. As the readview can see clustered index record, and the secondary index record is delete-marked, the BLOB fields are compared. As we compare the empty field of secondary index(with length equal to 0) with non-empty BLOB field of clustered index, row_sel_sec_rec_is_for_clust_rec() must return DB_SUCCESS with the following calls trace:
but it returns DB_SUCCESS_LOCKED_REC because of the following code:
I suppose that for BLOB comparison row_sel_sec_rec_is_for_blob() must be invoked. But in the above code "if (ulint prefix_len = ifield->prefix_len)" branch is executed. Then dtype_get_at_most_n_mbchars(...) returns 0 and then this 0 is compared with sec_len, which is 0 too. So instead of row_sel_sec_rec_is_for_blob() cmp_data_data() is executed, which, in turns, returns 0 as both len and sec_len equal to 0. The secondary index record length is counted from offsets array, which is initialized in rec_init_offsets_comp_ordinary(), and the length of the BLOB prefix field is read from rec[-6] byte, which is 0. According to the above, empty BLOB prefix field from secondary index equals to non-empty BLOB field in clustered record. Commit 25ed665a is under suspicion. Why dtype_get_at_most_n_mbchars(...) returns 0, i.e. why BLOB's field len is 0? That's because of the following code in row_sel_sec_rec_is_for_clust_rec():
For our case clust_len equals to BTR_EXTERN_FIELD_REF_SIZE, so len = 0, and it's passed as data_len argument in dtype_get_at_most_n_mbchars(...), that is why it returns 0. As row_sel_sec_rec_is_for_clust_rec() treats empty BLOB prefix field in secondary index equal to any external BLOB field in clustered index and returns DB_SUCCESS_LOCKED_REC, Row_sel_get_clust_rec_for_mysql::operator() doesn't zerro out clustered record pointer in row_search_mvcc(), and row_search_mvcc() thinks that delete-marked secondary index record has visible for CHECK TABLES read view old-versioned clustered index record, and row_scan_index_for_mysql() counts it as a row. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-02-08 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
If a BLOB pointer is present, then the record should not match a zero-length column prefix. This is because dtuple_convert_big_rec() should never create a BLOB for a column that is shorter than BTR_EXTERN_FIELD_REF_SIZE (20) bytes. Furthermore, if the BTR_EXTERN_LEN is not at least as long as the column prefix in the secondary index record, a match should be impossible, and we can avoid fetching the BLOB. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Lesin [ 2022-02-09 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I understood why the mtr test case is unstable. The difference between failed and passed tests is that CHECK TABLE's readview sees secondary index record with empty BLOB for passed test and doesn't see it for failed test(that it why it requests clustered index and compares BLOB fields). For the failed test there was active rw-transaction when CHECK TABLE's readview was created - dict statistics was being updated. That is why STATS_PERSISTENT=1 influenced test result. To make it stable is enough to have active rw-transaction when CHECK TABLE is executed. Here is stable test case: MDEV-27746.test | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Lesin [ 2022-02-09 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Pushed bb-10.6- | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Matthias Leich [ 2022-02-10 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
origin/bb-10.6- |