[MDEV-35508] Race condition between purge and secondary index INSERT or UPDATE

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 10.6.20, 10.11.10, 11.2.6, 11.4.4, 11.6.2, 11.7.1
Fix Version/s: 10.6.21, 10.11.11, 11.4.5, 11.7.2
Component/s: Storage Engine - InnoDB
Labels:

Description

mleich produced an rr replay trace of an MDEV-35049 development branch that shows this same error, but without the involvement of OPTIMIZE TABLE. Initially, I suspected that this failure was specific to that branch, because the branch includes a rewrite of how keys are searched for in B-tree pages.

ssh sdp

rr replay /data/results/1732217540/Marko-3/1/rr/latest-trace

This turned out to be a race condition. It took me a while to figure out how to debug it. Eventually, I set a hardware data watchpoint on the clustered index record (DB_ROW_ID=0x4b7) in the buffer pool, specifically on the last 4 bytes of the DB_ROLL_PTR field and the 4 bytes of the problematic col_int_key field, to catch what was going on. I also set a watchpoint on the delete-mark flag of the secondary index record (col_int_key,DB_ROW_ID)=(5,0x4b7).

The hardware data watchpoint on the clustered index record was hit by the following statements:

INSERT /*! IGNORE */ INTO table10000_innodb VALUES (5, 5), (5,5), …;

UPDATE table10000_innodb SET `col_int_key` = 4 /* E_R Thread6 QNO 8 CON_ID 23 */

UPDATE table10000_innodb SET `col_int_key` = 5 /* E_R Thread7 QNO 13 CON_ID 24 */

During the execution of the second UPDATE (transaction 0x1e, just a little too new to be included in purge_sys.view), the purge of the first UPDATE was blocked in the following:

10.6-MDEV-35049 36a8b44ebd96ec9a8d449c83248109d5e893f534
#17 log_free_check () at /data/Server/10.6-MDEV-35049E/storage/innobase/log/log0log.cc:956
#18 0x000063a8330d1a61 in row_purge_remove_sec_if_poss_tree (node=node@entry=0x63a8354a62b8, index=index@entry=0x7d1998069e78, entry=entry@entry=0x7d19a402df88, page_max_trx_id=page_max_trx_id@entry=0x1e)
at /data/Server/10.6-MDEV-35049E/storage/innobase/row/row0purge.cc:767
#19 0x000063a8330d2925 in row_purge_remove_sec_if_poss (node=node@entry=0x63a8354a62b8, index=0x7d1998069e78, entry=0x7d19a402df88) at /data/Server/10.6-MDEV-35049E/storage/innobase/row/row0purge.cc:991
#20 0x000063a8330d31c6 in row_purge_upd_exist_or_extern_func (thr=thr@entry=0x63a8354a6218, node=node@entry=0x63a8354a62b8, undo_rec=undo_rec@entry=0x7d19c51abdda ">\b\f\202\267\022")
at /data/Server/10.6-MDEV-35049E/storage/innobase/row/row0purge.cc:1157
#21 0x000063a8330d36a6 in row_purge_record_func (node=node@entry=0x63a8354a62b8, undo_rec=undo_rec@entry=0x7d19c51abdda ">\b\f\202\267\022", thr=thr@entry=0x63a8354a6218, updated_extern=0x0)
at /data/Server/10.6-MDEV-35049E/storage/innobase/row/row0purge.cc:1548
#22 0x000063a8330d3baa in row_purge (node=node@entry=0x63a8354a62b8, undo_rec=undo_rec@entry=0x7d19c51abdda ">\b\f\202\267\022", thr=thr@entry=0x63a8354a6218)
at /data/Server/10.6-MDEV-35049E/storage/innobase/row/row0purge.cc:1587
#23 0x000063a8330d3c09 in row_purge_step (thr=thr@entry=0x63a8354a6218) at /data/Server/10.6-MDEV-35049E/storage/innobase/row/row0purge.cc:1650

During this blockage, the second UPDATE both updated the clustered index record and removed the delete-mark on the secondary index record (5,0x4b7), which had been delete-marked by the first UPDATE. Purge would then report an error and hit ut_ad(0), crashing the debug-instrumented build:

2024-11-21 13:14:38 0 [ERROR] InnoDB: tried to purge non-delete-marked record in index `col_int_key` of table `test`.`table10000_innodb`: tuple: TUPLE (info_bits=0, 2 fields): {[4]    (0x80000005),[6]      (0x0000000004B7)}, record: COMPACT RECORD(info_bits=0, 2 fields): {[4]    (0x80000005),[6]      (0x0000000004B7)}

The problem turns out to be that MDEV-34515 introduced an unsafe optimization: if the PAGE_MAX_TRX_ID did not change between row_purge_remove_sec_if_poss_leaf() and row_purge_remove_sec_if_poss_tree(), the call to row_purge_poss_sec() would be skipped.

A more correct condition would be the following: If the PAGE_MAX_TRX_ID was not changed and it did not belong to an active transaction when row_purge_remove_sec_if_poss_leaf() was holding the secondary index leaf page latch, the check would be redundant.
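
The difference between the two conditions can be sketched in a small standalone model. This is not the InnoDB code or the attached fix: PurgeViewModel, sees(), skip_recheck_unsafe() and skip_recheck_safer() are hypothetical names, and treating "did not belong to an active transaction" as "is visible in the purge view" is an assumption made only for illustration.

```cpp
#include <cstdint>
#include <set>

using trx_id_t = std::uint64_t;

// Hypothetical stand-in for the purge view; NOT the InnoDB ReadView API.
struct PurgeViewModel {
  trx_id_t low_limit_id;          // IDs at or above this are too new for the view
  std::set<trx_id_t> active_ids;  // older IDs that were still active at view creation

  // True if transaction `id` counts as finished from this view's perspective.
  bool sees(trx_id_t id) const {
    return id < low_limit_id && active_ids.count(id) == 0;
  }
};

// The unsafe condition as described above: only compares the two
// PAGE_MAX_TRX_ID values observed by the leaf and tree passes.
bool skip_recheck_unsafe(trx_id_t max_at_leaf, trx_id_t max_at_tree) {
  return max_at_leaf == max_at_tree;
}

// The stricter condition (assumed modelling): the value must be unchanged
// AND must not have belonged to a possibly active transaction when the
// leaf page latch was held.
bool skip_recheck_safer(const PurgeViewModel& view,
                        trx_id_t max_at_leaf, trx_id_t max_at_tree) {
  return max_at_leaf == max_at_tree && view.sees(max_at_leaf);
}
```

With a view whose low_limit_id is 0x1e (so that transaction 0x1e is just too new to be included) and PAGE_MAX_TRX_ID equal to 0x1e in both passes, the unsafe predicate allows skipping the check while the stricter one does not, which matches the scenario in the trace.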

As far as I can tell, the impact of this bug is limited to some error log "spam" and the debug assertion failure. This should not cause any actual corruption.

Attachments

10.6_test_repeat.patch (4 kB, 2024-11-28 06:05, Debarun Banerjee)
10.6_fix.patch (1 kB, 2024-11-28 06:05, Debarun Banerjee)

Issue Links

causes: MDEV-35619 Assertion failure in row_purge_del_mark_error (Closed)
duplicates: MDEV-35829 galera node crash with race condition (Open)
is caused by: MDEV-34515 Contention between secondary index UPDATE and purge due to large innodb_purge_batch_size (Closed)

People

Assignee: Marko Mäkelä
Reporter: Marko Mäkelä
Votes: 0
Watchers: 2

Dates

Created: 2024-11-26 15:09
Updated: 2025-04-29 14:37
Resolved: 2024-11-29 09:17
