[MDEV-34690] lock_rec_unlock_unmodified() causes deadlock - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.6
Fix Version/s: 10.6.20, 10.11.10, 11.2.6, 11.4.4
Component/s: Storage Engine - InnoDB
Labels:
None

Description

lock_rec_unlock_unmodified() is executed either under lock_sys.wr_lock() or under a combination of lock_sys.rd_lock() + record locks hash table cell latch. It also requests page latch to check if locked records were changed by the current transaction or not.

Usually InnoDB requests page latch to find the certain record on the page, and then requests lock_sys and/or record lock hash cell latch to request record lock. lock_rec_unlock_unmodified() requests the latches in the opposite order, what causes deadlocks. One of the possible scenario for the deadlock is the following:

thread 1 - lock_rec_unlock_unmodified() is invoked under locks hash table cell latch, the latch is acquired;
thread 2 - purge thread acquires page latch and tries to remove delete-marked record, it invokes lock_update_delete(), which requests locks hash table cell latch, held by thread 1;
thread 1 - requests page latch, held by thread 2.

The above scenario can be reproduced with the attached test case: test.tar.gz .

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

test.tar.gz
2 kB
2024-08-01 19:26

Issue Links

blocks

MDEV-34466 XA prepare don't release unmodified records in non-blocking mode

Closed

is caused by

MDEV-33454 release row locks for non-modified rows at XA PREPARE

Closed

is duplicated by

MDEV-34777 Wrong page mode during getting page from buffer pool for lock_rec_unlock_unmodified()

Closed

relates to

MDEV-35228 Have a configuration variable to enable/disable lock release at XA PREPARE

Open

Activity

Ascending order - Click to sort in descending order

View 10 older comments

Vladislav Lesin added a comment - 2024-10-16 09:21

Update:

susil.behera developed grammar for RQG testing, which is specific to MDEV-32020 case, he launched the testing and executed non-blocking SELECT from console, and the SELECT hung sporadically. I reproduced the hanging, recorded it with rr and analysed it.

The reason of the hanging is the wrong latching order, i.e. similar to the scenario from the description of the current Jira ticket.

The common rule for InnoDB is that page latch must precede lock_sys or lock_sys + lock hash cell latch. I tried to fix the current issue with getting block after releasing lock_sys, and reacquiring lock_sys and probably lock hash cell latch in this commit in lock_release_on_prepare(_try)(). And it works.

But there is one more place where we get page latch under lock_sys( + lock hash cell) latch. This is row_vers_impl_x_locked() invocation from lock_rec_unlock_unmodified():

#4  sux_lock<ssux_lock_impl<true> >::s_lock (this=this@entry=0x53ee3252e2b8) at /10.6-MDEV-34466/storage/innobase/include/sux_lock.h:396

#5  0x00005b42f5ab18d2 in buf_page_get_low (page_id=..., page_id@entry=..., zip_size=zip_size@entry=0, rw_latch=rw_latch@entry=RW_S_LATCH, guess=guess@entry=0x53ee3252e2a0, mode=mode@entry=10,

    mtr=mtr@entry=0x68347f80, err=0x683475bc, allow_ibuf_merge=false) at /10.6-MDEV-34466/storage/innobase/buf/buf0buf.cc:2993

#6  0x00005b42f5ab23d5 in buf_page_get_gen (page_id=page_id@entry=..., zip_size=zip_size@entry=0, rw_latch=rw_latch@entry=RW_S_LATCH, guess=guess@entry=0x53ee3252e2a0, mode=mode@entry=10,

    mtr=mtr@entry=0x68347f80, err=0x683475bc, allow_ibuf_merge=false) at /10.6-MDEV-34466/storage/innobase/buf/buf0buf.cc:3061

#7  0x00005b42f5a86777 in btr_cur_t::search_leaf (this=this@entry=0x68347b40, tuple=tuple@entry=0x53ee2c02c8b8, mode=mode@entry=PAGE_CUR_LE, latch_mode=latch_mode@entry=BTR_SEARCH_LEAF, mtr=mtr@entry=0x68347f80)

    at /10.6-MDEV-34466/storage/innobase/btr/btr0cur.cc:1238

#8  0x00005b42f59a9997 in btr_pcur_open (mtr=0x68347f80, cursor=0x68347b40, latch_mode=BTR_SEARCH_LEAF, mode=PAGE_CUR_LE, tuple=0x53ee2c02c8b8) at /10.6-MDEV-34466/storage/innobase/include/btr0pcur.h:430

#9  row_search_on_row_ref (pcur=pcur@entry=0x68347b40, mode=mode@entry=BTR_SEARCH_LEAF, table=table@entry=0x53ee2c01fbe8, ref=0x53ee2c02c8b8, mtr=mtr@entry=0x68347f80)

    at /10.6-MDEV-34466/storage/innobase/row/row0row.cc:1219

#10 0x00005b42f59ad9a4 in row_get_clust_rec (mode=mode@entry=BTR_SEARCH_LEAF, rec=rec@entry=0x53ee32c04fae "\200", index=index@entry=0x53ee2c022fe8, clust_index=clust_index@entry=0x68347cf8,

    mtr=mtr@entry=0x68347f80) at /10.6-MDEV-34466/storage/innobase/row/row0row.cc:1255

#11 0x00005b42f59e369f in row_vers_impl_x_locked (caller_trx=caller_trx@entry=0x742a5a5b0c00, rec=rec@entry=0x53ee32c04fae "\200", index=index@entry=0x53ee2c022fe8, offsets=offsets@entry=0x68348440)

    at /10.6-MDEV-34466/storage/innobase/row/row0vers.cc:417

#12 0x00005b42f5858484 in lock_sec_rec_some_has_impl (caller_trx=0x742a5a5b0c00, rec=rec@entry=0x53ee32c04fae "\200", index=index@entry=0x53ee2c022fe8, offsets=0x68348440)

    at /10.6-MDEV-34466/storage/innobase/lock/lock0lock.cc:1235

#13 0x00005b42f5863e53 in lock_rec_unlock_unmodified (block=block@entry=0x53ee3252e370, cell=..., lock=lock@entry=0x5b42f85620e8, offsets=@0x68348430: 0x68348440, heap=@0x68348438: 0x0)

    at /10.6-MDEV-34466/storage/innobase/lock/lock0lock.cc:4556

Note that lock_rec_unlock_unmodified() is invoked under lock_sys(+ lock hash cell) latch. So we have wrong latching order here again, i.e. the whole InnoDB code acquires page latch before locks_sys(+ lock hash cell) latch, but here we acquire lock_sys(+ lock hash cell) latch before page latch, what causes deadlock.

The fix is to invoke lock_sec_rec_some_has_impl() without lock_sys latch from lock_rec_unlock_unmodified(), and reacquire the latch after the call.

Vladislav Lesin added a comment - 2024-10-16 09:21 Update: susil.behera developed grammar for RQG testing, which is specific to MDEV-32020 case, he launched the testing and executed non-blocking SELECT from console, and the SELECT hung sporadically. I reproduced the hanging, recorded it with rr and analysed it. The reason of the hanging is the wrong latching order, i.e. similar to the scenario from the description of the current Jira ticket. The common rule for InnoDB is that page latch must precede lock_sys or lock_sys + lock hash cell latch. I tried to fix the current issue with getting block after releasing lock_sys, and reacquiring lock_sys and probably lock hash cell latch in this commit in lock_release_on_prepare(_try)(). And it works. But there is one more place where we get page latch under lock_sys( + lock hash cell) latch. This is row_vers_impl_x_locked() invocation from lock_rec_unlock_unmodified(): #4 sux_lock<ssux_lock_impl<true> >::s_lock (this=this@entry=0x53ee3252e2b8) at /10.6-MDEV-34466/storage/innobase/include/sux_lock.h:396 #5 0x00005b42f5ab18d2 in buf_page_get_low (page_id=..., page_id@entry=..., zip_size=zip_size@entry=0, rw_latch=rw_latch@entry=RW_S_LATCH, guess=guess@entry=0x53ee3252e2a0, mode=mode@entry=10, mtr=mtr@entry=0x68347f80, err=0x683475bc, allow_ibuf_merge=false) at /10.6-MDEV-34466/storage/innobase/buf/buf0buf.cc:2993 #6 0x00005b42f5ab23d5 in buf_page_get_gen (page_id=page_id@entry=..., zip_size=zip_size@entry=0, rw_latch=rw_latch@entry=RW_S_LATCH, guess=guess@entry=0x53ee3252e2a0, mode=mode@entry=10, mtr=mtr@entry=0x68347f80, err=0x683475bc, allow_ibuf_merge=false) at /10.6-MDEV-34466/storage/innobase/buf/buf0buf.cc:3061 #7 0x00005b42f5a86777 in btr_cur_t::search_leaf (this=this@entry=0x68347b40, tuple=tuple@entry=0x53ee2c02c8b8, mode=mode@entry=PAGE_CUR_LE, latch_mode=latch_mode@entry=BTR_SEARCH_LEAF, mtr=mtr@entry=0x68347f80) at /10.6-MDEV-34466/storage/innobase/btr/btr0cur.cc:1238 #8 0x00005b42f59a9997 in btr_pcur_open (mtr=0x68347f80, cursor=0x68347b40, latch_mode=BTR_SEARCH_LEAF, mode=PAGE_CUR_LE, tuple=0x53ee2c02c8b8) at /10.6-MDEV-34466/storage/innobase/include/btr0pcur.h:430 #9 row_search_on_row_ref (pcur=pcur@entry=0x68347b40, mode=mode@entry=BTR_SEARCH_LEAF, table=table@entry=0x53ee2c01fbe8, ref=0x53ee2c02c8b8, mtr=mtr@entry=0x68347f80) at /10.6-MDEV-34466/storage/innobase/row/row0row.cc:1219 #10 0x00005b42f59ad9a4 in row_get_clust_rec (mode=mode@entry=BTR_SEARCH_LEAF, rec=rec@entry=0x53ee32c04fae "\200", index=index@entry=0x53ee2c022fe8, clust_index=clust_index@entry=0x68347cf8, mtr=mtr@entry=0x68347f80) at /10.6-MDEV-34466/storage/innobase/row/row0row.cc:1255 #11 0x00005b42f59e369f in row_vers_impl_x_locked (caller_trx=caller_trx@entry=0x742a5a5b0c00, rec=rec@entry=0x53ee32c04fae "\200", index=index@entry=0x53ee2c022fe8, offsets=offsets@entry=0x68348440) at /10.6-MDEV-34466/storage/innobase/row/row0vers.cc:417 #12 0x00005b42f5858484 in lock_sec_rec_some_has_impl (caller_trx=0x742a5a5b0c00, rec=rec@entry=0x53ee32c04fae "\200", index=index@entry=0x53ee2c022fe8, offsets=0x68348440) at /10.6-MDEV-34466/storage/innobase/lock/lock0lock.cc:1235 #13 0x00005b42f5863e53 in lock_rec_unlock_unmodified (block=block@entry=0x53ee3252e370, cell=..., lock=lock@entry=0x5b42f85620e8, offsets=@0x68348430: 0x68348440, heap=@0x68348438: 0x0) at /10.6-MDEV-34466/storage/innobase/lock/lock0lock.cc:4556 Note that lock_rec_unlock_unmodified() is invoked under lock_sys(+ lock hash cell) latch. So we have wrong latching order here again, i.e. the whole InnoDB code acquires page latch before locks_sys(+ lock hash cell) latch, but here we acquire lock_sys(+ lock hash cell) latch before page latch, what causes deadlock. The fix is to invoke lock_sec_rec_some_has_impl() without lock_sys latch from lock_rec_unlock_unmodified(), and reacquire the latch after the call.

Vladislav Lesin added a comment - 2024-10-17 10:18

Update:
I have pushed a fix for the above bug, it's under testing now.

Vladislav Lesin added a comment - 2024-10-17 10:18 Update: I have pushed a fix for the above bug, it's under testing now.

Vladislav Lesin added a comment - 2024-10-21 08:11

Update:
This bug was fixed, and the fix passed Susil's testing. I requested code review from Marko, as there are significant changed in comparison with the previous commit, which was approved by Marko. Marko provided code review, and I am addressing his code review notes now. After that there will be one more RQG testing round.

Vladislav Lesin added a comment - 2024-10-21 08:11 Update: This bug was fixed, and the fix passed Susil's testing. I requested code review from Marko, as there are significant changed in comparison with the previous commit, which was approved by Marko. Marko provided code review, and I am addressing his code review notes now. After that there will be one more RQG testing round.

Vladislav Lesin added a comment - 2024-10-23 10:45

Matthias has tested the branch, which contains both this issue and ~~MDEV-34466~~ fixes.

Vladislav Lesin added a comment - 2024-10-23 10:45 Matthias has tested the branch, which contains both this issue and MDEV-34466 fixes.

Marko Mäkelä added a comment - 2024-10-28 08:00

MDEV-35228 was filed for a suspected performance regression due to this logic.

Marko Mäkelä added a comment - 2024-10-28 08:00 MDEV-35228 was filed for a suspected performance regression due to this logic.

MariaDB Server

lock_rec_unlock_unmodified() causes deadlock

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration