
MDEV-30989: MariaDB InnoDB Deadlock after upgrading to 10.6.12

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 10.6.12
    • Fix Version/s: N/A
    • Component/s: None
    • Environment: Oracle Linux 8.5

    Description

      One week after updating from 10.6.11 to 10.6.12, the database hung.
      The error log was full of:
      2023-03-26 3:47:42 0 [Warning] Aborted connection 0 to db: 'unconnected' user: 'unauthenticated' host: 'connecting host' (Too many connections)

      systemctl was not able to kill/restart the service (I had to sudo kill ......)

      Some days later it happened again; the engine status and a stack trace from gdb are attached.

      Also opened a support case: CS0555748

      Attachments

        1. engine-status2.txt
          210 kB
        2. gdb-2023-04-22.7z
          280 kB
        3. gdb-log.txt
          1.80 MB


          Activity

            marko Marko Mäkelä added a comment:

            sstamm, thank you for the report. This could be a duplicate of MDEV-29835, but I do not have enough information to say for sure. This thread would be the suspected culprit:

            mariadb-10.6.12

            Thread 53 (Thread 0x7f7eccb41700 (LWP 3670974)):
            #0  0x00007fc08eb9f9bd in syscall () from target:/lib64/libc.so.6
            #1  0x0000557789b5f770 in ssux_lock_impl<true>::wait (lk=<optimized out>, this=0x7fa5e0065878) at /usr/src/debug/MariaDB-/src_0/storage/innobase/sync/srw_lock.cc:244
            #2  ssux_lock_impl<true>::wr_wait (this=this@entry=0x7fa5e0065878, lk=<optimized out>) at /usr/src/debug/MariaDB-/src_0/storage/innobase/sync/srw_lock.cc:378
            #3  0x000055778945fd7f in ssux_lock_impl<true>::wr_lock (this=0x7fa5e0065878) at /opt/rh/gcc-toolset-10/root/usr/include/c++/10/bits/atomic_base.h:420
            #4  sux_lock<ssux_lock_impl<true> >::x_lock_upgraded (this=0x7fa5e0065878) at /usr/src/debug/MariaDB-/src_0/storage/innobase/include/sux_lock.h:428
            #5  buf_page_get_low (page_id={m_id = 301090092363763}, zip_size=<optimized out>, rw_latch=<optimized out>, guess=<optimized out>, mode=<optimized out>, mtr=<optimized out>, err=<optimized out>, allow_ibuf_merge=<optimized out>) at /usr/src/debug/MariaDB-/src_0/storage/innobase/buf/buf0buf.cc:2849
            

            For a deeper analysis, in case you saved a core dump of the hung process, could you share the output of thread apply all backtrace full from the same hang? Or at least the output of the following:

            thread 53
            frame 7
            print mtr.m_memo
            

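            One way to capture this non-interactively is to run gdb in batch mode against the hung process, or against a saved core dump together with the matching mariadbd binary. The process name, binary path and output file below are only a sketch of the usual defaults; the matching debuginfo packages need to be installed for the frames to be resolved, and attaching gdb stops the server for the duration of the dump.

            # live process (assumes the server binary is named mariadbd)
            gdb --batch -ex "set pagination off" \
                -ex "thread apply all backtrace full" \
                -p "$(pidof mariadbd)" > gdb-log.txt

            # saved core dump (paths are placeholders)
            gdb --batch -ex "set pagination off" \
                -ex "thread apply all backtrace full" \
                /usr/sbin/mariadbd /path/to/core > gdb-log.txt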

            sstamm Sebastian Stamm added a comment:

            I have to wait for the next occurrence, but I will try to get a thread apply all backtrace full.


            sstamm Sebastian Stamm added a comment:

            Here it is: gdb-2023-04-22.7z


            marko Marko Mäkelä added a comment:

            The file gdb-2023-04-22.txt in gdb-2023-04-22.7z contains a promising Thread 51, waiting for a page latch

            #4  sux_lock<ssux_lock_impl<true> >::x_lock_upgraded (this=0x7f094c0c8f18)
            

            in a re-entrant call to btr_cur_pessimistic_index(). A shared latch on the block descriptor 0x7f094c0c8f00 is being held by Thread 49 and Thread 16, both executing btr_cur_t::search_leaf(). Thread 49 is waiting for

            #4  ssux_lock_impl<true>::rd_wait (this=this@entry=0x7f3404024698)
            

            Thread 16 is waiting for

            #4  ssux_lock_impl<true>::wr_lock (this=0x7f094c0c8e78)
            

            Both these blocks are being held by Thread 51:

            #14 0x0000562d5eaa6dfa in row_ins_clust_index_entry_low …
                    mtr = {m_last = 0x7f094c0c9040, m_last_offset = 113, m_log_mode = 0, m_modifications = 1, m_made_dirty = 1, m_inside_ibuf = 0, m_trim_pages = 0, m_memo = {<small_vector_base> = {BeginX = 0x7efe2a154a50, Size = 11, Capacity = 16}, small = {{object = 0x7efd9e533948, type = MTR_MEMO_SX_LOCK}, {object = 0x7f3404024680, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f096c0c5c60, type = MTR_MEMO_PAGE_X_FIX}, {
                            object = 0x7f094c0c7880, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f094c0c68e0, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f094c0c8e60, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7efca00653b8, type = MTR_MEMO_SPACE_X_LOCK}, {object = 0x7f3404024360, type = MTR_MEMO_PAGE_SX_FIX}, {object = 0x7f3404024540, type = MTR_MEMO_PAGE_SX_MODIFY}, {object = 0x7f1fd4049ac0, type = MTR_MEMO_PAGE_SX_MODIFY}, {
                            object = 0x7f094c0c9040, type = MTR_MEMO_PAGE_X_MODIFY}, …
            

            We can see exclusive latches held on both block descriptors by Thread 51: object = 0x7f094c0c8e60, type = MTR_MEMO_PAGE_X_FIX and object = 0x7f3404024680, type = MTR_MEMO_PAGE_X_FIX.
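
            The wait addresses in the blocked frames can be matched against these m_memo entries by noting that each differs from a block descriptor address by the same small offset (presumably the position of the latch inside the block descriptor in this build):

            0x7f094c0c8f18 - 0x7f094c0c8f00 = 0x18   (Thread 51's wait address vs. the descriptor S-latched by Threads 49 and 16)
            0x7f094c0c8e78 - 0x7f094c0c8e60 = 0x18   (Thread 16's wait address vs. a block X-latched by Thread 51)
            0x7f3404024698 - 0x7f3404024680 = 0x18   (Thread 49's wait address vs. a block X-latched by Thread 51)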

            That is, Thread 51 is blocking both threads that are holding a shared latch on a block on which it is waiting for an exclusive latch. This deadlock was fixed in MDEV-29835 by making sure that Thread 51 would acquire an exclusive dict_index_t::lock for any ‘tricky’ page split or merge. The btr_cur_t::search_leaf() that was introduced in MDEV-30400 fixed some of the hangs, but it seems to have made the remaining hangs easier to hit in practice.
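
            For reference, the circular wait described above, spelled out with the addresses from the backtraces:

            Thread 51: waits for an exclusive (upgraded) latch on 0x7f094c0c8f00, which Threads 49 and 16 hold shared
            Thread 49: waits for a shared latch on 0x7f3404024680, which Thread 51 holds exclusively
            Thread 16: waits for an exclusive latch on 0x7f094c0c8e60, which Thread 51 holds exclusively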


            People

              Assignee: marko Marko Mäkelä
              Reporter: sstamm Sebastian Stamm
              Votes: 1
              Watchers: 5

