[MDEV-30989] MariaDB InnoDB Deadlock after upgrading to 10.6.12 Created: 2023-04-03  Updated: 2023-06-12  Resolved: 2023-06-12

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6.12
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Sebastian Stamm Assignee: Marko Mäkelä
Resolution: Duplicate Votes: 1
Labels: None
Environment:

Oracle Linux 8.5


Attachments: Text File engine-status2.txt     File gdb-2023-04-22.7z     Text File gdb-log.txt    
Issue Links:
Duplicate
duplicates MDEV-29835 Partial server freeze Closed

 Description   

One week after upgrading from 10.6.11 to 10.6.12, the database hung.
The error log was full of:
2023-03-26 3:47:42 0 [Warning] Aborted connection 0 to db: 'unconnected' user: 'unauthenticated' host: 'connecting host' (Too many connections)

systemctl was not able to kill/restart the service (I had to sudo kill ......)

A few days later it happened again; the engine status output and a gdb stack trace are attached.

Also opened a support case: CS0555748



 Comments   
Comment by Marko Mäkelä [ 2023-04-03 ]

sstamm, thank you for the report. This could be a duplicate of MDEV-29835, but I do not have enough information to say for sure. The suspected culprit would be this thread:

mariadb-10.6.12

Thread 53 (Thread 0x7f7eccb41700 (LWP 3670974)):
#0  0x00007fc08eb9f9bd in syscall () from target:/lib64/libc.so.6
#1  0x0000557789b5f770 in ssux_lock_impl<true>::wait (lk=<optimized out>, this=0x7fa5e0065878) at /usr/src/debug/MariaDB-/src_0/storage/innobase/sync/srw_lock.cc:244
#2  ssux_lock_impl<true>::wr_wait (this=this@entry=0x7fa5e0065878, lk=<optimized out>) at /usr/src/debug/MariaDB-/src_0/storage/innobase/sync/srw_lock.cc:378
#3  0x000055778945fd7f in ssux_lock_impl<true>::wr_lock (this=0x7fa5e0065878) at /opt/rh/gcc-toolset-10/root/usr/include/c++/10/bits/atomic_base.h:420
#4  sux_lock<ssux_lock_impl<true> >::x_lock_upgraded (this=0x7fa5e0065878) at /usr/src/debug/MariaDB-/src_0/storage/innobase/include/sux_lock.h:428
#5  buf_page_get_low (page_id={m_id = 301090092363763}, zip_size=<optimized out>, rw_latch=<optimized out>, guess=<optimized out>, mode=<optimized out>, mtr=<optimized out>, err=<optimized out>, allow_ibuf_merge=<optimized out>) at /usr/src/debug/MariaDB-/src_0/storage/innobase/buf/buf0buf.cc:2849
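The wr_wait frame above is a thread parked inside x_lock_upgraded(), i.e. while promoting its shared page latch to exclusive. As a rough illustration of why such an upgrade must wait, here is a toy Python model (an assumption-laden simplification, not InnoDB's actual futex-based ssux_lock): the upgrading thread can only proceed once every other shared holder has released.

```python
import threading
import time

class SharedLock:
    """Toy reader-writer lock with a shared -> exclusive upgrade,
    loosely modeling InnoDB's sux_lock::x_lock_upgraded().
    Hypothetical simplification for illustration only."""

    def __init__(self):
        self.cond = threading.Condition()
        self.readers = 0       # number of shared holders
        self.writer = False

    def s_lock(self):
        with self.cond:
            while self.writer:
                self.cond.wait()
            self.readers += 1

    def s_unlock(self):
        with self.cond:
            self.readers -= 1
            self.cond.notify_all()

    def x_lock_upgraded(self):
        """Caller already holds one shared latch; wait until every
        *other* shared holder releases, then take it exclusively."""
        with self.cond:
            self.readers -= 1              # give up our own shared count
            while self.readers > 0 or self.writer:
                self.cond.wait()           # the hung thread parks here
            self.writer = True

    def x_unlock(self):
        with self.cond:
            self.writer = False
            self.cond.notify_all()

events = []
lock = SharedLock()
reader_in = threading.Event()

def other_reader():
    lock.s_lock()
    reader_in.set()                        # signal: shared latch is held
    time.sleep(0.05)                       # keep holding it for a while
    events.append("reader released")
    lock.s_unlock()

def upgrader():
    lock.s_lock()
    reader_in.wait()                       # ensure the other holder is in
    lock.x_lock_upgraded()                 # blocks until the reader releases
    events.append("upgrade acquired")
    lock.x_unlock()

t1 = threading.Thread(target=other_reader)
t2 = threading.Thread(target=upgrader)
t1.start(); t2.start()
t1.join(); t2.join()
print(events)   # the upgrade completes only after the reader releases
```

In this model the upgrade always succeeds eventually because the other shared holder releases unconditionally; the hang reported here arises when the other shared holders are themselves waiting on latches held by the upgrading thread.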

For a deeper analysis, in case you saved a core dump of the hung process, could you share the output of thread apply all backtrace full from the same hang? Or at least the output of the following:

thread 53
frame 7
print mtr.m_memo

Comment by Sebastian Stamm [ 2023-04-03 ]

I have to wait for the next occurrence, but I will try to get a thread apply all backtrace full.

Comment by Sebastian Stamm [ 2023-04-22 ]

Here it is: gdb-2023-04-22.7z

Comment by Marko Mäkelä [ 2023-04-24 ]

The file gdb-2023-04-22.txt in gdb-2023-04-22.7z contains a promising Thread 51, which is waiting for a page latch

#4  sux_lock<ssux_lock_impl<true> >::x_lock_upgraded (this=0x7f094c0c8f18)

in a re-entrant call to btr_cur_pessimistic_index(). A shared latch on the block descriptor 0x7f094c0c8f00 is being held by Thread 49 and Thread 16, both executing btr_cur_t::search_leaf(). Thread 49 is waiting for

#4  ssux_lock_impl<true>::rd_wait (this=this@entry=0x7f3404024698)

Thread 16 is waiting for

#4  ssux_lock_impl<true>::wr_lock (this=0x7f094c0c8e78)

Both these blocks are being held by Thread 51:

#14 0x0000562d5eaa6dfa in row_ins_clust_index_entry_low …
        mtr = {m_last = 0x7f094c0c9040, m_last_offset = 113, m_log_mode = 0, m_modifications = 1, m_made_dirty = 1, m_inside_ibuf = 0, m_trim_pages = 0, m_memo = {<small_vector_base> = {BeginX = 0x7efe2a154a50, Size = 11, Capacity = 16}, small = {{object = 0x7efd9e533948, type = MTR_MEMO_SX_LOCK}, {object = 0x7f3404024680, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f096c0c5c60, type = MTR_MEMO_PAGE_X_FIX}, {
                object = 0x7f094c0c7880, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f094c0c68e0, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f094c0c8e60, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7efca00653b8, type = MTR_MEMO_SPACE_X_LOCK}, {object = 0x7f3404024360, type = MTR_MEMO_PAGE_SX_FIX}, {object = 0x7f3404024540, type = MTR_MEMO_PAGE_SX_MODIFY}, {object = 0x7f1fd4049ac0, type = MTR_MEMO_PAGE_SX_MODIFY}, {
                object = 0x7f094c0c9040, type = MTR_MEMO_PAGE_X_MODIFY}, …

We can see exclusive latches held on both block descriptors by Thread 51: object = 0x7f094c0c8e60, type = MTR_MEMO_PAGE_X_FIX and object = 0x7f3404024680, type = MTR_MEMO_PAGE_X_FIX.

That is, Thread 51 is blocking both of the threads that hold a shared latch on a block for which Thread 51 itself is waiting to acquire an exclusive latch. This deadlock was fixed in MDEV-29835 by making sure that Thread 51 would acquire an exclusive dict_index_t::lock for any ‘tricky’ page split or merge. The btr_cur_t::search_leaf() that was introduced in MDEV-30400 fixed some of the hangs, but it seems to have made the remaining ones easier to hit in practice.
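The latch waits described above form a cycle in the wait-for graph, which is what makes this a true deadlock rather than a slow query. A minimal sketch (thread numbers and holder relationships taken from the gdb output above; the graph encoding and helper function are illustrative, not MariaDB code):

```python
# Wait-for graph reconstructed from the analysis above:
# Thread 51 waits to upgrade a latch that Threads 49 and 16 hold shared;
# Threads 49 and 16 each wait for a page that Thread 51 has X-latched.
waits_for = {
    51: [49, 16],
    49: [51],
    16: [51],
}

def find_cycle(graph):
    """Depth-first search for a wait-for cycle; returns one cycle
    as a list of thread ids (first == last), or None."""
    def dfs(node, path, on_path):
        for nxt in graph.get(node, ()):
            if nxt in on_path:
                return path[path.index(nxt):] + [nxt]
            found = dfs(nxt, path + [nxt], on_path | {nxt})
            if found:
                return found
        return None
    for start in graph:
        cycle = dfs(start, [start], {start})
        if cycle:
            return cycle
    return None

print(find_cycle(waits_for))   # one cycle through Thread 51, e.g. [51, 49, 51]
```

InnoDB's latch subsystem, unlike its row-lock subsystem, has no such cycle detector at runtime; that is why the server simply freezes instead of rolling back one of the participants.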

Generated at Thu Feb 08 10:20:26 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.