
MDEV-30989: MariaDB InnoDB Deadlock after upgrading to 10.6.12

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 10.6.12
    • Fix Version/s: N/A
    • Component/s: None
    • Environment: Oracle Linux 8.5

    Description

      One week after updating from 10.6.11 to 10.6.12, the database hung.
      The error log was full of:
      2023-03-26 3:47:42 0 [Warning] Aborted connection 0 to db: 'unconnected' user: 'unauthenticated' host: 'connecting host' (Too many connections)

      systemctl was not able to kill/restart the service (I had to sudo kill ......)

      Some days later it happened again; the engine status and a stack trace from gdb are attached.

      Also opened a support case: CS0555748

      Attachments

        1. engine-status2.txt
          210 kB
        2. gdb-2023-04-22.7z
          280 kB
        3. gdb-log.txt
          1.80 MB


          Activity

            marko Marko Mäkelä added a comment:

            sstamm, thank you for the report. This could be a duplicate of MDEV-29835, but I do not have enough information to say for sure. This thread would be the suspected culprit:

            mariadb-10.6.12

            Thread 53 (Thread 0x7f7eccb41700 (LWP 3670974)):
            #0  0x00007fc08eb9f9bd in syscall () from target:/lib64/libc.so.6
            #1  0x0000557789b5f770 in ssux_lock_impl<true>::wait (lk=<optimized out>, this=0x7fa5e0065878) at /usr/src/debug/MariaDB-/src_0/storage/innobase/sync/srw_lock.cc:244
            #2  ssux_lock_impl<true>::wr_wait (this=this@entry=0x7fa5e0065878, lk=<optimized out>) at /usr/src/debug/MariaDB-/src_0/storage/innobase/sync/srw_lock.cc:378
            #3  0x000055778945fd7f in ssux_lock_impl<true>::wr_lock (this=0x7fa5e0065878) at /opt/rh/gcc-toolset-10/root/usr/include/c++/10/bits/atomic_base.h:420
            #4  sux_lock<ssux_lock_impl<true> >::x_lock_upgraded (this=0x7fa5e0065878) at /usr/src/debug/MariaDB-/src_0/storage/innobase/include/sux_lock.h:428
            #5  buf_page_get_low (page_id={m_id = 301090092363763}, zip_size=<optimized out>, rw_latch=<optimized out>, guess=<optimized out>, mode=<optimized out>, mtr=<optimized out>, err=<optimized out>, allow_ibuf_merge=<optimized out>) at /usr/src/debug/MariaDB-/src_0/storage/innobase/buf/buf0buf.cc:2849
            

            For a deeper analysis, in case you saved a core dump of the hung process, could you share the output of thread apply all backtrace full from the same hang? Or at least the output of the following:

            thread 53
            frame 7
            print mtr.m_memo
            

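            One way to capture this non-interactively is to run gdb in batch mode against the hung process, or against a saved core dump together with the matching mariadbd binary. The process name, binary path and output file below are only a sketch of the usual defaults; the matching debuginfo packages need to be installed for the frames to be resolved, and attaching gdb stops the server for the duration of the dump.

            # live process (assumes the server binary is named mariadbd)
            gdb --batch -ex "set pagination off" \
                -ex "thread apply all backtrace full" \
                -p "$(pidof mariadbd)" > gdb-log.txt

            # saved core dump (paths are placeholders)
            gdb --batch -ex "set pagination off" \
                -ex "thread apply all backtrace full" \
                /usr/sbin/mariadbd /path/to/core > gdb-log.txt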

            sstamm Sebastian Stamm added a comment:

            I have to wait for the next occurrence, but I will try to get a thread apply all backtrace full.


            sstamm Sebastian Stamm added a comment:

            Here it is: gdb-2023-04-22.7z


            marko Marko Mäkelä added a comment:

            The file gdb-2023-04-22.txt in gdb-2023-04-22.7z contains a promising Thread 51, waiting for a page latch

            #4  sux_lock<ssux_lock_impl<true> >::x_lock_upgraded (this=0x7f094c0c8f18)
            

            in a re-entrant call to btr_cur_pessimistic_index(). A shared latch on the block descriptor 0x7f094c0c8f00 is being held by Thread 49 and Thread 16, both executing btr_cur_t::search_leaf(). Thread 49 is waiting for

            #4  ssux_lock_impl<true>::rd_wait (this=this@entry=0x7f3404024698)
            

            Thread 16 is waiting for

            #4  ssux_lock_impl<true>::wr_lock (this=0x7f094c0c8e78)
            

            Both these blocks are being held by Thread 51:

            #14 0x0000562d5eaa6dfa in row_ins_clust_index_entry_low …
                    mtr = {m_last = 0x7f094c0c9040, m_last_offset = 113, m_log_mode = 0, m_modifications = 1, m_made_dirty = 1, m_inside_ibuf = 0, m_trim_pages = 0, m_memo = {<small_vector_base> = {BeginX = 0x7efe2a154a50, Size = 11, Capacity = 16}, small = {{object = 0x7efd9e533948, type = MTR_MEMO_SX_LOCK}, {object = 0x7f3404024680, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f096c0c5c60, type = MTR_MEMO_PAGE_X_FIX}, {
                            object = 0x7f094c0c7880, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f094c0c68e0, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f094c0c8e60, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7efca00653b8, type = MTR_MEMO_SPACE_X_LOCK}, {object = 0x7f3404024360, type = MTR_MEMO_PAGE_SX_FIX}, {object = 0x7f3404024540, type = MTR_MEMO_PAGE_SX_MODIFY}, {object = 0x7f1fd4049ac0, type = MTR_MEMO_PAGE_SX_MODIFY}, {
                            object = 0x7f094c0c9040, type = MTR_MEMO_PAGE_X_MODIFY}, …
            

            We can see exclusive latches held on both block descriptors by Thread 51: object = 0x7f094c0c8e60, type = MTR_MEMO_PAGE_X_FIX and object = 0x7f3404024680, type = MTR_MEMO_PAGE_X_FIX.
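
            The wait addresses in the blocked frames can be matched against these m_memo entries by noting that each differs from a block descriptor address by the same small offset (presumably the position of the latch inside the block descriptor in this build):

            0x7f094c0c8f18 - 0x7f094c0c8f00 = 0x18   (Thread 51's wait address vs. the descriptor S-latched by Threads 49 and 16)
            0x7f094c0c8e78 - 0x7f094c0c8e60 = 0x18   (Thread 16's wait address vs. a block X-latched by Thread 51)
            0x7f3404024698 - 0x7f3404024680 = 0x18   (Thread 49's wait address vs. a block X-latched by Thread 51)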

            That is, Thread 51 is blocking both threads that are holding a shared latch on a block on which it is waiting for an exclusive latch. This deadlock was fixed in MDEV-29835 by making sure that Thread 51 would acquire an exclusive dict_index_t::lock for any ‘tricky’ page split or merge. The btr_cur_t::search_leaf() that was introduced in MDEV-30400 fixed some of the hangs, but it seems to have made the remaining hangs easier to hit in practice.
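
            For reference, the circular wait described above, spelled out with the addresses from the backtraces:

            Thread 51: waits for an exclusive (upgraded) latch on 0x7f094c0c8f00, which Threads 49 and 16 hold shared
            Thread 49: waits for a shared latch on 0x7f3404024680, which Thread 51 holds exclusively
            Thread 16: waits for an exclusive latch on 0x7f094c0c8e60, which Thread 51 holds exclusively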


            People

              Assignee: marko Marko Mäkelä
              Reporter: sstamm Sebastian Stamm
              Votes: 1
              Watchers: 5

