MariaDB Server
MDEV-32049

Deadlock due to log_free_check(), involving trx_purge_truncate_rseg_history() and trx_undo_assign_low()

Details

    Description

      Hi,
      A week ago, our production database cluster (1 master, 4 replicas, MaxScale as proxy) started to deadlock the master approximately every 12 hours. We are still looking for the trigger, but without any success: there is no obvious problematic query in the PROCESSLIST, nothing.

      Finally, today we were able to get a decent coredump, using the quay.io/mariadb-foundation/mariadb-debug:10.6 image. The exact version is 10.6.16-MariaDB-1:10.6.16+maria~ubu2004-log, source revision 07494006dd0887ebfb31564a8fd4c59cf1b299e9; the exact image version is docker.io/library/mariadb@sha256:fcbe381e5fef20c7a2932b52a070f58987b770c651aedf705332e54d1dfd465f.

      SELECTs seem to be running OK; DML queries are blocked, some in "opening table" and some in "sending data".

      I'm attaching both the server log and a full backtrace. I also have the coredump, but it is 700 MB bzipped, so I am not attaching it; it is available.

      Attachments

        Issue Links

          Activity

            Thank you for an excellent report, with a useful mariadbd_full_bt_all_threads.txt to start with. Many threads are waiting for dict_sys.latch, or the caller of a syscall() was invoked with this=dict_sys+72. If I ignore those, I should hopefully find the thread that is blocked while holding an exclusive dict_sys.latch. The first interesting thread is this one:

            10.6 07494006dd0887ebfb31564a8fd4c59cf1b299e9

            Thread 236 (Thread 0x7fea195a2700 (LWP 216099)):
            #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
            …
            #5  srw_lock_impl<true>::wr_lock (line=1435, file=0x561122d9c240 "/home/buildbot/amd64-ubuntu-2004-deb-autobake/build/storage/innobase/trx/trx0undo.cc", this=0x561123344bc8 <trx_sys+16648>) at ./storage/innobase/include/srw_lock.h:528
            No locals.
            #6  trx_undo_assign_low<false> (trx=trx@entry=0x7feb52822580, rseg=rseg@entry=0x561123344bc0 <trx_sys+16640>, undo=undo@entry=0x7feb52822ed0, mtr=mtr@entry=0x7fea1959e790, err=err@entry=0x7fea1959e754) at ./storage/innobase/trx/trx0undo.cc:1435
                    block = <optimized out>
            

            This is executing an INSERT, so it can’t be holding an exclusive dict_sys.latch. It is waiting for an exclusive rseg->latch so that it can assign the first undo log page for the transaction. Threads 234 and 230 are waiting for the same rseg->latch == trx_sys+16648 at the same spot. This should still be normal; the interesting threads would be the ones that are waiting for something else while holding dict_sys.latch or the rseg->latch.

            In the end, I filtered out these less interesting waits from the output of

            grep -B1 -A3 -w syscall mariadbd_full_bt_all_threads.txt
            

            using replace-regexp in GNU Emacs and got the following:

            10.6 07494006dd0887ebfb31564a8fd4c59cf1b299e9

            Thread 196 (Thread 0x7fea185a0700 (LWP 215877)):
            #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
            No locals.
            #1  0x00005611229ac0d8 in srw_mutex_impl<true>::wait (lk=<optimized out>, this=0x7feb2c090398) at ./storage/innobase/sync/srw_lock.cc:238
            No locals.
            --
            Thread 159 (Thread 0x7fea402d9700 (LWP 1624)):
            #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
            No locals.
            #1  0x00005611229ac028 in srw_mutex_impl<false>::wait (lk=<optimized out>, this=0x7fea00030470) at ./storage/innobase/sync/srw_lock.cc:238
            No locals.
            --
            Thread 149 (Thread 0x7feb52817700 (LWP 57)):
            #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
            No locals.
            #1  0x0000561122aa3b6d in my_getevents (min_nr=1, nr=256, ev=0x7feb52814d60, ctx=<optimized out>) at ./tpool/aio_linux.cc:105
                    saved_errno = <optimized out>
            --
            Thread 44 (Thread 0x7fea19188700 (LWP 216113)):
            #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
            No locals.
            #1  0x00005611229ac0d8 in srw_mutex_impl<true>::wait (lk=<optimized out>, this=0x7fea9c096658) at ./storage/innobase/sync/srw_lock.cc:238
            No locals.
            --
            

            We can ignore Thread 149, which is invoking the io_getevents system call; this image does not use the newer io_uring (MDEV-24883). The other three threads are waiting inside btr_cur_t::search_leaf(). This could be a case that was missed in MDEV-29835. The ticket MDEV-31815 had been filed, but because the test case involved ROW_FORMAT=COMPRESSED tables, it hit other bugs too easily.

            I will check mtr_t::m_memo in all threads in more detail, to see whether this is related to MDEV-31815.
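
            The grep + replace-regexp filtering above can also be scripted. Below is a minimal C++ sketch of the mechanical part of that triage; the file name comes from this report, while the heuristic (keep only thread sections that are parked in syscall() and do not mention dict_sys) and the assumption that gdb's "Thread N (…)" header lines start at column 0 are mine, not a general-purpose tool.

            #include <fstream>
            #include <iostream>
            #include <string>
            #include <vector>

            int main()
            {
              std::ifstream in("mariadbd_full_bt_all_threads.txt");
              std::vector<std::string> block;   // lines of the current "Thread N" section
              auto flush = [&block]() {
                bool waits_in_syscall = false, mentions_dict_sys = false;
                for (const std::string& l : block)
                {
                  if (l.find("syscall ()") != std::string::npos)
                    waits_in_syscall = true;
                  if (l.find("dict_sys") != std::string::npos)
                    mentions_dict_sys = true;
                }
                // keep only sections that are parked in syscall() but not on dict_sys.latch
                if (waits_in_syscall && !mentions_dict_sys)
                  for (const std::string& l : block)
                    std::cout << l << '\n';
                block.clear();
              };
              std::string line;
              while (std::getline(in, line))
              {
                if (line.rfind("Thread ", 0) == 0)  // a new thread section begins
                  flush();
                block.push_back(line);
              }
              flush();                              // emit the final section
              return 0;
            }

            Unlike the manual filtering above, this keeps the rseg->latch waiters as well; it is only meant to show the mechanical part of the triage.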

            marko Marko Mäkelä added a comment

            In 64-bit builds, the offset of buf_block_t::lock is 0x18. Thread 196 is waiting for a page latch on the block at 0x7feb2c090380. An exclusive latch on that page is being held by Thread 42, which is waiting for rseg->latch==trx_sys+16656.

            Thread 159 is waiting for a shared dict_index_t::lock on the index 0x7fea000302f0. An exclusive latch is held by Thread 82, which is waiting for rseg->latch==trx_sys+16656.

            Thread 44 is waiting for a latch on the block 0x7fea9c096640, which is exclusively latched by Thread 52, which is waiting for rseg->latch==trx_sys+16640.
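
            For anyone repeating this arithmetic, here is a trivial check of the numbers above; the only assumption carried over from the analysis is that buf_block_t::lock is at offset 0x18 in this 64-bit build.

            #include <cstdint>
            #include <cstdio>

            int main()
            {
              // lock addresses reported by the waiting threads (this= in frame #1)
              const std::uintptr_t lock_addr[] = { 0x7feb2c090398, 0x7fea9c096658 };
              const std::uintptr_t block_lock_offset = 0x18;  // offsetof(buf_block_t, lock)

              for (std::uintptr_t a : lock_addr)
                // subtracting the member offset recovers the buf_block_t address
                std::printf("lock %#lx -> buf_block_t at %#lx\n",
                            static_cast<unsigned long>(a),
                            static_cast<unsigned long>(a - block_lock_offset));
              return 0;
            }

            The output (0x7feb2c090380 and 0x7fea9c096640) matches the block addresses quoted above.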

            I will have to double check if I made a mistake in my filtering of the grep output. Currently, this does not look related to MDEV-29835 or MDEV-31815. Rather, it looks like the undo logs could be corrupted in some way.

            marko Marko Mäkelä added a comment

            I forgot that GNU libc does not call the syscall() wrapper function, but invokes system calls directly. Thread 104 should be blocking the progress of any thread that is trying to allocate the first undo log page for a modification that it is about to make:

            10.6 07494006dd0887ebfb31564a8fd4c59cf1b299e9

            Thread 104 (Thread 0x7fe9f1ffb700 (LWP 263700)):
            #0  __lll_lock_wait (futex=futex@entry=0x56112334d2c0 <buf_pool+17024>, private=0) at lowlevellock.c:52
                    __ret = <optimized out>
            #1  0x00007feb60b15235 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x56112334d2c0 <buf_pool+17024>) at ../nptl/pthread_mutex_lock.c:135
            …
            #3  buf_flush_wait (lsn=3681184096958) at ./storage/innobase/buf/buf0flu.cc:1884
            …
            #7  0x0000561122219e9f in log_free_check () at ./storage/innobase/log/log0log.cc:1068
            No locals.
            #8  0x0000561122295748 in trx_purge_truncate_rseg_history …
            #9  trx_purge_truncate_history () at ./storage/innobase/trx/trx0purge.cc:587 …
            #10 0x00005611229a9beb in purge_coordinator_state::do_purge …
            #11 0x00005611229a9186 in purge_coordinator_callback …
            

            This is waiting for a log checkpoint to be completed, so that the ib_logfile0 will not be overwritten while we try to free garbage undo log pages. We see that the page cleaner thread is trying to fulfill an earlier request to urgently advance the checkpoint LSN to an earlier LSN:

            Thread 66 (Thread 0x7fea48d40700 (LWP 58)):
            #0  0x00007feb60b151ad in __GI___pthread_mutex_lock (mutex=0x56112334d2c0 <buf_pool+17024>) at ../nptl/pthreadP.h:54
                    cnt = 0
                    max_cnt = <optimized out>
                    type = <optimized out>
                    __PRETTY_FUNCTION__ = "__pthread_mutex_lock"
                    id = <optimized out>
            #1  0x0000561122240ab9 in inline_mysql_mutex_lock (src_line=src_line@entry=2007, src_file=0x561122d9fd60 "/home/buildbot/amd64-ubuntu-2004-deb-autobake/build/storage/innobase/buf/buf0flu.cc", that=<optimized out>) at ./include/mysql/psi/mysql_thread.h:752
            No locals.
            #2  0x000056112224197d in buf_flush_sync_for_checkpoint (lsn=3681399771675) at ./storage/innobase/buf/buf0flu.cc:2007
                    newest_lsn = 3681399771675
                    measure = <optimized out>
                    checkpoint_lsn = <optimized out>
                    target = <optimized out>
            #3  buf_flush_page_cleaner () at ./storage/innobase/buf/buf0flu.cc:2250
            

            I do not think that this thread is really blocked; it just happens to be waiting for buf_pool.flush_list_mutex at this very moment. But the checkpoint cannot be advanced, presumably because one of the active threads is holding an exclusive latch on a page whose oldest_modification is earlier than the desired checkpoint LSN. We can only allow shared page latches to be held by other threads while a page is being written out from the buffer pool.

            It will be tricky to fix this.

            Setting a larger innodb_log_file_size should reduce the probability of this hang. Thanks to MDEV-29911, there should be less concern about making innodb_log_file_size even several times larger than innodb_buffer_pool_size.

            marko Marko Mäkelä added a comment

            In MDEV-30753, some calls to log_free_check() were added to trx_purge_free_segment(), which is invoked while the thread is holding rseg->mutex (10.5) or rseg->latch (later versions). I think that these calls need to be moved before the acquisition of the latch.

            marko Marko Mäkelä added a comment

            By the way, the exclusive dict_sys.latch is held by Thread 45, which invoked log_free_check() inside dict_stats_update(), while opening tables for executing an UPDATE statement. The culprit for the hang remains trx_purge_truncate_rseg_history(). The call to log_free_check() has to be moved to its caller trx_purge_truncate_history(), before rseg->mutex or rseg->latch is acquired.
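
            A minimal sketch of that ordering follows, with simplified stand-in types and function names; it is not the actual patch, and the real code iterates over the rollback segments of trx_sys. The point is only that the potentially long wait in log_free_check() happens while no rollback-segment latch is held, so it cannot take part in the cycle described above.

            #include <cstddef>
            #include <mutex>

            struct rseg_t { std::mutex latch; };

            // Stand-in: in InnoDB this may block until the page cleaner has advanced
            // the checkpoint far enough that ib_logfile0 will not be overwritten.
            static void log_free_check_stub() {}

            // Stand-in for freeing one rollback segment's history; crucially, it does
            // not call log_free_check() itself while the latch is held.
            static void truncate_rseg_history_stub(rseg_t&) {}

            void purge_truncate_history_sketch(rseg_t* rsegs, std::size_t n)
            {
              for (std::size_t i = 0; i < n; i++)
              {
                log_free_check_stub();                  // wait (if needed) before latching
                std::lock_guard<std::mutex> guard(rsegs[i].latch);
                truncate_rseg_history_stub(rsegs[i]);   // latch held only for the actual work
              }
            }

            int main()
            {
              rseg_t rsegs[2];                          // two stand-in rollback segments
              purge_truncate_history_sketch(rsegs, 2);
              return 0;
            }

            With the original ordering (log_free_check() inside the latched region), the purge coordinator waits for the page cleaner while holding rseg->latch, while writers that hold exclusive latches on dirty pages wait for that same rseg->latch, which is the cycle seen in this core dump.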

            marko Marko Mäkelä added a comment
            cuchac Cuchac added a comment -

            Oh, you are so amazing! Fully analyzed, with a patch, in 2 hours; incredible, thank you!

            I will be happy to test it in production once this patch lands in some image.


            I have now pushed a fix to 10.5 and merged it up to 10.6. Unfortunately, I think that it is very hard to reproduce this; we never encountered this hang in our internal tests.

            marko Marko Mäkelä added a comment
            cuchac Cuchac added a comment -

            Hello,
            thanks again for the quick fix. Are there any usable Docker images built regularly from the 10.6 branch that I can use to test this fix? Our cluster goes down every 10 hours, so we will know quite soon.
            Or should I build an image from source?

            danblack Daniel Black added a comment -

            Something got stuck in the build pipeline. Both 10.5 and 10.6 images containing this fix will be on https://quay.io/repository/mariadb-foundation/mariadb-devel?tab=tags in a few hours.

            danblack Daniel Black added a comment -

            Both are there now on mariadb-devel. Debug images are currently being pushed.

            cuchac Cuchac added a comment -

            Great, I was hoping mariadb-devel would get the update. I just deployed the fix to our cluster. Will let you know.
            For the last 2 days there were deadlocks every 4 hours. Unfortunately, I still do not know how to reproduce this: too much traffic, and it only happens in production.

            Hopefully this patch will fix it and we will be happy again.


            Thank you cuchac, I am eager to know the outcome.

            marko Marko Mäkelä added a comment
            cuchac Cuchac added a comment -

            Great work, everybody! 2 days without a deadlock, yay!

            danblack Daniel Black added a comment - edited

            Thanks for the feedback. Great to have something clear and actionable.

            As this is the first bug report using quay.io/mariadb-foundation/mariadb-debug, what were some of the good/bad points about it, and what would you (request me to) improve?

            cuchac Cuchac added a comment -

            Hello,

            it is great that such an image exists. I think we would not have been able to produce a working backtrace without it.

            I had some issues that can be fixed by improving documentation:
            1) It was quite hard to find this image. I think https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ and/or https://mariadb.com/kb/en/enabling-core-dumps/ should mention that image in the containers section.

            2) We failed to produce a coredump several times. We did not realize that kernel.core_pattern is not namespaced, and our k8s cluster provider had to change the value from the default Ubuntu value (apport) to a plain file name. It is quite weird that kernel.core_pattern is resolved inside the container when a file path is specified, but in the root namespace when the "pipe" format is used (as is the default with Ubuntu apport).

            danblack Daniel Black added a comment -

            cuchac Thanks for the feedback. Documented.


            A high-level description of this bug is that InnoDB could hang when removing the history of committed transactions. As far as I can tell, this bug can occur independently of any configuration parameters. All that is needed is some transactions that modify non-temporary InnoDB tables, and sufficiently bad luck. A smaller setting of innodb_log_file_size could increase the probability of hitting this hang.

            marko Marko Mäkelä added a comment

            People

              marko Marko Mäkelä
              cuchac Cuchac
              Votes: 1
              Watchers: 6
