Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-35053

Crash in purge_sys_t::iterator::free_history_rseg()

Details

    Description

      The recently extended test innodb.log_file_size_online,encrypted occasionally triggers a crash of InnoDB when built WITH_ASAN=ON or WITH_MSAN=ON. I can reproduce it in combination with CMAKE_BUILD_TYPE=RelWithDebInfo; on https://buildbot.mariadb.org I have seen it fail with CMAKE_BUILD_TYPE=Debug as well. The problem appears to be that the rseg.undo_cached list is empty.

      Attachments

        Issue Links

          Activity

            I was able to reproduce something in rr record:

            10.11 21b20712a3fe7ac44291b398a3731e514c23c8a4

            ==2266368==ERROR: AddressSanitizer: heap-use-after-free on address 0x503000006080 at pc 0x57166638812c bp 0x7ea82bffed00 sp 0x7ea82bffecf0
            READ of size 8 at 0x503000006080 thread T4
            2024-10-01  1:15:07 0 [Note] Server socket created on IP: '127.0.0.1'.
                #0 0x57166638812b in purge_sys_t::get_page(page_id_t) /home/marko/11.2/storage/innobase/trx/trx0purge.cc:791
                #1 0x571666388a83 in purge_sys_t::choose_next_log() /home/marko/11.2/storage/innobase/trx/trx0purge.cc:901
                #2 0x571664cc0ba9 in purge_sys_t::fetch_next_rec() /home/marko/11.2/storage/innobase/trx/trx0purge.cc:1014
                #3 0x571664cc0ba9 in trx_purge_attach_undo_recs /home/marko/11.2/storage/innobase/trx/trx0purge.cc:1236
                #4 0x571664cc0ba9 in trx_purge(unsigned long, unsigned long) /home/marko/11.2/storage/innobase/trx/trx0purge.cc:1372
                #5 0x571666376180 in purge_coordinator_state::do_purge() /home/marko/11.2/storage/innobase/srv/srv0srv.cc:1505
                #6 0x57166637586d in purge_coordinator_callback /home/marko/11.2/storage/innobase/srv/srv0srv.cc:1589
            0x503000006080 is located 16 bytes inside of 32-byte region [0x503000006070,0x503000006090)
            freed by thread T4 here:
                #0 0x7ea8342fa678 in free ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:52
                #1 0x571666387c4d in ut_allocator<std::__detail::_Hash_node<std::pair<page_id_t const, buf_block_t*>, true>, true>::deallocate(std::__detail::_Hash_node<std::pair<page_id_t const, buf_block_t*>, true>*, unsigned long) /home/marko/11.2/storage/innobase/include/ut0new.h:426
                #8 0x571666387c4d in std::unordered_map<page_id_t const, buf_block_t*, purge_sys_t::hasher, std::equal_to<page_id_t>, ut_allocator<std::pair<page_id_t const, buf_block_t*>, true> >::erase(page_id_t const&) /usr/include/c++/13/bits/unordered_map.h:770
                #9 0x571666387c4d in purge_sys_t::get_page(page_id_t) /home/marko/11.2/storage/innobase/trx/trx0purge.cc:788
                #10 0x571666388a83 in purge_sys_t::choose_next_log() /home/marko/11.2/storage/innobase/trx/trx0purge.cc:901
                #11 0x571664cc0ba9 in purge_sys_t::fetch_next_rec() /home/marko/11.2/storage/innobase/trx/trx0purge.cc:1014
                #12 0x571664cc0ba9 in trx_purge_attach_undo_recs /home/marko/11.2/storage/innobase/trx/trx0purge.cc:1236
                #13 0x571664cc0ba9 in trx_purge(unsigned long, unsigned long) /home/marko/11.2/storage/innobase/trx/trx0purge.cc:1372
                #14 0x571666376180 in purge_coordinator_state::do_purge() /home/marko/11.2/storage/innobase/srv/srv0srv.cc:1505
                #15 0x57166637586d in purge_coordinator_callback /home/marko/11.2/storage/innobase/srv/srv0srv.cc:1589
            previously allocated by thread T4 here:
                #7 0x5716663878b9 in std::unordered_map<page_id_t const, buf_block_t*, purge_sys_t::hasher, std::equal_to<page_id_t>, ut_allocator<std::pair<page_id_t const, buf_block_t*>, true> >::operator[](page_id_t const&) /usr/include/c++/13/bits/unordered_map.h:987
                #8 0x5716663878b9 in purge_sys_t::get_page(page_id_t) /home/marko/11.2/storage/innobase/trx/trx0purge.cc:782
                #9 0x571666388a83 in purge_sys_t::choose_next_log() /home/marko/11.2/storage/innobase/trx/trx0purge.cc:901
                #10 0x571664cc0ba9 in purge_sys_t::fetch_next_rec() /home/marko/11.2/storage/innobase/trx/trx0purge.cc:1014
                #11 0x571664cc0ba9 in trx_purge_attach_undo_recs /home/marko/11.2/storage/innobase/trx/trx0purge.cc:1236
                #12 0x571664cc0ba9 in trx_purge(unsigned long, unsigned long) /home/marko/11.2/storage/innobase/trx/trx0purge.cc:1372
                #13 0x571666376180 in purge_coordinator_state::do_purge() /home/marko/11.2/storage/innobase/srv/srv0srv.cc:1505
                #14 0x57166637586d in purge_coordinator_callback /home/marko/11.2/storage/innobase/srv/srv0srv.cc:1589
            

            This is related to the purge_sys.pages cache that was implemented in MDEV-34515.

            marko Marko Mäkelä added a comment - I was able to reproduce something in rr record : 10.11 21b20712a3fe7ac44291b398a3731e514c23c8a4 ==2266368==ERROR: AddressSanitizer: heap-use-after-free on address 0x503000006080 at pc 0x57166638812c bp 0x7ea82bffed00 sp 0x7ea82bffecf0 READ of size 8 at 0x503000006080 thread T4 2024-10-01 1:15:07 0 [Note] Server socket created on IP: '127.0.0.1'. #0 0x57166638812b in purge_sys_t::get_page(page_id_t) /home/marko/11.2/storage/innobase/trx/trx0purge.cc:791 #1 0x571666388a83 in purge_sys_t::choose_next_log() /home/marko/11.2/storage/innobase/trx/trx0purge.cc:901 #2 0x571664cc0ba9 in purge_sys_t::fetch_next_rec() /home/marko/11.2/storage/innobase/trx/trx0purge.cc:1014 #3 0x571664cc0ba9 in trx_purge_attach_undo_recs /home/marko/11.2/storage/innobase/trx/trx0purge.cc:1236 #4 0x571664cc0ba9 in trx_purge(unsigned long, unsigned long) /home/marko/11.2/storage/innobase/trx/trx0purge.cc:1372 #5 0x571666376180 in purge_coordinator_state::do_purge() /home/marko/11.2/storage/innobase/srv/srv0srv.cc:1505 #6 0x57166637586d in purge_coordinator_callback /home/marko/11.2/storage/innobase/srv/srv0srv.cc:1589 … 0x503000006080 is located 16 bytes inside of 32-byte region [0x503000006070,0x503000006090) freed by thread T4 here: #0 0x7ea8342fa678 in free ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:52 #1 0x571666387c4d in ut_allocator<std::__detail::_Hash_node<std::pair<page_id_t const, buf_block_t*>, true>, true>::deallocate(std::__detail::_Hash_node<std::pair<page_id_t const, buf_block_t*>, true>*, unsigned long) /home/marko/11.2/storage/innobase/include/ut0new.h:426 … #8 0x571666387c4d in std::unordered_map<page_id_t const, buf_block_t*, purge_sys_t::hasher, std::equal_to<page_id_t>, ut_allocator<std::pair<page_id_t const, buf_block_t*>, true> >::erase(page_id_t const&) /usr/include/c++/13/bits/unordered_map.h:770 #9 0x571666387c4d in purge_sys_t::get_page(page_id_t) /home/marko/11.2/storage/innobase/trx/trx0purge.cc:788 #10 0x571666388a83 in purge_sys_t::choose_next_log() /home/marko/11.2/storage/innobase/trx/trx0purge.cc:901 #11 0x571664cc0ba9 in purge_sys_t::fetch_next_rec() /home/marko/11.2/storage/innobase/trx/trx0purge.cc:1014 #12 0x571664cc0ba9 in trx_purge_attach_undo_recs /home/marko/11.2/storage/innobase/trx/trx0purge.cc:1236 #13 0x571664cc0ba9 in trx_purge(unsigned long, unsigned long) /home/marko/11.2/storage/innobase/trx/trx0purge.cc:1372 #14 0x571666376180 in purge_coordinator_state::do_purge() /home/marko/11.2/storage/innobase/srv/srv0srv.cc:1505 #15 0x57166637586d in purge_coordinator_callback /home/marko/11.2/storage/innobase/srv/srv0srv.cc:1589 … previously allocated by thread T4 here: … #7 0x5716663878b9 in std::unordered_map<page_id_t const, buf_block_t*, purge_sys_t::hasher, std::equal_to<page_id_t>, ut_allocator<std::pair<page_id_t const, buf_block_t*>, true> >::operator[](page_id_t const&) /usr/include/c++/13/bits/unordered_map.h:987 #8 0x5716663878b9 in purge_sys_t::get_page(page_id_t) /home/marko/11.2/storage/innobase/trx/trx0purge.cc:782 #9 0x571666388a83 in purge_sys_t::choose_next_log() /home/marko/11.2/storage/innobase/trx/trx0purge.cc:901 #10 0x571664cc0ba9 in purge_sys_t::fetch_next_rec() /home/marko/11.2/storage/innobase/trx/trx0purge.cc:1014 #11 0x571664cc0ba9 in trx_purge_attach_undo_recs /home/marko/11.2/storage/innobase/trx/trx0purge.cc:1236 #12 0x571664cc0ba9 in trx_purge(unsigned long, unsigned long) /home/marko/11.2/storage/innobase/trx/trx0purge.cc:1372 #13 0x571666376180 in purge_coordinator_state::do_purge() /home/marko/11.2/storage/innobase/srv/srv0srv.cc:1505 #14 0x57166637586d in purge_coordinator_callback /home/marko/11.2/storage/innobase/srv/srv0srv.cc:1589 This is related to the purge_sys.pages cache that was implemented in MDEV-34515 .

            The bug that AddressSanitizer caught is a rather embarrassing one, introduced by myself in MDEV-34515 but not present in any release. The reference would point to freed memory, all within the same function call:

            diff --git a/storage/innobase/trx/trx0purge.cc b/storage/innobase/trx/trx0purge.cc
            index 65583c284e6..e1f974b29ab 100644
            --- a/storage/innobase/trx/trx0purge.cc
            +++ b/storage/innobase/trx/trx0purge.cc
            @@ -779,13 +779,16 @@ buf_block_t *purge_sys_t::get_page(page_id_t id)
             {
               ut_ad(!recv_sys.recovery_on);
             
            -  buf_block_t*& undo_page= pages[id];
            +  buf_block_t *&h= pages[id];
            +  buf_block_t *undo_page= h;
             
               if (!undo_page)
               {
                 undo_page= buf_pool.page_fix(id); // batch_cleanup() will unfix()
                 if (!undo_page)
                   pages.erase(id);
            +    else
            +      h= undo_page;
               }
             
               return undo_page;
            

            It turns out that this is all that is needed to fix the occasional test failure. The following completed without errors on a 152-core system:

            ./mtr --parallel=160 --repeat=20 innodb.log_file_size_online,encrypted{,,,,,,,}{,,,,,,,,,}{,}{,}
            

            10.11 21b20712a3fe7ac44291b398a3731e514c23c8a4 with the above patch

            innodb.log_file_size_online 'encrypted'  w158 [ 20 pass ]   4115
            innodb.log_file_size_online 'encrypted'  w28 [ 20 pass ]   4248
            --------------------------------------------------------------------------
            The servers were restarted 0 times
            Spent 30775.428 of 219 seconds executing testcases
             
            Completed: All 6400 tests were successful.
            

            Without the patch, this invocation would fail:

            10.11 21b20712a3fe7ac44291b398a3731e514c23c8a4

            innodb.log_file_size_online 'encrypted'  w39 [ 5 fail ]
            241001  4:12:43 [ERROR] mysqld got signal 11 ;
            #11 0x00005d4d5a8bad36 in handle_fatal_signal (sig=<optimized out>) at /home/marko/11.2/sql/signal_handler.cc:230
            #12 <signal handler called>
            #13 0x00005d4d5b40c281 in purge_sys_t::iterator::free_history_rseg (this=this@entry=0x75276950afc0, rseg=...) at /home/marko/11.2/storage/innobase/trx/trx0purge.cc:504
            #14 0x00005d4d5b403313 in purge_sys_t::iterator::free_history (this=<optimized out>) at /home/marko/11.2/storage/innobase/trx/trx0purge.cc:550
            #15 0x00005d4d5b3eef4a in purge_truncation_callback () at /home/marko/11.2/storage/innobase/srv/srv0srv.cc:1139
            #16 0x00005d4d5b68a41b in tpool::task_group::execute (this=0x5d4d5d24ed40 <purge_truncation_task_group>, t=0x5d4d5d24ec80 <purge_truncation_task>) at /home/marko/11.2/tpool/task_group.cc:70
            #17 0x00005d4d5b683219 in tpool::thread_pool_generic::worker_main (this=0x518000000480, thread_var=0x530000010500) at /home/marko/11.2/tpool/tpool_generic.cc:583
            

            marko Marko Mäkelä added a comment - The bug that AddressSanitizer caught is a rather embarrassing one, introduced by myself in MDEV-34515 but not present in any release. The reference would point to freed memory, all within the same function call: diff --git a/storage/innobase/trx/trx0purge.cc b/storage/innobase/trx/trx0purge.cc index 65583c284e6..e1f974b29ab 100644 --- a/storage/innobase/trx/trx0purge.cc +++ b/storage/innobase/trx/trx0purge.cc @@ -779,13 +779,16 @@ buf_block_t *purge_sys_t::get_page(page_id_t id) { ut_ad(!recv_sys.recovery_on); - buf_block_t*& undo_page= pages[id]; + buf_block_t *&h= pages[id]; + buf_block_t *undo_page= h; if (!undo_page) { undo_page= buf_pool.page_fix(id); // batch_cleanup() will unfix() if (!undo_page) pages.erase(id); + else + h= undo_page; } return undo_page; It turns out that this is all that is needed to fix the occasional test failure. The following completed without errors on a 152-core system: . /mtr --parallel=160 --repeat=20 innodb.log_file_size_online,encrypted{,,,,,,,}{,,,,,,,,,}{,}{,} 10.11 21b20712a3fe7ac44291b398a3731e514c23c8a4 with the above patch innodb.log_file_size_online 'encrypted' w158 [ 20 pass ] 4115 innodb.log_file_size_online 'encrypted' w28 [ 20 pass ] 4248 -------------------------------------------------------------------------- The servers were restarted 0 times Spent 30775.428 of 219 seconds executing testcases   Completed: All 6400 tests were successful. Without the patch, this invocation would fail: 10.11 21b20712a3fe7ac44291b398a3731e514c23c8a4 innodb.log_file_size_online 'encrypted' w39 [ 5 fail ] … 241001 4:12:43 [ERROR] mysqld got signal 11 ; … #11 0x00005d4d5a8bad36 in handle_fatal_signal (sig=<optimized out>) at /home/marko/11.2/sql/signal_handler.cc:230 #12 <signal handler called> #13 0x00005d4d5b40c281 in purge_sys_t::iterator::free_history_rseg (this=this@entry=0x75276950afc0, rseg=...) at /home/marko/11.2/storage/innobase/trx/trx0purge.cc:504 #14 0x00005d4d5b403313 in purge_sys_t::iterator::free_history (this=<optimized out>) at /home/marko/11.2/storage/innobase/trx/trx0purge.cc:550 #15 0x00005d4d5b3eef4a in purge_truncation_callback () at /home/marko/11.2/storage/innobase/srv/srv0srv.cc:1139 #16 0x00005d4d5b68a41b in tpool::task_group::execute (this=0x5d4d5d24ed40 <purge_truncation_task_group>, t=0x5d4d5d24ec80 <purge_truncation_task>) at /home/marko/11.2/tpool/task_group.cc:70 #17 0x00005d4d5b683219 in tpool::thread_pool_generic::worker_main (this=0x518000000480, thread_var=0x530000010500) at /home/marko/11.2/tpool/tpool_generic.cc:583

            Looks correct. Thanks for identifying and fixing it.

            debarun Debarun Banerjee added a comment - Looks correct. Thanks for identifying and fixing it.

            For the record, the SIGSEGV would be fixed by the following. This is normalyl a sign of corrupted database, but we are not supposed to crash on any corruption (except in CMAKE_BUILD_TYPE=Debug where we want to be able to catch corruption bugs):

            @@ -494,16 +494,18 @@ inline dberr_t purge_sys_t::iterator::free_history_rseg(trx_rseg_t &rseg) const
                     if (undo->hdr_page_no == hdr_addr.page)
                       goto found_cached;
                   ut_ad("inconsistent undo logs" == 0);
            -      if (false)
            -      found_cached:
            -        UT_LIST_REMOVE(rseg.undo_cached, undo);
            +    found_cached:
                   static_assert(FIL_NULL == 0xffffffff, "");
                   if (UNIV_UNLIKELY(mach_read_from_4(TRX_RSEG + TRX_RSEG_FORMAT +
                                                      rseg_hdr->page.frame)))
                     trx_rseg_format_upgrade(rseg_hdr, &mtr);
            -      mtr.memset(rseg_hdr, TRX_RSEG + TRX_RSEG_UNDO_SLOTS +
            -                 undo->id * TRX_RSEG_SLOT_SIZE, 4, 0xff);
            -      ut_free(undo);
            +      if (UNIV_LIKELY(undo != nullptr))
            +      {
            +        UT_LIST_REMOVE(rseg.undo_cached, undo);
            +        mtr.memset(rseg_hdr, TRX_RSEG + TRX_RSEG_UNDO_SLOTS +
            +                   undo->id * TRX_RSEG_SLOT_SIZE, 4, 0xff);
            +        ut_free(undo);
            +      }
                   mtr.write<8,mtr_t::MAYBE_NOP>(*rseg_hdr, TRX_RSEG + TRX_RSEG_MAX_TRX_ID +
                                                 rseg_hdr->page.frame,
                                                 trx_sys.get_max_trx_id() - 1);
            

            marko Marko Mäkelä added a comment - For the record, the SIGSEGV would be fixed by the following. This is normalyl a sign of corrupted database, but we are not supposed to crash on any corruption (except in CMAKE_BUILD_TYPE=Debug where we want to be able to catch corruption bugs): @@ -494,16 +494,18 @@ inline dberr_t purge_sys_t::iterator::free_history_rseg(trx_rseg_t &rseg) const if (undo->hdr_page_no == hdr_addr.page) goto found_cached; ut_ad("inconsistent undo logs" == 0); - if (false) - found_cached: - UT_LIST_REMOVE(rseg.undo_cached, undo); + found_cached: static_assert(FIL_NULL == 0xffffffff, ""); if (UNIV_UNLIKELY(mach_read_from_4(TRX_RSEG + TRX_RSEG_FORMAT + rseg_hdr->page.frame))) trx_rseg_format_upgrade(rseg_hdr, &mtr); - mtr.memset(rseg_hdr, TRX_RSEG + TRX_RSEG_UNDO_SLOTS + - undo->id * TRX_RSEG_SLOT_SIZE, 4, 0xff); - ut_free(undo); + if (UNIV_LIKELY(undo != nullptr)) + { + UT_LIST_REMOVE(rseg.undo_cached, undo); + mtr.memset(rseg_hdr, TRX_RSEG + TRX_RSEG_UNDO_SLOTS + + undo->id * TRX_RSEG_SLOT_SIZE, 4, 0xff); + ut_free(undo); + } mtr.write<8,mtr_t::MAYBE_NOP>(*rseg_hdr, TRX_RSEG + TRX_RSEG_MAX_TRX_ID + rseg_hdr->page.frame, trx_sys.get_max_trx_id() - 1);

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.