MariaDB Server / MDEV-35409

InnoDB can still hang while running out of buffer pool

Details

    Description

      It seems that MDEV-33053 may have caused one more regression. We have seen occasional failures of the test innodb_gis.types where InnoDB would hang during crash recovery while running low on buffer pool space.

      mleich produced a core dump where this happens during recovery, with the following stack trace in the thread that is waiting to allocate a block:

      buf_LRU_get_free_block
      recv_sys_t::recover_low
      recv_sys_t::recover
      buf_page_get_gen
      trx_undo_mem_create_at_db_start
      trx_undo_lists_init
      trx_rseg_mem_restore
      trx_rseg_array_init
      trx_lists_init_at_db_start
      srv_start
      innodb_init
      

      Both buf_pool.free and buf_pool.flush_list were empty. In buf_pool.LRU there were 248 blocks; the innodb_buffer_pool_size could correspond to 512 blocks. I could see at least one block that was read-latched and buffer-fixed, but many of the blocks were actually in a replaceable state.

      It seems to me that the buf_pool_page_cleaner thread was being woken up about once per second, but buf_pool_t::need_LRU_eviction() would likely fail to hold. I believe that the following should prevent this:

      diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
      index 4c270d2bdef..df85feb603a 100644
      --- a/storage/innobase/buf/buf0flu.cc
      +++ b/storage/innobase/buf/buf0flu.cc
      @@ -2564,6 +2564,7 @@ static void buf_flush_page_cleaner()
       ATTRIBUTE_COLD void buf_pool_t::LRU_warn()
       {
         mysql_mutex_assert_owner(&mutex);
      +  try_LRU_scan= false;
         if (!LRU_warned.test_and_set(std::memory_order_acquire))
           sql_print_warning("InnoDB: Could not free any blocks in the buffer pool!"
                             " %zu blocks are in use and %zu free."
      

      The loop in buf_pool_t::need_LRU_eviction() invokes this function. Setting the flag would ensure that buf_flush_page_cleaner will do something to alleviate the situation.

      As far as I understand, this hang is only possible with a small buffer pool when a large part of the buffer pool is being allocated for something else (such as crash recovery, the adaptive hash index, or explicit locks) so that buf_pool.LRU.count is below 256 (BUF_LRU_MIN_LEN). In that regard, this would be a follow-up fix to MDEV-34166.
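
To make the reasoning above concrete, here is a minimal, self-contained sketch of the condition. It is not the InnoDB source: the struct BufPoolModel, the helper LRU_warn_patched and the LRU_SCAN_DEPTH value are hypothetical, and the shape of the predicate is an assumption inferred from the description (the page cleaner evicts only if try_LRU_scan is unset, or if the LRU list is longer than BUF_LRU_MIN_LEN and the free list is short). That try_LRU_scan was still set in the core dump is also an assumption.

```cpp
#include <cstddef>

// Hypothetical, simplified model of the buffer pool state described in the
// report. Field names mirror the report; this is NOT the InnoDB source.
struct BufPoolModel {
  std::size_t lru_len;   // length of buf_pool.LRU
  std::size_t free_len;  // length of buf_pool.free
  bool try_LRU_scan;     // false: a thread waits for the page cleaner to evict
};

constexpr std::size_t BUF_LRU_MIN_LEN = 256;  // value quoted in the report
constexpr std::size_t LRU_SCAN_DEPTH = 1024;  // assumed value, illustration only

// Assumed shape of buf_pool_t::need_LRU_eviction(): evict if a waiter cleared
// try_LRU_scan, or if the LRU list is long enough and the free list is short.
constexpr bool need_LRU_eviction(const BufPoolModel &p) {
  return !p.try_LRU_scan ||
         (p.lru_len > BUF_LRU_MIN_LEN && p.free_len < LRU_SCAN_DEPTH / 2);
}

// Models the effect of the proposed one-line change in LRU_warn():
// the flag is cleared so that the page cleaner is forced to act.
constexpr BufPoolModel LRU_warn_patched(const BufPoolModel &p) {
  return BufPoolModel{p.lru_len, p.free_len, false};
}

// State resembling the core dump: 248 blocks in LRU, empty free list,
// and (assumed) try_LRU_scan still set.
constexpr BufPoolModel core_dump{248, 0, true};

// Without the patch the predicate does not hold, so nothing gets evicted.
static_assert(!need_LRU_eviction(core_dump), "hang: no eviction requested");

// With the patch the warning path clears the flag and the predicate holds.
static_assert(need_LRU_eviction(LRU_warn_patched(core_dump)),
              "patched: eviction requested");
```

Under these assumptions, the model shows why a pool with fewer than BUF_LRU_MIN_LEN blocks in buf_pool.LRU can wait indefinitely even though replaceable blocks exist, and why clearing try_LRU_scan in the warning path would be enough to make the page cleaner act.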

            People

              marko Marko Mäkelä