Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-34909

DDL during SET GLOBAL innodb_log_file_size may hang when using PMEM

    XMLWordPrintable

Details

    Description

      While testing MDEV-33853, mleich produced a core dump of a server hang, which resulted in an intentional crash with the infamous message that innodb_fatal_semaphore_wait for dict_sys.latch was exceeded.

      In the stack traces that I analyzed, there were several threads trying to execute SET GLOBAL innodb_log_file_size as well as one DDL operation that was waiting on an exclusive log_sys.latch and therefore in the end blocking all threads, while holding exclusive dict_sys.latch.

      I think that the root cause is flawed logic in log_resize_acquire(). It could also make sense to protect that function with LOCK_global_system_variables so that multiple instances of log_resize_acquire() can be executed concurrently. A fix might look as follows. This will need some additional stress testing to validate this.

      diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
      index ce2d3958f9c..27adf670b11 100644
      --- a/storage/innobase/handler/ha_innodb.cc
      +++ b/storage/innobase/handler/ha_innodb.cc
      @@ -18518,9 +18518,7 @@ buffer_pool_load_abort(
       static void innodb_log_file_buffering_update(THD *thd, st_mysql_sys_var*,
                                                    void *, const void *save)
       {
      -  mysql_mutex_unlock(&LOCK_global_system_variables);
         log_sys.set_buffered(*static_cast<const my_bool*>(save));
      -  mysql_mutex_lock(&LOCK_global_system_variables);
       }
       #endif
       
      @@ -18528,7 +18526,6 @@ static void innodb_log_file_size_update(THD *thd, st_mysql_sys_var*,
                                               void *var, const void *save)
       {
         ut_ad(var == &srv_log_file_size);
      -  mysql_mutex_unlock(&LOCK_global_system_variables);
       
         if (high_level_read_only)
           ib_senderrf(thd, IB_LOG_LEVEL_ERROR, ER_READ_ONLY_MODE);
      @@ -18551,13 +18548,15 @@ static void innodb_log_file_size_update(THD *thd, st_mysql_sys_var*,
             ib_senderrf(thd, IB_LOG_LEVEL_ERROR, ER_CANT_CREATE_HANDLER_FILE);
             break;
           case log_t::RESIZE_STARTED:
      +      mysql_mutex_unlock(&LOCK_global_system_variables);
             const lsn_t start{log_sys.resize_in_progress()};
             for (timespec abstime;;)
             {
               if (thd_kill_level(thd))
               {
      +          mysql_mutex_lock(&LOCK_global_system_variables);
                 log_sys.resize_abort();
      -          break;
      +          return;
               }
       
               set_timespec(abstime, 5);
      @@ -18588,9 +18587,9 @@ static void innodb_log_file_size_update(THD *thd, st_mysql_sys_var*,
               if (!resizing || resizing > start /* only wait for our resize */)
                 break;
             }
      +      mysql_mutex_lock(&LOCK_global_system_variables);
           }
         }
      -  mysql_mutex_lock(&LOCK_global_system_variables);
       }
       
       static void innodb_log_spin_wait_delay_update(THD *, st_mysql_sys_var*,
      diff --git a/storage/innobase/log/log0log.cc b/storage/innobase/log/log0log.cc
      index d7aae556ce0..7c94876996c 100644
      --- a/storage/innobase/log/log0log.cc
      +++ b/storage/innobase/log/log0log.cc
      @@ -384,6 +384,8 @@ void log_t::close_file()
       /** Acquire all latches that protect the log. */
       static void log_resize_acquire()
       {
      +  mysql_mutex_assert_owner(&LOCK_global_system_variables);
      +
         if (!log_sys.is_pmem())
         {
           while (flush_lock.acquire(log_sys.get_lsn() + 1, nullptr) !=
      

      I am also wondering if the logic regarding flush_lock.acquire() and write_lock.acquire(), which are partly visible above, is correct.

      wlad, I wonder if there is any chance that another thread in log_write_up_to() could get "in between" and start to wait for log_sys.latch before log_resize_acquire() completes log_sys.latch.wr_lock()? If yes, is there any way to acquire the write_lock and flush_lock with an even larger LSN (such as LSN_MAX) and rewind to the actual LSN once done? I have a concern that even with the above patch, a hang could be possible.

      Attachments

        Issue Links

          Activity

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.