[MDEV-34909] DDL during SET GLOBAL innodb_log_file_size may hang when using PMEM - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL), 11.3(EOL), 11.4, 11.5(EOL), 11.6(EOL), 11.7(EOL)
Fix Version/s: 10.11.10, 11.2.6, 11.4.4, 11.6.2
Component/s: Storage Engine - InnoDB
Labels:
- hang

Description

While testing ~~MDEV-33853~~, mleich produced a core dump of a server hang, which resulted in an intentional crash with the infamous message that innodb_fatal_semaphore_wait for dict_sys.latch was exceeded.

In the stack traces that I analyzed, there were several threads trying to execute SET GLOBAL innodb_log_file_size as well as one DDL operation that was waiting on an exclusive log_sys.latch and therefore in the end blocking all threads, while holding exclusive dict_sys.latch.

I think that the root cause is flawed logic in log_resize_acquire(). It could also make sense to protect that function with LOCK_global_system_variables so that multiple instances of log_resize_acquire() can be executed concurrently. A fix might look as follows. This will need some additional stress testing to validate this.

diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc

index ce2d3958f9c..27adf670b11 100644

--- a/storage/innobase/handler/ha_innodb.cc

+++ b/storage/innobase/handler/ha_innodb.cc

@@ -18518,9 +18518,7 @@ buffer_pool_load_abort(

 static void innodb_log_file_buffering_update(THD *thd, st_mysql_sys_var*,

                                              void *, const void *save)

-  mysql_mutex_unlock(&LOCK_global_system_variables);

   log_sys.set_buffered(*static_cast<const my_bool*>(save));

-  mysql_mutex_lock(&LOCK_global_system_variables);

 #endif

@@ -18528,7 +18526,6 @@ static void innodb_log_file_size_update(THD *thd, st_mysql_sys_var*,

                                         void *var, const void *save)

   ut_ad(var == &srv_log_file_size);

-  mysql_mutex_unlock(&LOCK_global_system_variables);

   if (high_level_read_only)

     ib_senderrf(thd, IB_LOG_LEVEL_ERROR, ER_READ_ONLY_MODE);

@@ -18551,13 +18548,15 @@ static void innodb_log_file_size_update(THD *thd, st_mysql_sys_var*,

       ib_senderrf(thd, IB_LOG_LEVEL_ERROR, ER_CANT_CREATE_HANDLER_FILE);

       break;

     case log_t::RESIZE_STARTED:

+      mysql_mutex_unlock(&LOCK_global_system_variables);

       const lsn_t start{log_sys.resize_in_progress()};

       for (timespec abstime;;)

         if (thd_kill_level(thd))

+          mysql_mutex_lock(&LOCK_global_system_variables);

           log_sys.resize_abort();

-          break;

+          return;

         set_timespec(abstime, 5);

@@ -18588,9 +18587,9 @@ static void innodb_log_file_size_update(THD *thd, st_mysql_sys_var*,

         if (!resizing || resizing > start /* only wait for our resize */)

           break;

+      mysql_mutex_lock(&LOCK_global_system_variables);

-  mysql_mutex_lock(&LOCK_global_system_variables);

 static void innodb_log_spin_wait_delay_update(THD *, st_mysql_sys_var*,

diff --git a/storage/innobase/log/log0log.cc b/storage/innobase/log/log0log.cc

index d7aae556ce0..7c94876996c 100644

--- a/storage/innobase/log/log0log.cc

+++ b/storage/innobase/log/log0log.cc

@@ -384,6 +384,8 @@ void log_t::close_file()

 /** Acquire all latches that protect the log. */

 static void log_resize_acquire()

+  mysql_mutex_assert_owner(&LOCK_global_system_variables);

   if (!log_sys.is_pmem())

     while (flush_lock.acquire(log_sys.get_lsn() + 1, nullptr) !=

I am also wondering if the logic regarding flush_lock.acquire() and write_lock.acquire(), which are partly visible above, is correct.

wlad, I wonder if there is any chance that another thread in log_write_up_to() could get "in between" and start to wait for log_sys.latch before log_resize_acquire() completes log_sys.latch.wr_lock()? If yes, is there any way to acquire the write_lock and flush_lock with an even larger LSN (such as LSN_MAX) and rewind to the actual LSN once done? I have a concern that even with the above patch, a hang could be possible.

Attachments

Issue Links

is caused by

MDEV-27812 Allow innodb_log_file_size to change without server restart

Closed

relates to

MDEV-31311 The test innodb.log_file_size_online occasionally hangs

Closed

MDEV-33853 Async rollback prepared transactions during binlog crash recovery

Closed

Activity

People

Assignee:: Marko Mäkelä

Reporter:: Marko Mäkelä

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 2024-09-10 14:08

Updated:: 2024-09-24 06:20

Resolved:: 2024-09-24 06:20

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server