Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Fixed
-
10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL), 11.3(EOL), 11.4, 11.5(EOL), 11.6(EOL), 11.7(EOL)
Description
While testing MDEV-33853, mleich produced a core dump of a server hang, which resulted in an intentional crash with the infamous message that innodb_fatal_semaphore_wait for dict_sys.latch was exceeded.
In the stack traces that I analyzed, there were several threads trying to execute SET GLOBAL innodb_log_file_size as well as one DDL operation that was waiting on an exclusive log_sys.latch and therefore in the end blocking all threads, while holding exclusive dict_sys.latch.
I think that the root cause is flawed logic in log_resize_acquire(). It could also make sense to protect that function with LOCK_global_system_variables so that multiple instances of log_resize_acquire() can be executed concurrently. A fix might look as follows. This will need some additional stress testing to validate this.
diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
|
index ce2d3958f9c..27adf670b11 100644
|
--- a/storage/innobase/handler/ha_innodb.cc
|
+++ b/storage/innobase/handler/ha_innodb.cc
|
@@ -18518,9 +18518,7 @@ buffer_pool_load_abort(
|
static void innodb_log_file_buffering_update(THD *thd, st_mysql_sys_var*,
|
void *, const void *save)
|
{
|
- mysql_mutex_unlock(&LOCK_global_system_variables);
|
log_sys.set_buffered(*static_cast<const my_bool*>(save));
|
- mysql_mutex_lock(&LOCK_global_system_variables);
|
}
|
#endif
|
|
@@ -18528,7 +18526,6 @@ static void innodb_log_file_size_update(THD *thd, st_mysql_sys_var*,
|
void *var, const void *save)
|
{
|
ut_ad(var == &srv_log_file_size);
|
- mysql_mutex_unlock(&LOCK_global_system_variables);
|
|
if (high_level_read_only)
|
ib_senderrf(thd, IB_LOG_LEVEL_ERROR, ER_READ_ONLY_MODE);
|
@@ -18551,13 +18548,15 @@ static void innodb_log_file_size_update(THD *thd, st_mysql_sys_var*,
|
ib_senderrf(thd, IB_LOG_LEVEL_ERROR, ER_CANT_CREATE_HANDLER_FILE);
|
break;
|
case log_t::RESIZE_STARTED:
|
+ mysql_mutex_unlock(&LOCK_global_system_variables);
|
const lsn_t start{log_sys.resize_in_progress()};
|
for (timespec abstime;;)
|
{
|
if (thd_kill_level(thd))
|
{
|
+ mysql_mutex_lock(&LOCK_global_system_variables);
|
log_sys.resize_abort();
|
- break;
|
+ return;
|
}
|
|
set_timespec(abstime, 5);
|
@@ -18588,9 +18587,9 @@ static void innodb_log_file_size_update(THD *thd, st_mysql_sys_var*,
|
if (!resizing || resizing > start /* only wait for our resize */)
|
break;
|
}
|
+ mysql_mutex_lock(&LOCK_global_system_variables);
|
}
|
}
|
- mysql_mutex_lock(&LOCK_global_system_variables);
|
}
|
|
static void innodb_log_spin_wait_delay_update(THD *, st_mysql_sys_var*,
|
diff --git a/storage/innobase/log/log0log.cc b/storage/innobase/log/log0log.cc
|
index d7aae556ce0..7c94876996c 100644
|
--- a/storage/innobase/log/log0log.cc
|
+++ b/storage/innobase/log/log0log.cc
|
@@ -384,6 +384,8 @@ void log_t::close_file()
|
/** Acquire all latches that protect the log. */
|
static void log_resize_acquire()
|
{
|
+ mysql_mutex_assert_owner(&LOCK_global_system_variables);
|
+
|
if (!log_sys.is_pmem())
|
{
|
while (flush_lock.acquire(log_sys.get_lsn() + 1, nullptr) != |
I am also wondering if the logic regarding flush_lock.acquire() and write_lock.acquire(), which are partly visible above, is correct.
wlad, I wonder if there is any chance that another thread in log_write_up_to() could get "in between" and start to wait for log_sys.latch before log_resize_acquire() completes log_sys.latch.wr_lock()? If yes, is there any way to acquire the write_lock and flush_lock with an even larger LSN (such as LSN_MAX) and rewind to the actual LSN once done? I have a concern that even with the above patch, a hang could be possible.
Attachments
Issue Links
- is caused by
-
MDEV-27812 Allow innodb_log_file_size to change without server restart
-
- Closed
-
- relates to
-
MDEV-31311 The test innodb.log_file_size_online occasionally hangs
-
- Closed
-
-
MDEV-33853 Async rollback prepared transactions during binlog crash recovery
-
- Closed
-
A cleaner option would seem to be to introduce a log_sys.checkpoint_latch that would be acquired together with an exclusive lock_sys.latch, to cover starting and aborting log resizing, as well as checkpoints, which is where log resizing can be completed. In that way we would not have to extend the hold time of LOCK_global_system_variables, like the above tentative patch does. (It is not a good idea to hold a mutex while creating a potentially huge ib_logfile101 for the resizing.)
I will have to check if that could work for log_t::set_buffered() as well, to change the O_DIRECT‌ness of the ib_logfile0. Possibly, set_buffered() could actually work based on exclusive log_sys.latch only.