[MDEV-36082] Race condition between log_t::resize_start() and log_t::resize_abort()

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.11.11
Fix Version/s: 10.11.12, 11.4.6, 11.8.2
Component/s: Storage Engine - InnoDB
Labels:
- race

Description

mleich produced an rr replay trace where SET GLOBAL innodb_log_file_size leads to a debug assertion failure:

10.11-MDEV-29445

mariadbd: /data/Server/10.11-MDEV-29445/storage/innobase/log/log0log.cc:972: lsn_t log_t::write_buf() [with resizing_and_latch resizing = log_t::RESIZING; lsn_t = long unsigned int]: Assertion `resizing == RETAIN_LATCH || (resizing == RESIZING) == (resize_in_progress() > 1)' failed.

We have log_sys.resize_lsn = 1, which means that log resizing is only about to start, so log_t::writer_update() should not yet have assigned log_sys.writer=log_writer_resizing.
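To make the failing condition concrete, here is a minimal standalone model of the asserted expression. It is only a sketch, not the InnoDB code: the function name write_buf_invariant is hypothetical, the enumerator NOT_RESIZING is an assumption (only RETAIN_LATCH and RESIZING appear in the assertion message), and the parameter resize_lsn stands in for the value returned by resize_in_progress().

```cpp
#include <cstdint>

using lsn_t = std::uint64_t;

// RETAIN_LATCH and RESIZING are taken from the assertion message;
// NOT_RESIZING is an assumed third value for this sketch.
enum resizing_and_latch { RETAIN_LATCH, NOT_RESIZING, RESIZING };

// Hypothetical helper that mirrors the asserted expression.
// resize_lsn plays the role of resize_in_progress().
constexpr bool write_buf_invariant(resizing_and_latch resizing,
                                   lsn_t resize_lsn)
{
  return resizing == RETAIN_LATCH ||
    (resizing == RESIZING) == (resize_lsn > 1);
}

// The state from the trace: the RESIZING writer is active while
// resize_lsn is still 1, so the condition is violated.
static_assert(!write_buf_invariant(RESIZING, 1),
              "RESIZING writer with resize_lsn=1 violates the condition");
```

In this model, the RESIZING instantiation is only consistent once resize_lsn has advanced past 1, which matches the reasoning above.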

In this trace, one log resizing is about to start, and another is being interrupted:

SET GLOBAL innodb_log_file_size = 104857600 + 52428800 /* E_R Thread2 QNO 466 CON_ID 195 */ ;

SET GLOBAL innodb_log_file_size = 104857600 /* E_R Thread2 QNO 454 CON_ID 192 */ ;

Issue Links

is caused by

MDEV-27812 Allow innodb_log_file_size to change without server restart (Closed)

relates to

MDEV-35810 Missing error handling in log resizing around ib_logfile101 (Closed)

Activity

Debarun Banerjee added a comment - 2025-02-17 06:09

marko This fix would help fix the assertion failure, but the root of the problem seems to lie elsewhere. The problem is that the constraint that "I should abort my own resize" is violated. Unless that is fixed, it could cause other issues; for example, we could abort a future resize operation, which would return success to the end user.

I have added more details in https://github.com/MariaDB/server/pull/3835 for your consideration.

Marko Mäkelä added a comment - 2025-02-17 12:20

I ended up introducing a state variable log_sys.resize_initiator in order to avoid another potential glitch when one SET GLOBAL innodb_log_file_size fails to notice that the log resizing had completed, and then a subsequent SET GLOBAL innodb_log_file_size was started. By keeping track of the thread that initiated the log resizing, we can make sure that only that thread will wait for its own log resizing to be completed.
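The ownership idea can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the actual patch: only the name resize_initiator comes from the comment above; the class ResizeState, its methods, the use of std::mutex, and the convention that resize_lsn is 0 when no resize is pending are all hypothetical simplifications.

```cpp
#include <cstdint>
#include <mutex>

// Hypothetical, simplified stand-in for the resize bookkeeping.
// Only the name resize_initiator comes from the issue comment.
class ResizeState
{
public:
  // Begin a resize on behalf of thd; fails if one is already pending.
  bool start(const void *thd)
  {
    std::lock_guard<std::mutex> g(mutex_);
    if (resize_lsn_ != 0)
      return false;
    resize_lsn_ = 1;        // 1 = resize is only about to start
    resize_initiator_ = thd;
    return true;
  }

  // Abort only if thd is the thread that started the pending resize.
  bool abort(const void *thd)
  {
    std::lock_guard<std::mutex> g(mutex_);
    if (resize_lsn_ == 0 || resize_initiator_ != thd)
      return false;         // nothing pending, or not our resize
    clear();
    return true;
  }

  // Called when the pending resize finishes normally.
  void complete()
  {
    std::lock_guard<std::mutex> g(mutex_);
    clear();
  }

  // Should thd keep waiting? True only for its own pending resize.
  bool is_initiator(const void *thd)
  {
    std::lock_guard<std::mutex> g(mutex_);
    return resize_lsn_ != 0 && resize_initiator_ == thd;
  }

private:
  void clear() { resize_lsn_ = 0; resize_initiator_ = nullptr; }

  std::mutex mutex_;
  std::uint64_t resize_lsn_ = 0;            // assumed: 0 = no resize
  const void *resize_initiator_ = nullptr;  // thread that called start()
};
```

With this guard, a thread whose own resize has already completed cannot abort, or wait on, a resize that a later SET GLOBAL innodb_log_file_size started.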

People

Assignee: Marko Mäkelä

Reporter: Marko Mäkelä

Votes: 0

Watchers: 3

Dates

Created: 2025-02-13 11:14

Updated: 2025-02-24 10:24

Resolved: 2025-02-19 09:03

MariaDB Server