[MDEV-34750] SET GLOBAL innodb_log_file_size is not crash safe - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.11.9, 11.1.6, 11.2.5, 11.4.3, 11.5.2, 11.6.1
Fix Version/s: 10.11.10, 11.2.6, 11.4.4
Component/s: Storage Engine - InnoDB
Labels:

Description

~~MDEV-33894~~ included an inadvertent change that causes SET GLOBAL innodb_log_file_size to write incorrect data to the being-resized log file (ib_logfile101). This does not affect the memory mapped log implementation on 64-bit Linux systems (when the log is stored in a mount -o dax file system or in /dev/shm). As far as I can tell, the fix should be this simple:

diff --git a/storage/innobase/log/log0log.cc b/storage/innobase/log/log0log.cc

index 93f8db6bfc0..84219dbc830 100644

--- a/storage/innobase/log/log0log.cc

+++ b/storage/innobase/log/log0log.cc

@@ -804,7 +804,7 @@ void log_t::resize_write_buf(size_t length) noexcept

   ut_a(os_file_write_func(IORequestWrite, "ib_logfile101", resize_log.m_file,

-                          buf, offset, length) == DB_SUCCESS);

+                          resize_flush_buf, offset, length) == DB_SUCCESS);

 /** Write buf to ib_logfile0.

Attachments

Issue Links

blocks

MDEV-34062 mariadb-backup --backup is extremely slow at copying ib_logfile0

Closed

causes

MDEV-35411 innodb.log_file_size_online occasionally fails

Closed

relates to

MDEV-33894 MariaDB does unexpected storage read IO for the redo log

Closed

MDEV-34802 Recovery fails to note some log corruption

Closed

MDEV-34875 provide mechanism to monitor the status of dynamic redo log resize operation

Open

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä created issue - 2024-08-13 10:39

Marko Mäkelä made changes - 2024-08-13 10:39

Field	Original Value	New Value
Link		This issue is caused by ~~MDEV-33894~~ [ ~~MDEV-33894~~ ]

Marko Mäkelä made changes - 2024-08-13 10:39

Status

Open [ 1 ]

In Progress [ 3 ]

Marko Mäkelä made changes - 2024-08-13 10:52

Status

In Progress [ 3 ]

In Testing [ 10301 ]

Marko Mäkelä made changes - 2024-08-13 10:52

Assignee

Marko Mäkelä [ marko ]

Matthias Leich [ mleich ]

Marko Mäkelä added a comment - 2024-08-14 12:36

A more complex fix seems to be necessary. I think that a reproducible test case will be needed.

Marko Mäkelä added a comment - 2024-08-14 12:36 A more complex fix seems to be necessary. I think that a reproducible test case will be needed.

Matthias Leich added a comment - 2024-08-15 09:43

Per RQG test simplifier the following scenario can cause a replay
1. Start the DB server (innodb_log_file_size : 100663296)
2. One session creates two tables with 100 rows each
3. On session runs a loop containing
SET GLOBAL innodb_log_file_size = 16777216
only.
There are no other sessions running DDL or data modifying SQL.
4. At some point of time SIGKILL is sent to the DB server process.
5. A restart of the DB server is tried.

Replay likelihood of

the simplified RQG test if rr is invoked : ~ 20%
a mixture of different tests (50% invoke rr): ~ 15%

Matthias Leich added a comment - 2024-08-15 09:43 Per RQG test simplifier the following scenario can cause a replay 1. Start the DB server (innodb_log_file_size : 100663296) 2. One session creates two tables with 100 rows each 3. On session runs a loop containing SET GLOBAL innodb_log_file_size = 16777216 only. There are no other sessions running DDL or data modifying SQL. 4. At some point of time SIGKILL is sent to the DB server process. 5. A restart of the DB server is tried. Replay likelihood of the simplified RQG test if rr is invoked : ~ 20% a mixture of different tests (50% invoke rr): ~ 15%

Marko Mäkelä added a comment - 2024-08-19 08:30

I am able to reproduce this on the first attempt while modifying the regression test of this feature as follows:

diff --git a/mysql-test/suite/innodb/t/log_file_size_online.test b/mysql-test/suite/innodb/t/log_file_size_online.test

index 65551f13dbc..c2715a2ce91 100644

--- a/mysql-test/suite/innodb/t/log_file_size_online.test

+++ b/mysql-test/suite/innodb/t/log_file_size_online.test

@@ -25,17 +25,21 @@ SHOW VARIABLES LIKE 'innodb_log_file_size';

 let SEARCH_PATTERN = InnoDB: Resized log to 4\\.000MiB;

 --source include/search_pattern_in_file.inc

-UPDATE t SET b='' WHERE a<10;

+--connect con1,localhost,root

+send UPDATE t SET b='' WHERE a<10;

+--connection default

 SET GLOBAL innodb_log_file_size=5242880;

 SHOW VARIABLES LIKE 'innodb_log_file_size';

 SELECT global_value FROM information_schema.system_variables

 WHERE variable_name = 'innodb_log_file_size';

+--let $shutdown_timeout=0

 --let $restart_parameters=

 --source include/restart_mysqld.inc

+--disconnect con1

-SELECT * FROM t WHERE a<10;

+SELECT a FROM t WHERE a<10;

 SHOW VARIABLES LIKE 'innodb_log_file_size';

 let SEARCH_PATTERN = InnoDB: Resized log to 5\\.000MiB;

Notably, the test did not attempt to kill and restart the server while log resizing was in progress. It did not run any SQL workload during log resizing either; there could only be some workload for purging the committed transaction history.

Marko Mäkelä added a comment - 2024-08-19 08:30 I am able to reproduce this on the first attempt while modifying the regression test of this feature as follows: diff --git a/mysql-test/suite/innodb/t/log_file_size_online.test b/mysql-test/suite/innodb/t/log_file_size_online.test index 65551f13dbc..c2715a2ce91 100644 --- a/mysql-test/suite/innodb/t/log_file_size_online.test +++ b/mysql-test/suite/innodb/t/log_file_size_online.test @@ -25,17 +25,21 @@ SHOW VARIABLES LIKE 'innodb_log_file_size'; let SEARCH_PATTERN = InnoDB: Resized log to 4\\.000MiB; --source include/search_pattern_in_file.inc -UPDATE t SET b='' WHERE a<10; +--connect con1,localhost,root +send UPDATE t SET b='' WHERE a<10; +--connection default SET GLOBAL innodb_log_file_size=5242880; SHOW VARIABLES LIKE 'innodb_log_file_size'; SELECT global_value FROM information_schema.system_variables WHERE variable_name = 'innodb_log_file_size'; +--let $shutdown_timeout=0 --let $restart_parameters= --source include/restart_mysqld.inc +--disconnect con1 -SELECT * FROM t WHERE a<10; +SELECT a FROM t WHERE a<10; SHOW VARIABLES LIKE 'innodb_log_file_size'; let SEARCH_PATTERN = InnoDB: Resized log to 5\\.000MiB; Notably, the test did not attempt to kill and restart the server while log resizing was in progress. It did not run any SQL workload during log resizing either; there could only be some workload for purging the committed transaction history.

Marko Mäkelä made changes - 2024-08-19 08:30

Assignee	Matthias Leich [ mleich ]	Marko Mäkelä [ marko ]
Status	In Testing [ 10301 ]	Stalled [ 10000 ]

Marko Mäkelä made changes - 2024-08-19 08:30

Status

Stalled [ 10000 ]

In Progress [ 3 ]

Marko Mäkelä added a comment - 2024-08-19 08:50

After applying the tentative patch in the Description, the test would pass more easily, but still fail in at least two distinct ways:

innodb.log_file_size_online 'slow'       w18 [ fail ]  Found warnings/errors in server log file!

        Test ended at 2024-08-19 11:47:48

line

2024-08-19 11:47:47 0 [ERROR] InnoDB: Page [page id: space=0, page number=307] log sequence number 7166110 is in the future! Current system log sequence number 7166040.

2024-08-19 11:47:47 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.

2024-08-19 11:47:47 0 [ERROR] InnoDB: Page [page id: space=5, page number=246] log sequence number 7167328 is in the future! Current system log sequence number 7166040.

2024-08-19 11:47:47 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.

and

CURRENT_TEST: innodb.log_file_size_online

mysqltest: At line 42: query 'SELECT a FROM t WHERE a<10' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB'

…

2024-08-19 11:48:22 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT(7080562) at 7412957

Marko Mäkelä added a comment - 2024-08-19 08:50 After applying the tentative patch in the Description, the test would pass more easily, but still fail in at least two distinct ways: innodb.log_file_size_online 'slow' w18 [ fail ] Found warnings/errors in server log file! Test ended at 2024-08-19 11:47:48 line 2024-08-19 11:47:47 0 [ERROR] InnoDB: Page [page id: space=0, page number=307] log sequence number 7166110 is in the future! Current system log sequence number 7166040. 2024-08-19 11:47:47 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery. 2024-08-19 11:47:47 0 [ERROR] InnoDB: Page [page id: space=5, page number=246] log sequence number 7167328 is in the future! Current system log sequence number 7166040. 2024-08-19 11:47:47 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery. and CURRENT_TEST: innodb.log_file_size_online mysqltest: At line 42: query 'SELECT a FROM t WHERE a<10' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB' … 2024-08-19 11:48:22 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT(7080562) at 7412957

Marko Mäkelä added a comment - 2024-08-19 09:17

I figured out a better change to the test, which should result in a deterministic result of the SELECT statement:

diff --git a/mysql-test/suite/innodb/t/log_file_size_online.test b/mysql-test/suite/innodb/t/log_file_size_online.test

index 65551f13dbc..2ef888b5941 100644

--- a/mysql-test/suite/innodb/t/log_file_size_online.test

+++ b/mysql-test/suite/innodb/t/log_file_size_online.test

@@ -25,13 +25,21 @@ SHOW VARIABLES LIKE 'innodb_log_file_size';

 let SEARCH_PATTERN = InnoDB: Resized log to 4\\.000MiB;

 --source include/search_pattern_in_file.inc

-UPDATE t SET b='' WHERE a<10;

+--connect con1,localhost,root

+send UPDATE t SET b='' WHERE a<10;

+--connection default

 SET GLOBAL innodb_log_file_size=5242880;

 SHOW VARIABLES LIKE 'innodb_log_file_size';

 SELECT global_value FROM information_schema.system_variables

 WHERE variable_name = 'innodb_log_file_size';

+--connection con1

+reap;

+--disconnect con1

+--connection default

+--let $shutdown_timeout=0

 --let $restart_parameters=

 --source include/restart_mysqld.inc

This would fail in yet another way, with a result difference where the UPDATE had not been performed at all. Before the server had been killed, it had completed a log resize:

10.11 27a3366663d91d4181cafc7d4a8569d06d6a759e
2024-08-19 12:01:51 0 [Note] InnoDB: Resized log to 5.000MiB; start LSN=7054124

The recovered LSN was 7055423, which corresponds to a further invocation of log_checkpoint_low().

Because the above failure was observed with the parent commit of ~~MDEV-33894~~, the log resizing must already have been broken before that change.

Marko Mäkelä added a comment - 2024-08-19 09:17 I figured out a better change to the test, which should result in a deterministic result of the SELECT statement: diff --git a/mysql-test/suite/innodb/t/log_file_size_online.test b/mysql-test/suite/innodb/t/log_file_size_online.test index 65551f13dbc..2ef888b5941 100644 --- a/mysql-test/suite/innodb/t/log_file_size_online.test +++ b/mysql-test/suite/innodb/t/log_file_size_online.test @@ -25,13 +25,21 @@ SHOW VARIABLES LIKE 'innodb_log_file_size'; let SEARCH_PATTERN = InnoDB: Resized log to 4\\.000MiB; --source include/search_pattern_in_file.inc -UPDATE t SET b='' WHERE a<10; +--connect con1,localhost,root +send UPDATE t SET b='' WHERE a<10; +--connection default SET GLOBAL innodb_log_file_size=5242880; SHOW VARIABLES LIKE 'innodb_log_file_size'; SELECT global_value FROM information_schema.system_variables WHERE variable_name = 'innodb_log_file_size'; +--connection con1 +reap; +--disconnect con1 +--connection default + +--let $shutdown_timeout=0 --let $restart_parameters= --source include/restart_mysqld.inc This would fail in yet another way, with a result difference where the UPDATE had not been performed at all. Before the server had been killed, it had completed a log resize: 10.11 27a3366663d91d4181cafc7d4a8569d06d6a759e 2024-08-19 12:01:51 0 [Note] InnoDB: Resized log to 5.000MiB; start LSN=7054124 The recovered LSN was 7055423, which corresponds to a further invocation of log_checkpoint_low() . Because the above failure was observed with the parent commit of MDEV-33894 , the log resizing must already have been broken before that change.

Marko Mäkelä made changes - 2024-08-19 09:21

Link

This issue blocks ~~MDEV-34062~~ [ ~~MDEV-34062~~ ]

Marko Mäkelä made changes - 2024-08-19 09:22

Link

This issue is caused by ~~MDEV-33894~~ [ ~~MDEV-33894~~ ]

Marko Mäkelä made changes - 2024-08-19 09:22

Link

This issue relates to ~~MDEV-33894~~ [ ~~MDEV-33894~~ ]

Marko Mäkelä added a comment - 2024-08-22 14:59

Thanks to https://rr-project.org, I determined that the correct buffer to write to the ib_logfile101 is log_sys.resize_buf as it was at the start of log_t::write_buf(). Also, log_t::resize_start() must always use a resizing checkpoint target of the current LSN.

Occasional debug assertion failures or messages related to the log sequence number being in the future are possible, because crash recovery failed to flag an error when the log cannot be scanned past the FILE_CHECKPOINT record.

With all these fixed, I am still observing some occasional failures of the revised test innodb.log_file_size_online. I will continue debugging.

Marko Mäkelä added a comment - 2024-08-22 14:59 Thanks to https://rr-project.org , I determined that the correct buffer to write to the ib_logfile101 is log_sys.resize_buf as it was at the start of log_t::write_buf() . Also, log_t::resize_start() must always use a resizing checkpoint target of the current LSN. Occasional debug assertion failures or messages related to the log sequence number being in the future are possible, because crash recovery failed to flag an error when the log cannot be scanned past the FILE_CHECKPOINT record. With all these fixed, I am still observing some occasional failures of the revised test innodb.log_file_size_online . I will continue debugging.

Marko Mäkelä made changes - 2024-08-23 05:29

Assignee	Marko Mäkelä [ marko ]	Debarun Banerjee [ JIRAUSER54513 ]
Status	In Progress [ 3 ]	In Review [ 10002 ]

Marko Mäkelä made changes - 2024-08-23 05:47

Link

This issue relates to ~~MDEV-34802~~ [ ~~MDEV-34802~~ ]

Marko Mäkelä added a comment - 2024-08-23 05:53

The remaining failures occurred because my fix was rounding down the latest LSN to the log_sys.write_size, while it should have rounded it up. After this fix, SET GLOBAL innodb_log_file_size=… is more likely to be blocked until there have been further writes to InnoDB tables, because an entire new log block will have to be filled. In the worst case, this will require the LSN to grow by 4095 bytes.

I filed ~~MDEV-34802~~ for the lapse in recovery that allows InnoDB to start up on a corrupted log. It affects also older versions.

Marko Mäkelä added a comment - 2024-08-23 05:53 The remaining failures occurred because my fix was rounding down the latest LSN to the log_sys.write_size , while it should have rounded it up. After this fix, SET GLOBAL innodb_log_file_size=… is more likely to be blocked until there have been further writes to InnoDB tables, because an entire new log block will have to be filled. In the worst case, this will require the LSN to grow by 4095 bytes. I filed MDEV-34802 for the lapse in recovery that allows InnoDB to start up on a corrupted log. It affects also older versions.

Debarun Banerjee made changes - 2024-08-27 11:39

Assignee	Debarun Banerjee [ JIRAUSER54513 ]	Marko Mäkelä [ marko ]
Status	In Review [ 10002 ]	Stalled [ 10000 ]

Marko Mäkelä made changes - 2024-08-29 08:33

Assignee	Marko Mäkelä [ marko ]	Debarun Banerjee [ JIRAUSER54513 ]
Status	Stalled [ 10000 ]	In Review [ 10002 ]

Matthias Leich added a comment - 2024-08-29 09:55

origin/bb-10.11-~~MDEV-34802~~-~~MDEV-34750~~ bfaffdff6324953b46b92b04e68d405d33875d64 2024-08-28T07:51:32+03:00
behaved well in RQG testing. No new problems.

Matthias Leich added a comment - 2024-08-29 09:55 origin/bb-10.11- MDEV-34802 - MDEV-34750 bfaffdff6324953b46b92b04e68d405d33875d64 2024-08-28T07:51:32+03:00 behaved well in RQG testing. No new problems.

Marko Mäkelä added a comment - 2024-08-29 12:10

To avoid a regression where SET GLOBAL innodb_log_file_size would never finish, I revised the fix so that redundant FILE_CHECKPOINT records may be written in order to reach the resize target LSN.

Marko Mäkelä added a comment - 2024-08-29 12:10 To avoid a regression where SET GLOBAL innodb_log_file_size would never finish, I revised the fix so that redundant FILE_CHECKPOINT records may be written in order to reach the resize target LSN.

Debarun Banerjee added a comment - 2024-08-29 12:12

Thanks marko, the patch looks good.

Debarun Banerjee added a comment - 2024-08-29 12:12 Thanks marko , the patch looks good.

Debarun Banerjee made changes - 2024-08-29 12:12

Assignee	Debarun Banerjee [ JIRAUSER54513 ]	Marko Mäkelä [ marko ]
Status	In Review [ 10002 ]	Stalled [ 10000 ]

Marko Mäkelä made changes - 2024-08-29 13:04

issue.field.resolutiondate

2024-08-29 13:04:41.0

2024-08-29 13:04:41.236

Marko Mäkelä made changes - 2024-08-29 13:04

Fix Version/s		10.11.10 [ 29904 ]
Fix Version/s		11.2.6 [ 29906 ]
Fix Version/s		11.4.4 [ 29907 ]
Fix Version/s	10.11 [ 27614 ]
Fix Version/s	11.2 [ 28603 ]
Fix Version/s	11.4 [ 29301 ]
Resolution		Fixed [ 1 ]
Status	Stalled [ 10000 ]	Closed [ 6 ]

Marko Mäkelä made changes - 2024-09-05 09:31

Link

This issue relates to MDEV-34875 [ MDEV-34875 ]

Marko Mäkelä made changes - 2024-11-13 09:18

Link

This issue causes ~~MDEV-35411~~ [ ~~MDEV-35411~~ ]

People

Assignee:: Marko Mäkelä

Reporter:: Marko Mäkelä

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 2024-08-13 10:39

Updated:: 2024-11-13 09:18

Resolved:: 2024-08-29 13:04

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration