MDEV-33894 included an inadvertent change that causes SET GLOBAL innodb_log_file_size to write incorrect data to the being-resized log file (ib_logfile101). This does not affect the memory mapped log implementation on 64-bit Linux systems (when the log is stored in a mount -o dax file system or in /dev/shm). As far as I can tell, the fix should be this simple:
Per RQG test simplifier the following scenario can cause a replay
1. Start the DB server (innodb_log_file_size : 100663296)
2. One session creates two tables with 100 rows each
3. On session runs a loop containing
SET GLOBAL innodb_log_file_size = 16777216
only.
There are no other sessions running DDL or data modifying SQL.
4. At some point of time SIGKILL is sent to the DB server process.
5. A restart of the DB server is tried.
Replay likelihood of
the simplified RQG test if rr is invoked : ~ 20%
a mixture of different tests (50% invoke rr): ~ 15%
Matthias Leich
added a comment - Per RQG test simplifier the following scenario can cause a replay
1. Start the DB server (innodb_log_file_size : 100663296)
2. One session creates two tables with 100 rows each
3. On session runs a loop containing
SET GLOBAL innodb_log_file_size = 16777216
only.
There are no other sessions running DDL or data modifying SQL.
4. At some point of time SIGKILL is sent to the DB server process.
5. A restart of the DB server is tried.
Replay likelihood of
the simplified RQG test if rr is invoked : ~ 20%
a mixture of different tests (50% invoke rr): ~ 15%
@@ -25,17 +25,21 @@ SHOW VARIABLES LIKE 'innodb_log_file_size';
let SEARCH_PATTERN = InnoDB: Resized log to 4\\.000MiB;
--source include/search_pattern_in_file.inc
-UPDATE t SET b='' WHERE a<10;
+--connect con1,localhost,root
+send UPDATE t SET b='' WHERE a<10;
+--connection default
SET GLOBAL innodb_log_file_size=5242880;
SHOW VARIABLES LIKE 'innodb_log_file_size';
SELECT global_value FROM information_schema.system_variables
WHERE variable_name = 'innodb_log_file_size';
+--let $shutdown_timeout=0
--let $restart_parameters=
--source include/restart_mysqld.inc
+--disconnect con1
-SELECT * FROM t WHERE a<10;
+SELECT a FROM t WHERE a<10;
SHOW VARIABLES LIKE 'innodb_log_file_size';
let SEARCH_PATTERN = InnoDB: Resized log to 5\\.000MiB;
Notably, the test did not attempt to kill and restart the server while log resizing was in progress. It did not run any SQL workload during log resizing either; there could only be some workload for purging the committed transaction history.
Marko Mäkelä
added a comment - I am able to reproduce this on the first attempt while modifying the regression test of this feature as follows:
diff --git a/mysql-test/suite/innodb/t/log_file_size_online.test b/mysql-test/suite/innodb/t/log_file_size_online.test
index 65551f13dbc..c2715a2ce91 100644
--- a/mysql-test/suite/innodb/t/log_file_size_online.test
+++ b/mysql-test/suite/innodb/t/log_file_size_online.test
@@ -25,17 +25,21 @@ SHOW VARIABLES LIKE 'innodb_log_file_size';
let SEARCH_PATTERN = InnoDB: Resized log to 4\\.000MiB;
--source include/search_pattern_in_file.inc
-UPDATE t SET b='' WHERE a<10;
+--connect con1,localhost,root
+send UPDATE t SET b='' WHERE a<10;
+--connection default
SET GLOBAL innodb_log_file_size=5242880;
SHOW VARIABLES LIKE 'innodb_log_file_size';
SELECT global_value FROM information_schema.system_variables
WHERE variable_name = 'innodb_log_file_size';
+--let $shutdown_timeout=0
--let $restart_parameters=
--source include/restart_mysqld.inc
+--disconnect con1
-SELECT * FROM t WHERE a<10;
+SELECT a FROM t WHERE a<10;
SHOW VARIABLES LIKE 'innodb_log_file_size';
let SEARCH_PATTERN = InnoDB: Resized log to 5\\.000MiB;
Notably, the test did not attempt to kill and restart the server while log resizing was in progress. It did not run any SQL workload during log resizing either; there could only be some workload for purging the committed transaction history.
After applying the tentative patch in the Description, the test would pass more easily, but still fail in at least two distinct ways:
innodb.log_file_size_online 'slow' w18 [ fail ] Found warnings/errors in server log file!
Test ended at 2024-08-19 11:47:48
line
2024-08-19 11:47:47 0 [ERROR] InnoDB: Page [page id: space=0, page number=307] log sequence number 7166110 is in the future! Current system log sequence number 7166040.
2024-08-19 11:47:47 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
2024-08-19 11:47:47 0 [ERROR] InnoDB: Page [page id: space=5, page number=246] log sequence number 7167328 is in the future! Current system log sequence number 7166040.
2024-08-19 11:47:47 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
and
CURRENT_TEST: innodb.log_file_size_online
mysqltest: At line 42: query 'SELECT a FROM t WHERE a<10' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB'
…
2024-08-19 11:48:22 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT(7080562) at 7412957
Marko Mäkelä
added a comment - After applying the tentative patch in the Description, the test would pass more easily, but still fail in at least two distinct ways:
innodb.log_file_size_online 'slow' w18 [ fail ] Found warnings/errors in server log file!
Test ended at 2024-08-19 11:47:48
line
2024-08-19 11:47:47 0 [ERROR] InnoDB: Page [page id: space=0, page number=307] log sequence number 7166110 is in the future! Current system log sequence number 7166040.
2024-08-19 11:47:47 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
2024-08-19 11:47:47 0 [ERROR] InnoDB: Page [page id: space=5, page number=246] log sequence number 7167328 is in the future! Current system log sequence number 7166040.
2024-08-19 11:47:47 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
and
CURRENT_TEST: innodb.log_file_size_online
mysqltest: At line 42: query 'SELECT a FROM t WHERE a<10' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB'
…
2024-08-19 11:48:22 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT(7080562) at 7412957
@@ -25,13 +25,21 @@ SHOW VARIABLES LIKE 'innodb_log_file_size';
let SEARCH_PATTERN = InnoDB: Resized log to 4\\.000MiB;
--source include/search_pattern_in_file.inc
-UPDATE t SET b='' WHERE a<10;
+--connect con1,localhost,root
+send UPDATE t SET b='' WHERE a<10;
+--connection default
SET GLOBAL innodb_log_file_size=5242880;
SHOW VARIABLES LIKE 'innodb_log_file_size';
SELECT global_value FROM information_schema.system_variables
WHERE variable_name = 'innodb_log_file_size';
+--connection con1
+reap;
+--disconnect con1
+--connection default
+
+--let $shutdown_timeout=0
--let $restart_parameters=
--source include/restart_mysqld.inc
This would fail in yet another way, with a result difference where the UPDATE had not been performed at all. Before the server had been killed, it had completed a log resize:
The recovered LSN was 7055423, which corresponds to a further invocation of log_checkpoint_low().
Because the above failure was observed with the parent commit of MDEV-33894, the log resizing must already have been broken before that change.
Marko Mäkelä
added a comment - I figured out a better change to the test, which should result in a deterministic result of the SELECT statement:
diff --git a/mysql-test/suite/innodb/t/log_file_size_online.test b/mysql-test/suite/innodb/t/log_file_size_online.test
index 65551f13dbc..2ef888b5941 100644
--- a/mysql-test/suite/innodb/t/log_file_size_online.test
+++ b/mysql-test/suite/innodb/t/log_file_size_online.test
@@ -25,13 +25,21 @@ SHOW VARIABLES LIKE 'innodb_log_file_size';
let SEARCH_PATTERN = InnoDB: Resized log to 4\\.000MiB;
--source include/search_pattern_in_file.inc
-UPDATE t SET b='' WHERE a<10;
+--connect con1,localhost,root
+send UPDATE t SET b='' WHERE a<10;
+--connection default
SET GLOBAL innodb_log_file_size=5242880;
SHOW VARIABLES LIKE 'innodb_log_file_size';
SELECT global_value FROM information_schema.system_variables
WHERE variable_name = 'innodb_log_file_size';
+--connection con1
+reap;
+--disconnect con1
+--connection default
+
+--let $shutdown_timeout=0
--let $restart_parameters=
--source include/restart_mysqld.inc
This would fail in yet another way, with a result difference where the UPDATE had not been performed at all. Before the server had been killed, it had completed a log resize:
10.11 27a3366663d91d4181cafc7d4a8569d06d6a759e
2024-08-19 12:01:51 0 [Note] InnoDB: Resized log to 5.000MiB; start LSN=7054124
The recovered LSN was 7055423, which corresponds to a further invocation of log_checkpoint_low() .
Because the above failure was observed with the parent commit of MDEV-33894 , the log resizing must already have been broken before that change.
Thanks to https://rr-project.org, I determined that the correct buffer to write to the ib_logfile101 is log_sys.resize_buf as it was at the start of log_t::write_buf(). Also, log_t::resize_start() must always use a resizing checkpoint target of the current LSN.
Occasional debug assertion failures or messages related to the log sequence number being in the future are possible, because crash recovery failed to flag an error when the log cannot be scanned past the FILE_CHECKPOINT record.
With all these fixed, I am still observing some occasional failures of the revised test innodb.log_file_size_online. I will continue debugging.
Marko Mäkelä
added a comment - Thanks to https://rr-project.org , I determined that the correct buffer to write to the ib_logfile101 is log_sys.resize_buf as it was at the start of log_t::write_buf() . Also, log_t::resize_start() must always use a resizing checkpoint target of the current LSN.
Occasional debug assertion failures or messages related to the log sequence number being in the future are possible, because crash recovery failed to flag an error when the log cannot be scanned past the FILE_CHECKPOINT record.
With all these fixed, I am still observing some occasional failures of the revised test innodb.log_file_size_online . I will continue debugging.
The remaining failures occurred because my fix was rounding down the latest LSN to the log_sys.write_size, while it should have rounded it up. After this fix, SET GLOBAL innodb_log_file_size=… is more likely to be blocked until there have been further writes to InnoDB tables, because an entire new log block will have to be filled. In the worst case, this will require the LSN to grow by 4095 bytes.
I filed MDEV-34802 for the lapse in recovery that allows InnoDB to start up on a corrupted log. It affects also older versions.
Marko Mäkelä
added a comment - The remaining failures occurred because my fix was rounding down the latest LSN to the log_sys.write_size , while it should have rounded it up. After this fix, SET GLOBAL innodb_log_file_size=… is more likely to be blocked until there have been further writes to InnoDB tables, because an entire new log block will have to be filled. In the worst case, this will require the LSN to grow by 4095 bytes.
I filed MDEV-34802 for the lapse in recovery that allows InnoDB to start up on a corrupted log. It affects also older versions.
origin/bb-10.11-MDEV-34802-MDEV-34750 bfaffdff6324953b46b92b04e68d405d33875d64 2024-08-28T07:51:32+03:00
behaved well in RQG testing. No new problems.
Matthias Leich
added a comment - origin/bb-10.11- MDEV-34802 - MDEV-34750 bfaffdff6324953b46b92b04e68d405d33875d64 2024-08-28T07:51:32+03:00
behaved well in RQG testing. No new problems.
To avoid a regression where SET GLOBAL innodb_log_file_size would never finish, I revised the fix so that redundant FILE_CHECKPOINT records may be written in order to reach the resize target LSN.
Marko Mäkelä
added a comment - To avoid a regression where SET GLOBAL innodb_log_file_size would never finish, I revised the fix so that redundant FILE_CHECKPOINT records may be written in order to reach the resize target LSN.
A more complex fix seems to be necessary. I think that a reproducible test case will be needed.