[MDEV-31851] After crash recovery, Checksum mismatch + Failing assertion: !i || prev_id + 1 == space_id, - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Blocker
Resolution: Fixed
Affects Version/s: 11.2.0, 10.5, 10.6, 10.9(EOL), 10.10(EOL), 10.11, 11.1(EOL), 11.2(EOL)
Fix Version/s: 10.5.23, 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3
Component/s: Storage Engine - InnoDB
Labels:
- rr-profile-analyzed

Description

origin/11.2 f9003c73a15d892423845a2df96c8bd73bbe96ca 2023-08-01T19:43:04+05:30

Scenario:

1. Bootstrap, start the server, generate some initial data

2. One session runs a DDL/DML mix.

3. During 2. is ongoing stop the server with SIGKILL

4. Restart attempt which fails with

....

# 2023-08-03T23:43:46 [1211400] | [rr 1392192 3011]2023-08-03 23:43:46 0 [Note] InnoDB: innodb_page_size=32768

# 2023-08-03T23:43:46 [1211400] | [rr 1392192 3015]2023-08-03 23:43:46 0 [Note] InnoDB: Compressed tables use zlib 1.2.11

# 2023-08-03T23:43:46 [1211400] | [rr 1392192 3040]2023-08-03 23:43:46 0 [Note] InnoDB: Number of transaction pools: 1

# 2023-08-03T23:43:46 [1211400] | [rr 1392192 3046]2023-08-03 23:43:46 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions

# 2023-08-03T23:43:46 [1211400] | [rr 1392192 3074]2023-08-03 23:43:46 0 [Note] InnoDB: Initializing buffer pool, total size = 24.000MiB, chunk size = 1.000MiB

# 2023-08-03T23:43:46 [1211400] | [rr 1392192 3136]2023-08-03 23:43:46 0 [Note] InnoDB: Completed initialization of buffer pool

# 2023-08-03T23:43:46 [1211400] | [rr 1392192 3222]2023-08-03 23:43:46 0 [ERROR] InnoDB: Checksum mismatch in the first page of file .//undo006

# 2023-08-03T23:43:46 [1211400] | [rr 1392192 3226]2023-08-03 23:43:46 0 [ERROR] InnoDB: Unable to read first page of file .//undo006

# 2023-08-03T23:43:46 [1211400] | [rr 1392192 3230]2023-08-03 23:43:46 0x7f85c2d05c40[rr 1392192 3233]  InnoDB: Assertion failure in file /data/Server/11.2/storage/innobase/srv/srv0start.cc line 803

# 2023-08-03T23:43:46 [1211400] | [rr 1392192 3235]InnoDB: Failing assertion: !i || prev_id + 1 == space_id

# 2023-08-03T23:45:26 [1211400] | Thread 1 (Thread 1392192.1392192):

# 2023-08-03T23:45:26 [1211400] | #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50

# 2023-08-03T23:45:26 [1211400] | #1  0x00007f85c2d2e859 in __GI_abort () at abort.c:79

# 2023-08-03T23:45:26 [1211400] | #2  0x0000559454e4c0f1 in ut_dbg_assertion_failed (expr=expr@entry=0x55945597da92 "!i || prev_id + 1 == space_id", file=file@entry=0x55945597cd68 "/data/Server/11.2/storage/innobase/srv/srv0start.cc", line=line@entry=803) at /data/Server/11.2/storage/innobase/ut/ut0dbg.cc:60

# 2023-08-03T23:45:26 [1211400] | #3  0x0000559454e45c33 in srv_all_undo_tablespaces_open (n_undo=<optimized out>, create_new_undo=<optimized out>) at /data/Server/11.2/storage/innobase/srv/srv0start.cc:803

# 2023-08-03T23:45:26 [1211400] | #4  srv_undo_tablespaces_init (create_new_undo=<optimized out>, mtr=0x7fff0fd5f7e0) at /data/Server/11.2/storage/innobase/srv/srv0start.cc:877

# 2023-08-03T23:45:26 [1211400] | #5  0x0000559454e46dc6 in srv_start (create_new_db=false) at /data/Server/11.2/storage/innobase/srv/srv0start.cc:1423

# 2023-08-03T23:45:26 [1211400] | #6  0x0000559455448b4a in innodb_init (p=<optimized out>) at /data/Server/11.2/storage/innobase/handler/ha_innodb.cc:4171

# 2023-08-03T23:45:26 [1211400] | #7  0x00005594551f460a in ha_initialize_handlerton (plugin=0x559458706270) at /data/Server/11.2/sql/handler.cc:687

# 2023-08-03T23:45:26 [1211400] | #8  0x0000559454fba8ae in plugin_initialize (tmp_root=tmp_root@entry=0x7fff0fd60c70, plugin=plugin@entry=0x559458706270, argc=argc@entry=0x5594560cef60 <remaining_argc>, argv=argv@entry=0x559458652fb0, options_only=<optimized out>) at /data/Server/11.2/sql/sql_plugin.cc:1465

# 2023-08-03T23:45:26 [1211400] | #9  0x0000559454fbb7b3 in plugin_init (argc=argc@entry=0x5594560cef60 <remaining_argc>, argv=<optimized out>, flags=0) at /data/Server/11.2/sql/sql_plugin.cc:1758

# 2023-08-03T23:45:26 [1211400] | #10 0x0000559454eab03d in init_server_components () at /data/Server/11.2/sql/mysqld.cc:5271

# 2023-08-03T23:45:26 [1211400] | #11 0x0000559454eaffe2 in mysqld_main (argc=<optimized out>, argv=<optimized out>) at /data/Server/11.2/sql/mysqld.cc:5900

# 2023-08-03T23:45:26 [1211400] | #12 0x00007f85c2d300b3 in __libc_start_main (main=0x559454e6b5f0 <main(int, char**)>, argc=60, argv=0x7fff0fd63e28, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff0fd63e18) at ../csu/libc-start.c:308

# 2023-08-03T23:45:26 [1211400] | #13 0x0000559454ea697e in _start () at /data/Server/11.2/include/mysql/plugin.h:215

# 2023-08-03T23:45:26 [1211400] | (rr) quit

pluto:/data/results/1691087822/TBR-1963$ _RR_TRACE_DIR=./1/rr/ rr replay --mark-stdio

RQG

====

# git clone https://github.com/mleich1/rqg --branch <pick the right branch> RQG

# GIT_SHOW: HEAD -> master, origin/master, origin/HEAD 9593f65f8c01f5afb1da093bf9a5912977ea5e00 2023-08-03T20:38:05+02:00

# rqg.pl  : Version 4.3.0 (2023-06)

# $RQG_HOME/rqg.pl \

# --grammar=conf/runtime/alter_online.yy \

# --gendata=conf/runtime/alter_online.zz \

# --mysqld=--innodb_data_file_path=ibdata1:1M:autoextend:autoshrink \

# --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \

# --mysqld=--loose-idle_write_transaction_timeout=0 \

# --mysqld=--loose-idle_transaction_timeout=0 \

# --mysqld=--loose-idle_readonly_transaction_timeout=0 \

# --mysqld=--connect_timeout=60 \

# --mysqld=--interactive_timeout=28800 \

# --mysqld=--slave_net_timeout=60 \

# --mysqld=--net_read_timeout=30 \

# --mysqld=--net_write_timeout=60 \

# --mysqld=--loose-table_lock_wait_timeout=50 \

# --mysqld=--wait_timeout=28800 \

# --no_mask \

# --queries=10000000 \

# --seed=random \

# --reporters=None \

# --reporters=ErrorLog \

# --reporters=Deadlock \

# --validators=None \

# --mysqld=--log_output=none \

# --mysqld=--log_bin_trust_function_creators=1 \

# --mysqld=--loose-debug_assert_on_not_freed_memory=0 \

# --engine=InnoDB \

# --restart_timeout=360 \

# --mysqld=--plugin-load-add=file_key_management.so \

# --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \

# --mysqld=--plugin-load-add=provider_lzo.so \

# --mysqld=--plugin-load-add=provider_bzip2.so \

# --mysqld=--plugin-load-add=provider_lzma.so \

# --mysqld=--plugin-load-add=provider_snappy.so \

# --mysqld=--plugin-load-add=provider_lz4.so \

# --mysqld=--loose-innodb_compression_level=1 \

# --duration=300 \

# --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \

# --mysqld=--lock-wait-timeout=15 \

# --mysqld=--innodb-lock-wait-timeout=10 \

# --mysqld=--loose-innodb_fast_shutdown=1 \

# --mysqld=--sql_mode=STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION \

# --mysqld=--loose-innodb_read_only_compressed=OFF \

# --reporters=CrashRecovery \

# --duration=300 \

# --mysqld=--innodb_stats_persistent=off \

# --mysqld=--innodb_adaptive_hash_index=on \

# --mysqld=--innodb_random_read_ahead=OFF \

# --mysqld=--loose-innodb_evict_tables_on_commit_debug=off \

# --mysqld=--loose-max-statement-time=30 \

# --threads=1 \

# --mysqld=--innodb-use-native-aio=0 \

# --mysqld=--loose-gdb \

# --mysqld=--loose-debug-gdb \

# --rr=Extended \

# --rr_options=--wait \

# --mysqld=--innodb_undo_log_truncate=OFF \

# --mysqld=--loose_innodb_change_buffering=none \

# --mysqld=--innodb_undo_tablespaces=16 \

# --mysqld=--innodb_rollback_on_timeout=OFF \

# --mysqld=--innodb_page_size=32K \

# --mysqld=--innodb-buffer-pool-size=24M \

# <local settings>

There is a second replay of the problem with some other test.

pluto:/data/results/1691087822/TBR-1963_1$ _RR_TRACE_DIR=./1/rr/ rr replay --mark-stdio

Attachments

Issue Links

relates to

MDEV-33341 innodb.undo_space_dblwr test case fails with Unknown Storage Engine InnoDB

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2023-08-04 14:17

thiru, please provide more detail and attach a copy of the data directory that fails to recover. If I understood the scenario correctly, the server was previously killed while it was in the middle of overwriting the first page of an undo tablespace.

If my understanding is correct, we would need two fixes:

Change the assertion failure to a refusal to start up InnoDB.
On checksum mismatch of the first page of the system tablespace or the first page of any undo tablespace, try to recover the page from the doublewrite buffer.

Marko Mäkelä added a comment - 2023-08-04 14:17 thiru , please provide more detail and attach a copy of the data directory that fails to recover. If I understood the scenario correctly, the server was previously killed while it was in the middle of overwriting the first page of an undo tablespace. If my understanding is correct, we would need two fixes: Change the assertion failure to a refusal to start up InnoDB. On checksum mismatch of the first page of the system tablespace or the first page of any undo tablespace, try to recover the page from the doublewrite buffer.

Thirunarayanan Balathandayuthapani added a comment - 2023-08-07 10:07

Analysis is that we do kill the server when we're about to write first page of undo tablespace. But the valid copy of the first page exist in doublewrite buffer.
While starting the server, we try to open the undo tablespaces by reading the first page from the disk. Invalid copy exist in disk and it leads to
checksum mismatch error and refuse to start up the server.

We would need two fixes like marko suggested.

Test case to repeat the issue:

--source include/have_innodb.inc

--source include/have_debug.inc

--source include/not_embedded.inc

SET GLOBAL debug_dbug='+d,ib_log_checkpoint_avoid_hard';

let INNODB_PAGE_SIZE=`select @@innodb_page_size`;

let MYSQLD_DATADIR=`select @@datadir`;

show variables like 'innodb_doublewrite';

create table t1(f1 int not null, f2 int not null)engine=innodb;

insert into t1 values (1, 1);

--source ../include/no_checkpoint_start.inc

--echo # Make the first page dirty for undo tablespace

set global innodb_saved_page_number_debug = 0;

set global innodb_fil_make_page_dirty_debug = 1;

--echo # Ensure that all dirty pages in the system are flushed.

set global innodb_buf_flush_list_now = 1;

--let CLEANUP_IF_CHECKPOINT=drop table t1;

--source include/no_checkpoint_end.inc

perl;

use IO::Handle;

my $fname= "$ENV{'MYSQLD_DATADIR'}/undo001";

my $page_size = $ENV{INNODB_PAGE_SIZE};

open(FILE, "+<", $fname) or die;

sysread(FILE, $page, $page_size)==$page_size||die "Unable to read $name\n";

substr($page, 28, 4) = pack("N", 1000);

sysseek(FILE, 0, 0)||die "Unable to seek $fname\n";

die unless syswrite(FILE, $page, $page_size) == $page_size;

close FILE;

EOF

--source include/start_mysqld.inc

check table t1;

drop table t1;

For the above test case, add the following code too:

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc

index f570a29e55e..9030b78a114 100644

--- a/storage/innobase/buf/buf0flu.cc

+++ b/storage/innobase/buf/buf0flu.cc

@@ -1893,6 +1893,7 @@ static bool log_checkpoint_low(lsn_t oldest_lsn, lsn_t end_lsn)

     return true;

+  DBUG_EXECUTE_IF("ib_log_checkpoint_avoid_hard", goto do_nothing;);

   ut_ad(!recv_no_log_write);

   ut_ad(oldest_lsn > log_sys.last_checkpoint_lsn);

   /* Repeat the FILE_MODIFY records after the checkpoint, in case some

And it repeats the same issue:

2023-08-07 15:32:01 0 [Note] InnoDB: Completed initialization of buffer pool

2023-08-07 15:32:01 0 [ERROR] InnoDB: Checksum mismatch in the first page of file .//undo001

2023-08-07 15:32:01 0 [ERROR] InnoDB: Unable to read first page of file .//undo001

2023-08-07 15:32:01 0x7f5a89471d40  InnoDB: Assertion failure in file /home/thiru/mariadb_parent/server/11.2-new/storage/innobase/srv/srv0start.cc line 803

InnoDB: Failing assertion: !i || prev_id + 1 == space_id

Thirunarayanan Balathandayuthapani added a comment - 2023-08-07 10:07 Analysis is that we do kill the server when we're about to write first page of undo tablespace. But the valid copy of the first page exist in doublewrite buffer. While starting the server, we try to open the undo tablespaces by reading the first page from the disk. Invalid copy exist in disk and it leads to checksum mismatch error and refuse to start up the server. We would need two fixes like marko suggested. Test case to repeat the issue: --source include/have_innodb.inc --source include/have_debug.inc --source include/not_embedded.inc SET GLOBAL debug_dbug='+d,ib_log_checkpoint_avoid_hard'; let INNODB_PAGE_SIZE=`select @@innodb_page_size`; let MYSQLD_DATADIR=`select @@datadir`; show variables like 'innodb_doublewrite'; create table t1(f1 int not null, f2 int not null)engine=innodb; insert into t1 values (1, 1); --source ../include/no_checkpoint_start.inc --echo # Make the first page dirty for undo tablespace set global innodb_saved_page_number_debug = 0; set global innodb_fil_make_page_dirty_debug = 1; --echo # Ensure that all dirty pages in the system are flushed. set global innodb_buf_flush_list_now = 1; --let CLEANUP_IF_CHECKPOINT=drop table t1; --source include/no_checkpoint_end.inc perl; use IO::Handle; my $fname= "$ENV{'MYSQLD_DATADIR'}/undo001"; my $page_size = $ENV{INNODB_PAGE_SIZE}; open(FILE, "+<", $fname) or die; sysread(FILE, $page, $page_size)==$page_size||die "Unable to read $name\n"; substr($page, 28, 4) = pack("N", 1000); sysseek(FILE, 0, 0)||die "Unable to seek $fname\n"; die unless syswrite(FILE, $page, $page_size) == $page_size; close FILE; EOF --source include/start_mysqld.inc check table t1; drop table t1; For the above test case, add the following code too: diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc index f570a29e55e..9030b78a114 100644 --- a/storage/innobase/buf/buf0flu.cc +++ b/storage/innobase/buf/buf0flu.cc @@ -1893,6 +1893,7 @@ static bool log_checkpoint_low(lsn_t oldest_lsn, lsn_t end_lsn) return true; } + DBUG_EXECUTE_IF("ib_log_checkpoint_avoid_hard", goto do_nothing;); ut_ad(!recv_no_log_write); ut_ad(oldest_lsn > log_sys.last_checkpoint_lsn); /* Repeat the FILE_MODIFY records after the checkpoint, in case some And it repeats the same issue: 2023-08-07 15:32:01 0 [Note] InnoDB: Completed initialization of buffer pool 2023-08-07 15:32:01 0 [ERROR] InnoDB: Checksum mismatch in the first page of file .//undo001 2023-08-07 15:32:01 0 [ERROR] InnoDB: Unable to read first page of file .//undo001 2023-08-07 15:32:01 0x7f5a89471d40 InnoDB: Assertion failure in file /home/thiru/mariadb_parent/server/11.2-new/storage/innobase/srv/srv0start.cc line 803 InnoDB: Failing assertion: !i || prev_id + 1 == space_id

Thirunarayanan Balathandayuthapani added a comment - 2023-09-21 14:22

Patch is in bb-10.6-MDEV-31851

Thirunarayanan Balathandayuthapani added a comment - 2023-09-21 14:22 Patch is in bb-10.6- MDEV-31851

Marko Mäkelä added a comment - 2023-09-26 09:18

Thank you. The change looks correct, but I suggested some simplifications and code deduplication. Please test this on 10.11 as well; I do not think that after ~~MDEV-19229~~ it is valid to assume that the first undo tablespace ID is 1. In 11.0 it might be valid again, thanks to ~~MDEV-29986~~.

Marko Mäkelä added a comment - 2023-09-26 09:18 Thank you. The change looks correct, but I suggested some simplifications and code deduplication. Please test this on 10.11 as well; I do not think that after MDEV-19229 it is valid to assume that the first undo tablespace ID is 1. In 11.0 it might be valid again, thanks to MDEV-29986 .

Marko Mäkelä added a comment - 2023-09-28 11:02

I tested the revised patch and provided some suggestions. I think that it would be good to refactor the doublewrite recovery code a little more, to ensure consistent messaging and error handling and to avoid code duplication.

Marko Mäkelä added a comment - 2023-09-28 11:02 I tested the revised patch and provided some suggestions. I think that it would be good to refactor the doublewrite recovery code a little more, to ensure consistent messaging and error handling and to avoid code duplication.

Marko Mäkelä added a comment - 2023-10-16 13:15

I think that we’d better not change innodb_undo_tablespaces to a 32-bit unsigned integer in a GA release. Can you update the patch accordingly?

Marko Mäkelä added a comment - 2023-10-16 13:15 I think that we’d better not change innodb_undo_tablespaces to a 32-bit unsigned integer in a GA release. Can you update the patch accordingly?

Marko Mäkelä added a comment - 2023-10-17 11:51

Thank you, the fix looks fine. Let us use the value ULINT_UNDEFINED for the case that the first page is corrupted.

Marko Mäkelä added a comment - 2023-10-17 11:51 Thank you, the fix looks fine. Let us use the value ULINT_UNDEFINED for the case that the first page is corrupted.

Marko Mäkelä added a comment - 2023-11-06 15:03

The fix broke the doublewrite buffer recovery, caught by frequent failures of the test innodb_fts.crash_recovery,release. A fixup was pushed.

Marko Mäkelä added a comment - 2023-11-06 15:03 The fix broke the doublewrite buffer recovery, caught by frequent failures of the test innodb_fts.crash_recovery,release . A fixup was pushed.

People

Assignee:: Thirunarayanan Balathandayuthapani

Reporter:: Matthias Leich

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 2023-08-04 14:11

Updated:: 2025-03-06 04:47

Resolved:: 2023-10-17 14:16

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server