[MDEV-30725] Assertion `mach_read_from_8(b->page.frame + 16U) <= m_commit_lsn' failed in mtr_t::commit on failed crash recovery Created: 2023-02-25  Updated: 2023-03-04  Resolved: 2023-03-04

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 11.0
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Roel Van de Paar Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: crash-recovery


 Description   

Test setup:

  • Single SQL input thread
  • The original run client was pquery (not required)
  • About 18k lines were processed before kill -9 of server (runtime: 10 seconds)
  • A copy of the data directory was taken before the first crash recovery attempt
  • The default server storage engine was InnoDB (unless changed in the SQL)
  • This crash recovery issue is readily reproducible using the original data directory
  • The branch in use was bb-11.0-oalter (MDEV-16329), likely unrelated
  • On startup, crash recovery fails with:

11.0.1 bb-11.0-oalter f9b33ac570337be320f718d52fd88d301a2bc1e7

2023-02-25 12:11:51 0 [Note] InnoDB: Initializing buffer pool, total size = 300.000MiB, chunk size = 4.688MiB
2023-02-25 12:11:51 0 [Note] InnoDB: Completed initialization of buffer pool
2023-02-25 12:11:51 0 [Note] InnoDB: File system buffers for log disabled (block size=4096 bytes)
2023-02-25 12:11:51 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=989138
2023-02-25 12:11:51 0 [Note] InnoDB: Ignoring a doublewrite copy of page [page id: space=71, page number=3] with future log sequence number 1399073
2023-02-25 12:11:51 0 [Note] InnoDB: Ignoring a doublewrite copy of page [page id: space=191, page number=3] with future log sequence number 1420843
2023-02-25 12:11:51 0 [Note] InnoDB: Ignoring a doublewrite copy of page [page id: space=183, page number=3] with future log sequence number 1397557
2023-02-25 12:11:51 0 [Note] InnoDB: Ignoring a doublewrite copy of page [page id: space=45, page number=3] with future log sequence number 1479977
2023-02-25 12:11:51 0 [Note] InnoDB: Ignoring a doublewrite copy of page [page id: space=45, page number=2] with future log sequence number 1479977
2023-02-25 12:11:51 0 [Note] InnoDB: Ignoring a doublewrite copy of page [page id: space=3, page number=59] with future log sequence number 1485497
2023-02-25 12:11:51 0 [ERROR] InnoDB: Page [page id: space=3, page number=59] log sequence number 1485497 is in the future! Current system log sequence number 1376107.
2023-02-25 12:11:51 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
2023-02-25 12:11:51 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 424 row operations to undo
2023-02-25 12:11:51 0 [Note] InnoDB: Trx id counter is 1276
2023-02-25 12:11:51 0 [Note] InnoDB: Starting final batch to recover 68 pages from redo log.
2023-02-25 12:11:51 0 [Note] InnoDB: Last binlog file './binlog.000001', position 106953
2023-02-25 12:11:51 0 [Note] InnoDB: Opened 3 undo tablespaces
2023-02-25 12:11:51 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
2023-02-25 12:11:51 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2023-02-25 12:11:51 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
2023-02-25 12:11:51 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2023-02-25 12:11:51 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
mysqld: /test/bb-11.0-oalter_dbg/storage/innobase/mtr/mtr0mtr.cc:352: void mtr_t::commit(): Assertion `mach_read_from_8(b->page.frame + 16U) <= m_commit_lsn' failed.

11.0.1 bb-11.0-oalter f9b33ac570337be320f718d52fd88d301a2bc1e7

Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=22422564496960) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x1464a8fde640 (LWP 4184421))]
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=22422564496960) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=22422564496960) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=22422564496960, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00001464cb270476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00001464cb2567f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00001464cb25671b in __assert_fail_base (fmt=0x1464cb40b150 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=0x557403332c70 "mach_read_from_8(b->page.frame + 16U) <= m_commit_lsn", 
    file=0x557403333b28 "/test/bb-11.0-oalter_dbg/storage/innobase/mtr/mtr0mtr.cc", line=352, function=<optimized out>)
    at ./assert/assert.c:92
#6  0x00001464cb267e96 in __GI___assert_fail (
    assertion=0x557403332c70 "mach_read_from_8(b->page.frame + 16U) <= m_commit_lsn", 
    file=0x557403333b28 "/test/bb-11.0-oalter_dbg/storage/innobase/mtr/mtr0mtr.cc", line=352, 
    function=0x557403332169 "void mtr_t::commit()") at ./assert/assert.c:101
#7  0x0000557402c50aba in mtr_t::commit (this=this@entry=0x1464a8fdcef0)
    at /test/bb-11.0-oalter_dbg/storage/innobase/mtr/mtr0mtr.cc:352
#8  0x0000557402d90830 in trx_undo_truncate_end (undo=..., limit=0, is_temp=is_temp@entry=false)
    at /test/bb-11.0-oalter_dbg/storage/innobase/trx/trx0undo.cc:869
#9  0x0000557402d908e6 in trx_undo_try_truncate (trx=...) at /test/bb-11.0-oalter_dbg/storage/innobase/trx/trx0undo.cc:882
#10 0x0000557402d39bd7 in row_undo_rec_get (node=0x146484001860)
    at /test/bb-11.0-oalter_dbg/storage/innobase/row/row0undo.cc:296
#11 row_undo (thr=0x146484001680, node=0x146484001860) at /test/bb-11.0-oalter_dbg/storage/innobase/row/row0undo.cc:390
#12 row_undo_step (thr=thr@entry=0x146484001680) at /test/bb-11.0-oalter_dbg/storage/innobase/row/row0undo.cc:452
#13 0x0000557402c9608e in que_thr_step (thr=0x146484001680) at /test/bb-11.0-oalter_dbg/storage/innobase/que/que0que.cc:595
#14 que_run_threads_low (thr=0x146484001680) at /test/bb-11.0-oalter_dbg/storage/innobase/que/que0que.cc:653
#15 que_run_threads (thr=0x146484001680) at /test/bb-11.0-oalter_dbg/storage/innobase/que/que0que.cc:673
#16 0x0000557402d796bc in trx_rollback_active (trx=trx@entry=0x1464c9297b80)
    at /test/bb-11.0-oalter_dbg/storage/innobase/trx/trx0roll.cc:608
#17 0x0000557402d7a3dc in trx_rollback_recovered (all=all@entry=true)
    at /test/bb-11.0-oalter_dbg/storage/innobase/trx/trx0roll.cc:746
#18 0x0000557402d7a7c8 in trx_rollback_all_recovered () at /test/bb-11.0-oalter_dbg/storage/innobase/trx/trx0roll.cc:792
#19 0x0000557402ef0b6f in tpool::task_group::execute (this=0x55740433d1e0 <rollback_all_recovered_group>, 
    t=t@entry=0x55740433d1c0 <rollback_all_recovered_task>) at /test/bb-11.0-oalter_dbg/tpool/task_group.cc:70
#20 0x0000557402ef0dc7 in tpool::task::execute (this=0x55740433d1c0 <rollback_all_recovered_task>)
    at /test/bb-11.0-oalter_dbg/tpool/task.cc:32
#21 0x0000557402eeeb52 in tpool::thread_pool_generic::worker_main (this=0x557404cf1f40, thread_var=0x557404cf2270)
    at /test/bb-11.0-oalter_dbg/tpool/tpool_generic.cc:580
#22 0x0000557402eefece in std::__invoke_impl<void, void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__t=<optimized out>, __f=<optimized out>) at /usr/include/c++/11/bits/invoke.h:74
#23 std::__invoke<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*>
    (__fn=<optimized out>) at /usr/include/c++/11/bits/invoke.h:96
#24 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::_M_invoke<0ul, 1ul, 2ul> (this=<optimized out>) at /usr/include/c++/11/bits/std_thread.h:253
#25 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::operator() (this=<optimized out>) at /usr/include/c++/11/bits/std_thread.h:260
#26 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool:
:thread_pool_generic*, tpool::worker_data*> > >::_M_run (this=<optimized out>) at /usr/include/c++/11/bits/std_thread.h:211
#27 0x00001464cb6192b3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#28 0x00001464cb2c2b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#29 0x00001464cb354a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

A full copy of the data dir is available in this ticket, allowing 'first startup after the crash' to be attempted as often as necessary to locate the bug



 Comments   
Comment by Roel Van de Paar [ 2023-02-25 ]

The xz archive contains the following files:

  • MYEXTRA: This file contains the options used during the test. The most relevant ones are usually located towards the end of the file
  • start_recovery: Running this script creates a recovery user with all privileges. Please make sure to change BASEDIR first
  • start, stop, cl, kill: Handy scripts to start, stop, client into, and kill the server. Remember to change directories as needed
  • root-access.sql, recovery-user.sql: used by start and start_recovery respectively for access (where necessary)
  • mysqld: This directory contains the mysqld (or a copy of mariadbd renamed to mysqld) along with the required ldd libs
  • default.node.tld_thread-0.sql: the full SQL trace used, including the server responses where still received
  • data.original: This is the original data directory. Copy its contents to ./data and use the latter instead
  • tmp.original: This is the original temporary storage directory. Copy its contents to ./tmp and use the latter instead
  • log/master.original.err: This contains the error log from the original trial prior to the "kill -9" command. It may provide debug clues
  • log/master.err: This file provides an example of what the error log looks like after crash recovery fails

Note that the data directory may contain additional relevant/helpful files.

To debug any issues, (repeat as necessary): First, change BASEDIR in start_recovery. Then:
rm -Rf data tmp; cp -r data.original data; cp -r tmp.original tmp; ./start_recovery; sleep 4; vi ./log/master.err

Comment by Roel Van de Paar [ 2023-02-25 ]

Based on the SQL trace, it looks like the server may have been processing a large CREATE TABLE t6(c1 ENUM('a','abcdefghija.... statement.

Comment by Marko Mäkelä [ 2023-02-25 ]

By far it is easiest to troubleshoot recovery failures if the killed server was traced by rr record. If we do not have an exact trace of what happened before the recovery failed, it is very difficult to say what is causing the error.

The circular write-ahead log file ib_logfile0 is 96MiB in size. The two checkpoint blocks in the file are as follows:

001000 00 00 00 00 00 05 e9 08 00 00 00 00 00 12 df e7
001010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
001030 00 00 00 00 00 00 00 00 00 00 00 00 a4 bd 8b c9
001040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
002000 00 00 00 00 00 0f 17 d2 00 00 00 00 00 14 90 be
002010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
002030 00 00 00 00 00 00 00 00 00 00 00 00 0f c5 74 81

The second block starting at 0x2000 corresponds to the checkpoint LSN that was chosen for recovery, 989138 (0xf17d2).

When I tried to recover from a copy of the data directory, recv_sys.lsn advanced up to 0x14f284.

If we assume that the log file was never rebuilt (by backup or log resizing) and that the log had never wrapped around, the log file offsets should be equal to file offsets. The log file has not yet wrapped around, as we can see from the end of the output of od -Ax -t x1 ib_logfile0:

14f3d0 dc a2 76 c1 20 0e 2d 1f 06 b1 0a 08 01 05 01 42
14f3e0 00 00 00 00 04 f9 b0 00 00 00 3b 13 7d 80 00 14
14f3f0 92 35 77 2d 1f 00 b8 1b 38 9a 01 b8 f1 38 6f 2e
14f400 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
6000000

In the MDEV-14425 format, if the log has not wrapped around and never been rebuilt (by log resizing or restoring a backup), the LSN is equal to the file offset. It is so also in this case:

000000 50 68 79 73 00 00 00 00 00 00 00 00 00 00 30 00

The log payload area starts at 0x3000, and the header indicates that the first LSN corresponding to that offset is 0x3000.

The last non-zero byte in the ib_logfile0 is right before 0x14f400 (1373184). The 5 bytes immediately before that look like the end of a mini-transaction. But, the error log mentions larger "LSN in the future" than that:

2023-02-25 14:55:23 0 [Note] InnoDB: Ignoring a doublewrite copy of page [page id: space=3, page number=59] with future log sequence number 1485497
2023-02-25 14:55:23 0 [ERROR] InnoDB: Page [page id: space=3, page number=59] log sequence number 1485497 is in the future! Current system log sequence number 1376107.
2023-02-25 14:55:23 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.

Okay, the last error message should be adjusted for MDEV-27199.

The final recovered mini-transaction starts at LSN=1370507 (0x14e98b) and ends at 1372804 (0x14f284). Here is its log:

14e980 3b 00 7d a1 9e ef 01 f4 a3 a9 1a 37 2d 00 2e 00
14e990 00 00 26 37 2d 00 2e 00 00 00 40 37 2d 00 2e 00
14e9a0 00 00 80 b2 03 80 c4 80 20 10 ff 35 2d 00 80 55
14e9b0 01 34 2d 00 41 01 b7 00 00 00 00 00 00 c6 d3 00
14e9c0 06 0b c3 78 04 ff d3 02 06 0b 39 2d 00 42 ff ff
14e9d0 ff ff 00 00 b7 00 ff ff ff ff 00 00 34 2d 00 41
14e9e0 00 b3 80 13 04 35 2d 00 80 45 02 35 2d 02 80 81
14e9f0 01 b7 00 00 00 00 00 00 c6 d3 00 06 0b 56 2d 00
14ea00 80 4c 06 0b 3a 2d 02 80 82 ff ff ff ff 00 00 b7
14ea10 00 ff ff ff ff 00 00 35 2d 02 80 81 00 b2 0f 01
14ea20 b7 00 00 00 00 00 00 c6 d3 00 06 0b 56 2d 00 80
14ea30 4c 06 0b b2 04 fe 35 2d 02 80 7d 01 12 2d 40 a1
14ea40 01 b2 31 52 2e 2d 03 06 81 8b 08 00 05 00 79 00
14ea50 00 00 40 37 2d 40 08 00 00 00 23 37 2d 23 0c 00
14ea60 00 00 40 20 87 64 2d 40 06 00 18 00 00 d0 87 00
14ea70 00 00 00 00 01 79 00 00 00 00 04 f9 b0 00 00 00
14ea80 3b 16 c8 80 00 16 b9 6a 6a 6a 6a 6a 6a 6a 6a 6a
14ea90 6a 6a 6a 6a 6a 6a 6a 6a 6a 6a 6a 6a 6a 6a 6a 6a
*
14f250 6a 6a 6a 6a 6a 6a 6a 77 2d 03 00 51 3b 20 e6 77
14f260 2d 23 00 28 7a 29 ed 77 2d 00 00 6d c8 8a bf 77
14f270 2d 02 00 77 57 c0 2b 77 2d 40 00 d1 1e ae 54 01
14f280 0b 80 2c c5 00 00 00 3b 13 32 80 00 14 92 30 77

After the end-of-mini-transaction marker at 0x14f27f we have a CRC-32C of the mini-transaction. The byte at 0x14f284 is 0, which is not a valid first byte for a mini-transaction.

The parsing of the log looks entirely correct to me. The log got somehow corrupted. How did you copy it? Could the mariadbd process still have been running while you already started copying files? Please note that killing a process is not instantaneous. See MDEV-28976 for another example of what can happen when one doesn’t wait for the server process to be killed.

Please reproduce this bug with an rr record of the server process that was killed before the failed recovery. Only in that way we can know what went wrong when writing the log or the too-new data pages. As far as I can tell, there is nothing wrong with the recovery side.

I would recommend building with cmake -DWITH_PMEM=ON and placing the ib_logfile0 somewhere under /dev/shm. In that way, the log writing can be much more conveniently debugged by setting data watchpoints on the log_sys.buf, which would be memory-mapped in that case.

Comment by Roel Van de Paar [ 2023-02-27 ]

Thank you Marko for checking and debugging this. This was the first attempt at creating a quality ticket for some crash recovery issues I am seeing.

Thank you also for the input, I am actioning this as suggested.

Comment by Marko Mäkelä [ 2023-02-27 ]

Roel, my initial impression is that the server process had not been killed at the time when the data files were already being copied. With things like O_DIRECT (MDEV-24854) and asynchronous I/O (MDEV-24883 enabled it for /dev/shm too), it probably is more important to wait for the process to actually terminate before starting to copy any files.

Comment by Roel Van de Paar [ 2023-02-27 ]

marko Understood & thank you. After the kill -9 I added a wait, as well as a sync, and a further small delay of 2 seconds before copying. I trust that should suffice to cover any grounds. All runs are in /dev/shm.

Comment by Roel Van de Paar [ 2023-03-01 ]

The issue did not reproduce on trunk (though other/different issues were observed, for which separate tickets will be created).
I will try bb-11.0-oalter next in case the issue was branch specific.
If it does not reproduce there, the new script updates likely prevent a race condition in the copy-after-crash or similar.

Comment by Roel Van de Paar [ 2023-03-04 ]

Further testing against bb-11.0-oalter did not reproduce this issue. The time spend here was still very valuable as it sorted out the kinks in the crash recovery testing setup, and many other crash recovery issues were observed since.

Generated at Thu Feb 08 10:18:26 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.