[MDEV-17780] innodb.truncate_recover crashes in recovery due to out-of-bounds page read Created: 2018-11-20  Updated: 2018-11-20  Resolved: 2018-11-20

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.3.10, 10.4.0, 10.2.19
Fix Version/s: 10.4.1, 10.2.20, 10.3.12

Type: Bug Priority: Major
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: crash, recovery

Issue Links:
Relates
relates to MDEV-13564 TRUNCATE TABLE and undo tablespace tr... Closed
relates to MDEV-17049 Enable --suite=innodb_undo on buildbot Closed
relates to MDEV-17679 innodb.undo_truncate_recover failed i... Closed

 Description   

10.4 dde2ca4aa108b611b5fdfc970146b28461ef08bf

innodb.undo_truncate_recover '2,4k,innodb' w3 [ fail ]
        Test ended at 2018-11-20 12:27:24
 
CURRENT_TEST: innodb.undo_truncate_recover
2018-11-20 12:27:24 0 [Note] InnoDB: Starting final batch to recover 2983 pages from redo log.
2018-11-20 12:27:24 0 [ERROR] [FATAL] InnoDB: Trying to read page number 2165 in space 1, space name innodb_undo001, which is outside the tablespace bounds. Byte offset 0, len 4096
#7  0x0000000000d28620 in fil_io (type=..., sync=<optimized out>, 
    page_id=..., page_size=..., byte_offset=0, len=4096, buf=0x7f6ba54fb000, 
    message=0x7f6ba4dbd6a0, ignore_missing_space=false)
    at /mariadb/10.4/storage/innobase/fil/fil0fil.cc:4350
(gdb) p *space
$1 = {name = 0x3cf9120 "innodb_undo001", id = 1, max_lsn = 0, 
  stop_new_ops = false, is_being_truncated = false, redo_skipped_count = 0, 
  purpose = FIL_TYPE_TABLESPACE, chain = {count = 1, start = 0x3cf9240, 
    end = 0x3cf9240, node = &fil_node_t::chain, init = 51966}, size = 1792, 
  size_in_header = 3584, free_len = 9, free_limit = 2560, recv_size = 0, 
  flags = 192, n_reserved_extents = 0, n_pending_flushes = 0, 

When starting recovery with --debug=d,ib_log, we will see the following:

2018-11-20 12:42:37 0 [Note] InnoDB: Read redo log up to LSN=23183360
recv_group_scan_log_recs: ib_log: scan 23080787: multi-log rec MLOG_FILE_CREATE2 len 21 page 1:640
2018-11-20 12:42:48 0 [Note] InnoDB: Starting final batch to recover 2983 pages from redo log.
recv_addr_trim: ib_log: discarding log beyond end of tablespace [page id: space=1, page number=640] before LSN 23080787
recv_addr_trim: ib_log: Discarding MLOG_4BYTES for page 1:2165 at 22948721
recv_addr_trim: ib_log: Discarding MLOG_2BYTES for page 1:2165 at 22948728
2018-11-20 12:42:57 0 [Note] InnoDB: To recover: 1268 pages from log
2018-11-20 12:42:57 0 [ERROR] [FATAL] InnoDB: Trying to read page number 2165 in space 1, space name innodb_undo001, which is outside the tablespace bounds. Byte offset 0, len 4096

It appears that the recv_addr_trim() did not have the desired effect, or there is a bug in recv_apply_hashed_log_recs() or recv_read_in_area().



 Comments   
Comment by Marko Mäkelä [ 2018-11-20 ]

It looks like can reliably reproduce this with a 10.4 that was built with cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_C_COMPILER=clang-7 -DCMAKE_CXX_COMPILER=clang++-7 -DCMAKE_C_FLAGS=-O2 -DCMAKE_CXX_FLAGS=-O2. This is not the combination that I am normally using. Without -O2 the test passes.

Comment by Marko Mäkelä [ 2018-11-20 ]

The problem was that recv_addr_trim() is not properly detaching the elements from the hash table. It is easier to not detach the elements at all, and instead make recv_apply_hashed_log_recs() ignore hash buckets that have no log records attached.

Generated at Thu Feb 08 08:39:03 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.