[MDEV-30911] Multi-batch recovery of ROW_FORMAT=COMPRESSED table hangs Created: 2023-03-23  Updated: 2023-03-23  Resolved: 2023-03-23

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 11.0, 11.1
Fix Version/s: 11.1.1, 11.0.2

Type: Bug Priority: Blocker
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: hang, recovery, regression-11.0, rr-profile

Attachments: File data.tar.xz    
Issue Links:
Problem/Incident
is caused by MDEV-30216 Read-ahead unnecessarily allocates an... Closed

 Description   

Some time ago, mleich encountered a case where a multi-batch crash recovery would hang:

11.0 d5d7c8ba96fbf8800d7273034f56cbb39f379982

2023-02-21  9:06:01 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=24728170
2023-02-21  9:06:06 0 [Note] InnoDB: Ignoring data file './test/#sql-alter-24ccd8-27.ibd' with space ID 1381. Another data file called ./cool_down/t3.ibd exists with the same space ID.
2023-02-21  9:06:06 0 [Note] InnoDB: Ignoring data file './test/#sql-alter-24ccd8-27.ibd' with space ID 1381. Another data file called ./cool_down/t3.ibd exists with the same space ID.
2023-02-21  9:06:06 0 [Note] InnoDB: Ignoring data file './test/#sql-alter-24ccd8-27.ibd' with space ID 1381. Another data file called ./cool_down/t3.ibd exists with the same space ID.
2023-02-21  9:06:06 0 [Note] InnoDB: Ignoring data file './test/#sql-alter-24ccd8-35.ibd' with space ID 1442. Another data file called ./cool_down/t1.ibd exists with the same space ID.
2023-02-21  9:06:06 0 [Note] InnoDB: Ignoring data file './test/#sql-alter-24ccd8-35.ibd' with space ID 1442. Another data file called ./cool_down/t1.ibd exists with the same space ID.
2023-02-21  9:06:07 0 [Note] InnoDB: Starting a batch to recover 412 pages from redo log.
2023-02-21  9:06:10 0 [Note] InnoDB: Ignoring data file './test/#sql-alter-24ccd8-35.ibd' with space ID 1442. Another data file called ./cool_down/t1.ibd exists with the same space ID.
2023-02-21  9:06:10 0 [Note] InnoDB: Ignoring data file './test/#sql-alter-24ccd8-35.ibd' with space ID 1497. Another data file called ./cool_down/t2.ibd exists with the same space ID.
2023-02-21  9:06:10 0 [Note] InnoDB: Ignoring data file './test/#sql-alter-24ccd8-35.ibd' with space ID 1497. Another data file called ./cool_down/t2.ibd exists with the same space ID.
2023-02-21  9:06:11 0 [Note] InnoDB: Ignoring data file './test/#sql-alter-24ccd8-35.ibd' with space ID 1497. Another data file called ./cool_down/t2.ibd exists with the same space ID.
2023-02-21  9:06:11 0 [Note] InnoDB: Ignoring data file './test/#sql-alter-24ccd8-35.ibd' with space ID 1538. Another data file called ./test/t2.ibd exists with the same space ID.
2023-02-21  9:06:11 0 [Note] InnoDB: Ignoring data file './test/#sql-alter-24ccd8-35.ibd' with space ID 1538. Another data file called ./test/t2.ibd exists with
 the same space ID.
2023-02-21  9:06:11 0 [Note] InnoDB: Ignoring data file './test/#sql-alter-24ccd8-35.ibd' with space ID 1538. Another data file called ./test/t2.ibd exists with the same space ID.
2023-02-21  9:06:12 0 [Note] InnoDB: Ignoring data file './test/#sql-alter-24ccd8-35.ibd' with space ID 1588. Another data file called ./test/t1.ibd exists with the same space ID.
2023-02-21  9:06:12 0 [Note] InnoDB: Ignoring data file './test/#sql-alter-24ccd8-35.ibd' with space ID 1588. Another data file called ./test/t1.ibd exists with the same space ID.
2023-02-21  9:06:12 0 [Note] InnoDB: Ignoring data file './test/#sql-alter-24ccd8-35.ibd' with space ID 1588. Another data file called ./test/t1.ibd exists with the same space ID.
2023-02-21  9:06:12 0 [Note] InnoDB: Starting a batch to recover 688 pages from redo log.
2023-02-21  9:06:13 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 1 row operations to undo
2023-02-21  9:06:13 0 [Note] InnoDB: Trx id counter is 9832
2023-02-21  9:06:13 0 [Note] InnoDB: Starting final batch to recover 90 pages from redo log.
230221  9:56:01 [ERROR] mysqld got signal 6 ;

As you can see from the last two lines, InnoDB was hung for about 50 minutes before the environment issued kill -ABRT. In the resulting core dump, we would have 2 entries in recv_sys.pages:

#12 0x0000558fc708eebd in recv_sys_t::apply (this=0x558fc8eb6380 <recv_sys>, last_batch=last_batch@entry=true) at /data/Server/bb-11.1-MDEV-30545A/storage/innobase/log/log0recv.cc:3486
3486	            my_cond_wait(&cond, &mutex.m_mutex);
(gdb) p *this
$6 = {mutex = {m_mutex = {global = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 2, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
        __size = '\000' <repeats 16 times>, "\002", '\000' <repeats 22 times>, __align = 0}, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, 
          __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = 0}, 
      file = 0x558fc81d3ae0 "/data/Server/bb-11.1-MDEV-30545A/storage/innobase/log/log0recv.cc", name = 0x558fc81d51e1 "mutex", line = 3458, count = 0, create_flags = 0, active_flags = 0, id = 15, 
      thread = 139641080355264, locked_mutex = 0x6120000037d8, used_mutex = 0x612000003858, prev = 0x0, next = 0x0}, m_psi = 0x0}, cond = {__data = {__wseq = {__value64 = 2, __value32 = {__low = 2, 
          __high = 0}}, __g1_start = {__value64 = 0, __value32 = {__low = 0, __high = 0}}, __g_refs = {2, 0}, __g_size = {0, 0}, __g1_orig_size = 0, __wrefs = 8, __g_signals = {0, 0}}, 
    __size = "\002", '\000' <repeats 15 times>, "\002", '\000' <repeats 19 times>, "\b\000\000\000\000\000\000\000\000\000\000", __align = 2}, apply_batch_on = true, found_corrupt_log = false, 
  found_corrupt_fs = false, recovery_on = true, apply_log_recs = true, len = 100663296, offset = 35194518, lsn = 35194518, file_checkpoint = 24945758, progress_time = 1676999161, 
  pages = std::map with 2 elements = {[{m_id = 6605659701248}] = {state = page_recv_t::RECV_BEING_READ, last_offset = 1, log = {head = 0x7f00acbdecb8, tail = 0x7f00ad006f68}}, [{m_id = 6605659701251}] = {
      state = page_recv_t::RECV_BEING_READ, last_offset = 1, log = {head = 0x7f00acc00460, tail = 0x7f00ad007058}}}, truncated_undo_spaces = {{lsn = 0, pages = 0} <repeats 127 times>}, dblwr = {
    pages = std::deque with 0 elements}, last_stored_lsn = 26198559, files = std::vector of length 0, capacity 0, blocks = {count = 15, start = 0x7f00acf33970, end = 0x7f00acb338a0, 
    node = &buf_block_t::unzip_LRU, init = 51966}, after_apply = false}
(gdb) p pages
$7 = std::map with 2 elements = {[{m_id = 6605659701248}] = {state = page_recv_t::RECV_BEING_READ, last_offset = 1, log = {head = 0x7f00acbdecb8, tail = 0x7f00ad006f68}}, [{m_id = 6605659701251}] = {
    state = page_recv_t::RECV_BEING_READ, last_offset = 1, log = {head = 0x7f00acc00460, tail = 0x7f00ad007058}}}

Recovery is waiting for these pages to be read in. Both are for the tablespace 0x602 = 1538, which does exist in the core dump:

(gdb) p *(fil_space_t*)0x612000009f40
$128 = {<ilist_node<unflushed_spaces_tag_t>> = {next = 0x0, prev = 0x0}, <ilist_node<default_encrypt_tag_t>> = {next = 0x0, prev = 0x0}, <ilist_node<space_list_tag_t>> = {next = 0x61200000a0e0, 
    prev = 0x612000009de0}, <ilist_node<named_spaces_tag_t>> = {next = 0x612000008ef0, prev = 0x6120000088f0}, hash = 0x0, max_lsn = 35194518, id = 1538, is_being_truncated = false, 
  purpose = FIL_TYPE_TABLESPACE, chain = {count = 1, start = 0x60600016a720, end = 0x60600016a720, node = &fil_node_t::chain, init = 51966}, size = 640, size_in_header = 640, free_len = 2, free_limit = 256, 
  recv_size = 640, n_reserved_extents = 0, committed_size = {m = std::atomic<unsigned int> = { 640 }}, n_pending = std::atomic<unsigned int> = { 0 }, latch = {writer = {
      lock = std::atomic<unsigned int> = { 0 }}, readers = std::atomic<unsigned int> = { 0 }}, latch_owner = 0, latch_count = {m = std::atomic<unsigned int> = { 0 }}, crypt_data = 0x0, 
  is_in_unflushed_spaces = false, is_in_default_encrypt = false, is_corrupted = {<std::__atomic_flag_base> = {_M_i = false}, <No data fields>}, freed_range_mutex = {<std::__mutex_base> = {_M_mutex = {__data = {
          __lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, 
        __align = 0}}, <No data fields>}, freed_ranges = {ranges = std::set with 0 elements}, last_freed_lsn = 0, flags = 41}

It turns out that also the two missed pages exist in the buffer pool in the core dump:

(gdb) p *(buf_block_t *) 0x7f00ad232520
$137 = {page = {id_ = {m_id = 6605659701251}, hash = 0x0, oldest_modification_ = {m = std::atomic<unsigned long> = { 0 }}, lock = {lock = {writer = {lock = std::atomic<unsigned int> = { 0 }}, 
        readers = std::atomic<unsigned int> = { 0 }}, recursive = 0, writer = std::atomic<unsigned long> = { 0 }, readers_lock = {lock = std::atomic<unsigned int> = { 0 }}, 
      readers = std::atomic<std::unordered_multiset<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > *> = { 0x60600060ff20 }}, 
    frame = 0x7f00ad2a0000 "\031Z", <incomplete sequence \365>, zip = {data = 0x7f00acef8000 "\031Z", <incomplete sequence \365>, m_end = 4558, m_nonempty = 1, n_blobs = 30, ssize = 4, m_start = 118, 
      m_external = false, fix = {m = std::atomic<unsigned int> = { 1073741824 }}}, in_zip_hash = false, in_LRU_list = true, in_page_hash = true, in_free_list = false, list = {prev = 0x0, next = 0x0}, LRU = {
      prev = 0x7f00ad0319c0, next = 0x7f00ad031dd0}, old = 0, freed_page_clock = 0, access_time = {m_counter = std::atomic<unsigned int> = { 4137157028 }}}, in_withdraw_list = false, in_unzip_LRU_list = true, 
  unzip_LRU = {prev = 0x7f00ad131d00, next = 0x7f00ad131b60}, modify_clock = 2, n_bytes = 0, n_fields = 1, n_hash_helps = 0, left_side = true, n_pointers = {m_counter = std::atomic<unsigned long> = { 0 }}, 
  curr_n_fields = 0, curr_n_bytes = 0, curr_left_side = 0, index = 0x0}
(gdb) p *(buf_block_t *) 0x7f00ad131b60
$138 = {page = {id_ = {m_id = 6605659701248}, hash = 0x0, oldest_modification_ = {m = std::atomic<unsigned long> = { 0 }}, lock = {lock = {writer = {lock = std::atomic<unsigned int> = { 0 }}, 
        readers = std::atomic<unsigned int> = { 0 }}, recursive = 0, writer = std::atomic<unsigned long> = { 0 }, readers_lock = {lock = std::atomic<unsigned int> = { 0 }}, 
      readers = std::atomic<std::unordered_multiset<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > *> = { 0x0 }}, frame = 0x7f00ad170000 "\361\352\366Z", 
    zip = {data = 0x7f00ad0b2000 "\361\352\366Z", m_end = 0, m_nonempty = 0, n_blobs = 0, ssize = 4, m_start = 0, m_external = false, fix = {m = std::atomic<unsigned int> = { 1073741824 }}}, 
    in_zip_hash = false, in_LRU_list = true, in_page_hash = true, in_free_list = false, list = {prev = 0x0, next = 0x0}, LRU = {prev = 0x7f00ad031dd0, next = 0x7f00ad032450}, old = 0, freed_page_clock = 0, 
    access_time = {m_counter = std::atomic<unsigned int> = { 4137157027 }}}, in_withdraw_list = false, in_unzip_LRU_list = true, unzip_LRU = {prev = 0x7f00ad232520, next = 0x7f00ad0326c0}, modify_clock = 2, 
  n_bytes = 0, n_fields = 1, n_hash_helps = 0, left_side = true, n_pointers = {m_counter = std::atomic<unsigned long> = { 0 }}, curr_n_fields = 0, curr_n_bytes = 0, curr_left_side = 0, index = 0x0}

Both are unlocked and have page.zip.fix=UNFIXED, that is, no page read is in progress. For some reason, the log application was never invoked on these pages. The reason may be that the pages are in ROW_FORMAT=COMPRESSED format (see page.zip.data).

data.tar.xz recovered just fine for me when I was using a large buffer pool. With innodb_buffer_pool_size=8m I can reproduce the hang, also when using rr record.



 Comments   
Comment by Marko Mäkelä [ 2023-03-23 ]

I can reproduce this with the 11.0 commit that implemented MDEV-30216 but not its parent.

The fix is simple: In every call to buf_read_page_low(), set the least significant bit of zip_size if crash recovery is in progress. In MDEV-30216, only one caller of the function performed this adjustment. Before MDEV-30216, buf_read_page_low() was invoking recv_recovery_is_on() itself.

Generated at Thu Feb 08 10:19:49 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.