[MDEV-21899] INSERT into a secondary index with zero-data-length key is not crash-safe Created: 2020-03-09  Updated: 2020-03-27  Resolved: 2020-03-27

Status: Closed
Project: MariaDB Server
Component/s: mariabackup, Storage Engine - InnoDB
Affects Version/s: 10.5.2
Fix Version/s: 10.5.3

Type: Bug Priority: Major
Reporter: Matthias Leich Assignee: Marko Mäkelä
Resolution: Fixed Votes: 1
Labels: None

Attachments: File MDEV-21899.tgz     File rqg.log    
Issue Links:
Problem/Incident
is caused by MDEV-21724 Optimize page_cur_insert_rec_low() re... Closed
Relates
relates to MDEV-12353 Efficient InnoDB redo log record format Closed

 Description   

Problem found during RQG testing using the RQG reporter "Mariabackup".
Scenario:
1. A more or less heavy DDL/DML load is running.
     While it runs, mariabackup ... --backup ... is invoked.
2. mariabackup ... --prepare ... is called for the first time.
 
2020-03-06 19:20:38 0 [Note] InnoDB: Starting final batch to recover 491 pages from redo log.
2020-03-06 19:20:39 0 [ERROR] InnoDB: Not applying DELETE_ROW_FORMAT_DYNAMIC due to corruption on [page id: space=167, page number=36]
2020-03-06 19:20:39 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
2020-03-06 19:20:39 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1580] with error Data structure corruption
[00] FATAL ERROR: 2020-03-06 19:20:40 mariabackup: innodb_init() returned 39 (Data structure corruption).
 
origin/HEAD, origin/10.5 23685378bafa38a74957e8f07e2d56b57e90fa53 2020-03-06T11:06:59+02:00
 
 
A second variant of the same failure:
1. Only one session, and it runs
    REPLACE INTO <random table which was created earlier> ( _field_no_pk ) VALUES ( <random _mediumint value> ) ;
2. At some point the server is killed (SIGKILL)
3. The server is restarted
 
2020-03-23 14:12:49 0 [Note] InnoDB: Completed initialization of buffer pool
2020-03-23 14:12:49 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2020-03-23 14:12:49 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=335923252
2020-03-23 14:12:49 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 2 row operations to undo
2020-03-23 14:12:49 0 [Note] InnoDB: Trx id counter is 21150
2020-03-23 14:12:49 0 [Note] InnoDB: Starting final batch to recover 272 pages from redo log.
2020-03-23 14:12:49 0 [ERROR] InnoDB: Not applying DELETE_ROW_FORMAT_DYNAMIC due to corruption on [page id: space=5, page number=22]
2020-03-23 14:12:49 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
2020-03-23 14:12:49 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1574] with error Data structure corruption
 
origin/10.5 02fe99750536aad7c9e75513be47c91c8c568419 2020-03-23T10:45:08+01:00
 
The bug affects both debug and non-debug builds.



 Comments   
Comment by Marko Mäkelä [ 2020-03-09 ]

In the case that I analyzed, I observed 3 or 4 mini-transactions for the page:

  1. Create the B-tree page (root page of a secondary index, page 36 in the tablespace).
  2. Insert one record into the B-tree page, with the key 0x80000000 (it looks like the PRIMARY KEY was an empty string).
  3. Update the PAGE_MAX_TRX_ID and delete-mark the record.
  4. Delete a non-existing record with the non-existing predecessor at byte offset 0x8a.

We could detect the inconsistency earlier, by checking that the predecessor record is not beyond PAGE_HEAP_TOP. We could copy some further consistency checks from the INSERT function.

Until we have a reasonably repeatable test case, we do not know what is causing this. It would seem that some redo log is missing. A bug in Mariabackup?

Comment by Marko Mäkelä [ 2020-03-26 ]

I think that I finally found the cause. It was rather challenging, so I am providing details for future reference, to aid in similar debugging. The patch is simple, and the cause is MDEV-21724:

diff --git a/storage/innobase/page/page0cur.cc b/storage/innobase/page/page0cur.cc
index bce7231059a..ec79450995a 100644
--- a/storage/innobase/page/page0cur.cc
+++ b/storage/innobase/page/page0cur.cc
@@ -1551,7 +1551,6 @@ page_cur_insert_rec_low(
   /* Insert the record, possibly copying from the preceding record. */
   ut_ad(mtr->get_log_mode() == MTR_LOG_ALL);
 
-  if (data_size)
   {
     const byte *r= rec;
     const byte *c= cur->rec;

In a provided dataset, there were at least 2 pages for which a log record failed to apply: pages 5:9 and 5:12. I concentrated on page 5:9. For that one, I instrumented the crash recovery with gdb as follows:

break ib::error::~error
break page_apply_delete_dynamic
cond 2 block.page.id.m_id==0x500000009
break 'recv_recover_page(buf_block_t*, mtr_t&, std::_Rb_tree_iterator<std::pair<page_id_t const, page_recv_t> > const&, fil_space_t*, mlog_init_t::init*)'
cond 3 block.page.id.m_id==0x500000009
command 3
set scheduler-locking on
enable 4
continue
break log_phys_t::apply
disable 4
continue
display/x *begin()@len
display *this
command 4
continue
end
continue

There were 2 page_apply_delete_dynamic() calls for this page. The second one failed because it pointed to an area of the page that was filled with zero bytes. That is, the next-record loop to find the record owner in the sparse page directory would exceed n_recs, and the page would correctly be regarded as corrupted.

Armed with the knowledge of the LSN and the redo log record contents, I used the provided rr replay environment. First, I think that I let it run to the end (to the SIGKILL that was originally initiated by RQG) and executed when to see how many steps there are in total; it was above 4 million. Then, towards the end, I did roughly the following (omitting lots of steps):

run 4000000
break row_upd_sec_index_entry
cond 1 log_sys.lsn._M_i>176623800
break mtr_buf_t::close
cond 2 log_sys.lsn._M_i>176623800
define find_page_105
  set $id=((unsigned long long)$arg0)<<32|$arg1
  set $chunk=buf_pool->chunks
  set $i=buf_pool->n_chunks
  set $found=0
  while ($i-- && !$found)
    set $j = $chunk->size
    set $b = $chunk->blocks
    while ($j--)
      if ($b->page.id.m_id==$id)
        print $b
        set $found=1
        loop_break
      end
      set $b++
    end
    set $chunk++
  end
end
 
document find_page_105
Buffer pool look-up: find_page_105 space_id page_no
end
find_page_105 5 3
watch -l *(ulonglong*)$1.frame[0x38]@8
watch -l $1.frame[0x7a]
display/x $1.frame[0x38]@8
display log_sys.lsn

The first hardware watchpoint is for PAGE_MAX_TRX_ID on the page. The page turned out to be a secondary index page.

The second watchpoint is for a byte that used to be a delete-mark flag that was set back and forth by the test. It was accidentally hit by page_cur_insert_rec_low(). Thanks to the breakpoint on mtr_buf_t::close, which was not called for that insert, I knew that something was wrong. So, I went back in time by issuing

when
run (the reported number - a little bit)

and single-stepped through the function page_cur_insert_rec_low(). The condition that my patch above removed is redundant and outright harmful.

It is a rare corner case that a secondary index record has no data payload at all. It is possible when the PRIMARY KEY consists of variable-length string columns that are all empty, and all secondary index columns are either NULL or the empty string.

We failed to write the INSERT_REUSE_DYNAMIC redo log record for the following statement:

REPLACE INTO `table100_innodb_default` ( `col_text_latin1_unique_default_null` ) VALUES ( 5900288 );

This was on the secondary index col_char_latin1_unique, and the value is (NULL,''), as expected.

Generated at Thu Feb 08 09:10:41 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.