MariaDB Server / MDEV-25745

InnoDB recovery fails with [ERROR] InnoDB: Not applying INSERT_REUSE_REDUNDANT due to corruption

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects versions: 10.5.2, 10.5.3, 10.5.4, 10.5.5, 10.5.6, 10.5.7, 10.5.8, 10.5.9, 10.5.10, 10.6.0, 10.6.1
    • Fix versions: 10.5.11, 10.6.2
    • Operating System: Amazon Linux 2 AMI
      Hardware: Amazon AWS t3a.medium
      3-Node Galera Cluster

    Description

      We have several production databases that we migrated in early 2021 from a single-node setup to a 3-node Galera Cluster for high availability. During the migration project we implemented a tool for managing the cluster; one of its features is a "cycle-db-cluster" function that replaces the oldest node of the cluster with a completely new one. Since the migration we have kept the cluster up to date by periodically replacing nodes with this automated process.

      On 2021-05-11 the replace process failed: a new 10.5.10 node could not join our cluster, which consisted of 10.5.9 nodes. In mysqld.log the new node reported [ERROR] InnoDB: Not applying INSERT_REUSE_REDUNDANT due to corruption on [page id: space=0, page number=14408]. We repeated the process several times, also with MariaDB 10.5.9 and a fresh virtual server each time, but the issue persisted. Our mysqldumps worked just fine, so we scheduled a service break for 2021-05-12 and started a new cluster using MariaDB 10.5.10 and the data from the mysqldumps. An important detail is that we imported the mysqldumps into a 1-node cluster and added the other 2 nodes after the imports had finished – so joining definitely worked shortly after the import. We also replaced the first node of the cluster with the "cycle-db-cluster" functionality about an hour after the service break.

      On 2021-05-20 we tried to replace a node in the cluster for the first time since the day we set it up. Again, joining a new node failed, and mysqld.log on the new joining node contains [ERROR] InnoDB: Not applying INSERT_REUSE_REDUNDANT due to corruption on [page id: space=0, page number=2366]. So now we have an almost-new 3-node 10.5.10 cluster that new nodes cannot join. The cluster itself continues to perform just fine and mysqldumps succeed. To me it seems that the system detects a corruption for some reason while the data inside the databases is actually fine.

      Attachments:

      20210520-cycle-db-cluster.log:

      Log of the automated cycle-db-cluster process, with details of how a new db server was set up from scratch, all the way to the attempt to start the mysqld process.

      Conf.zip:

      Database configuration files that we use for our nodes. These have a few variables that are replaced automatically during the node setup process.

      20210520-clusterjoin-failure-mysqld.log:

      mysqld.log from the node join failure at 2021-05-20.


          Activity

            The following record is the first to be found to be inconsistent with the page contents:

            #0  0x000056264c48b4d4 in log_phys_t::apply (this=this@entry=0x7f155569b3b0, 
                block=
                  @0x7f15441952d0: {page = {id_ = {m_id = 2366},
                at /mariadb/10.5m/storage/innobase/log/log0recv.cc:397
            397	            if (page_apply_insert_redundant(block, subtype & 1, prev_rec,
            (rr) p *this
            $1 = {<log_rec_t> = {next = 0x7f155569f320, lsn = 18962387413}, start_lsn = 18962387336}
            

            The same mini-transaction is modifying this page twice (I guess, inserting the buffered record into the change buffer and maybe updating PAGE_MAX_TRX_ID):

            ?func: ib_log: scan 18962387336: rec 20 len 72 page 0:2366
            ?func: ib_log: scan 18962387336: rec b3 len 4 page 0:2366
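
            (A cross-check of my own, not part of the trace output: the two records plus the 1-byte marker that ends the mini-transaction account for 72 + 4 + 1 = 77 bytes, and 18962387336 + 77 = 18962387413, which is exactly the lsn of the log_rec_t shown above.)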
            

            The reason for the inconsistency is:

                if (UNIV_UNLIKELY(free_rec + data_size > heap_top))
                  goto corrupted;
            

            That is, we are supposed to reuse the space that was occupied by a previously deleted record (the top of the PAGE_FREE stack), but the allocation extends past PAGE_HEAP_TOP, the end of the allocated heap. This looks serious indeed.
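
            To make the failing condition concrete, here is a minimal stand-alone sketch of that invariant (my own illustration with made-up numbers, not InnoDB source code): a record reused from the top of the PAGE_FREE stack must lie entirely below PAGE_HEAP_TOP, because the bytes above the heap top have never been allocated to any record.

                #include <cstdint>
                #include <cstdio>

                // free_rec: page offset of the record taken from the PAGE_FREE stack,
                // data_size: the space the new record needs there,
                // heap_top: the current PAGE_HEAP_TOP value of the page.
                static bool insert_reuse_is_consistent(uint32_t free_rec, uint32_t data_size,
                                                       uint32_t heap_top)
                {
                  return free_rec + data_size <= heap_top; // otherwise: "goto corrupted"
                }

                int main()
                {
                  // Hypothetical numbers: with a stale (too small) PAGE_HEAP_TOP the reuse
                  // appears to overflow the heap, so recovery refuses to apply the record.
                  std::printf("%d\n", insert_reuse_is_consistent(3800, 109, 3914)); // 1: consistent
                  std::printf("%d\n", insert_reuse_is_consistent(3800, 109, 2291)); // 0: "corruption"
                  return 0;
                }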

            mleich, I think that we’d better use innodb_page_size=4k and indexed VARCHAR columns for reproducing this. If we use a large page size or fixed-size records, we’d never hit this. The check would only fail if the last-allocated record in the page was freed and subsequently reallocated. I hope that this corruption is also reproducible with ROW_FORMAT=REDUNDANT tables (and without change buffering).

            I do not think I can do more on this until we have an rr replay trace.

            marko Marko Mäkelä added a comment

            mleich was able to produce an rr replay trace for this:

            ssh pluto
            rr replay /data/Results/1622138407/TBR-1093/dev/shm/vardir/1622138407/123/1/rr/latest-trace
            

            It reports Not applying INSERT_REUSE_REDUNDANT for a change buffer page, due to exactly the same reason as in the data directory that was provided by tomitukiainen.

            Next week, I will get the details from the failed recovery, and then debug the trace of the server run that ended in SIGKILL, to figure out what caused the problem.
            If we are lucky, this could explain MDEV-25783 as well. (For that to be possible, I think that something related to the change buffer merge must skip writing redo log.)

            marko Marko Mäkelä added a comment

            Below, I am documenting my workflow for finding the root cause of this failure. Hopefully it will serve an educational purpose. I believe that with rr replay traces of the killed server and of the misbehaving recovered server, as well as a copy of the data directory before the recovery was started, any recovery bug can be diagnosed.

            When we fail to apply the log record, it is for page 1160 in the system tablespace. We can also determine the LSN:

            #2  0x000055de458007c8 in page_apply_insert_redundant (block=..., reuse=true, prev=2652, enc_hdr=44, hdr_c=11, data_c=10, data=0x25592084fe0d, data_len=88)
                at /data/Server/bb-10.5-marko/storage/innobase/page/page0cur.cc:2321
            2321	    ib::error() << (reuse
            (rr) p/x block.frame[16]@8
            $1 = {0x0, 0x0, 0x0, 0x0, 0x0, 0xa2, 0xd7, 0x1}
            (rr) up
            #3  0x000055de457b21e1 in log_phys_t::apply (this=0x25592084fde8, block=..., last_offset=@0x606001178b6c: 0) at /data/Server/bb-10.5-marko/storage/innobase/log/log0recv.cc:397
            397	            if (page_apply_insert_redundant(block, subtype & 1, prev_rec,
            (rr) p/x *this
            $2 = {<log_rec_t> = {next = 0x25592084fe68, lsn = 0xa6a6b2}, start_lsn = 0xa6a64a}
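
            For readers following along: frame[16]@8 is the 8-byte big-endian FIL_PAGE_LSN field at byte offset 16 of the page, so the bytes above decode to LSN 0xa2d701. A stand-alone decoder (my own throwaway helper, not server code) for repeating this outside the debugger:

                #include <cstdint>
                #include <cstdio>

                // Big-endian 64-bit read, the encoding InnoDB uses for FIL_PAGE_LSN.
                static uint64_t read_be64(const unsigned char *p)
                {
                  uint64_t v = 0;
                  for (int i = 0; i < 8; i++)
                    v = (v << 8) | p[i];
                  return v;
                }

                int main()
                {
                  // The bytes shown by "p/x block.frame[16]@8" above.
                  const unsigned char fil_page_lsn[8] = {0, 0, 0, 0, 0, 0xa2, 0xd7, 0x01};
                  std::printf("0x%llx\n", (unsigned long long) read_be64(fil_page_lsn)); // 0xa2d701
                  return 0;
                }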
            

            Based on this recovery failure, for the rr replay execution trace of the server that ended in SIGKILL we must check every modification to the page between LSN 0xa2d701 and 0xa6a6b2. The log record that we failed to apply had been written by the following (and I do not yet conclude that there was anything wrong with those writes):

            #7  0x000055c213bccd96 in mtr_t::commit (this=0x2089438ffd00) at /data/Server/bb-10.5-marko/storage/innobase/mtr/mtr0mtr.cc:442
            #8  0x000055c213b3f878 in ibuf_mtr_commit (mtr=0x2089438ffd00) at /data/Server/bb-10.5-marko/storage/innobase/include/ibuf0ibuf.ic:64
            #9  0x000055c213b4ae1b in ibuf_insert_low (mode=36, op=IBUF_OP_INSERT, no_counter=0, entry=0x61a000844508, entry_size=77, index=0x6160000f03f0, page_id=..., zip_size=0, thr=0x621004e35de0)
                at /data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:3466
            #10 0x000055c213b4b2f7 in ibuf_insert (op=IBUF_OP_INSERT, entry=0x61a000844508, index=0x6160000f03f0, page_id=..., zip_size=0, thr=0x621004e35de0)
                at /data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:3603
            #11 0x000055c213e02647 in btr_cur_search_to_nth_level_func (index=0x6160000f03f0, level=0, tuple=0x61a000844508, mode=PAGE_CUR_LE, latch_mode=2, cursor=0x208943901ab0, ahi_latch=0x0, 
                file=0x55c214795e40 "/data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc", line=2974, mtr=0x208943901e70, autoinc=0) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc:1650
            #12 0x000055c213c8b3e2 in row_ins_sec_index_entry_low (flags=0, mode=2, index=0x6160000f03f0, offsets_heap=0x619000cb9d80, heap=0x619000cb8e80, entry=0x61a000844508, trx_id=0, thr=0x621004e35de0)
                at /data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc:2971
            #13 0x000055c213c8c59f in row_ins_sec_index_entry (index=0x6160000f03f0, entry=0x61a000844508, thr=0x621004e35de0, check_foreign=true) at /data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc:3294
            #14 0x000055c213c8c784 in row_ins_index_entry (index=0x6160000f03f0, entry=0x61a000844508, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc:3342
            #15 0x000055c213c8d14d in row_ins_index_entry_step (node=0x6220002b30f0, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc:3509
            #16 0x000055c213c8d6b6 in row_ins (node=0x6220002b30f0, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc:3668
            #17 0x000055c213c8de86 in row_ins_step (thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc:3818
            #18 0x000055c213cb83ea in row_insert_for_mysql (mysql_rec=0x61f0001f80b8 "\377f", prebuilt=0x6220002b2170, ins_mode=ROW_INS_NORMAL) at /data/Server/bb-10.5-marko/storage/innobase/row/row0mysql.cc:1419
            #19 0x000055c213a3b359 in ha_innobase::write_row (this=0x61d0008638b8, record=0x61f0001f80b8 "\377f") at /data/Server/bb-10.5-marko/storage/innobase/handler/ha_innodb.cc:7669
            (rr) frame 21
            #21 0x000055c2129b21fd in write_record (thd=thd@entry=0x62b0000af218, table=table@entry=0x619000bad198, info=info@entry=0x208943903d40, sink=sink@entry=0x0) at /data/Server/bb-10.5-marko/sql/sql_insert.cc:2107
            2107	  else if (unlikely((error=table->file->ha_write_row(table->record[0]))))
            (rr) p thd.query_string
            $5 = {string = {str = 0x62b0000b6238 "INSERT IGNORE INTO `t1` ( `pk`, `col_char` ) VALUES ( NULL, 'f' ) /* E_R Thread1 QNO 1506 CON_ID 16 */", length = 102}, cs = 0x55c21527dc60 <my_charset_latin1>}
            (rr) display
            2: /x log_sys.lsn._M_i = 0xa6a6b2
            (rr) when
            Current event: 623570
            (rr) reverse-continue
            Continuing.
             
            Thread 18 hit Hardware access (read/write) watchpoint 1: *(ulong*)0x64000007d1c0
             
            Value = 9683160
            0x000055c213b3a125 in std::__atomic_base<unsigned long>::load (__m=std::memory_order_relaxed, this=0x64000007d1c0) at /usr/include/c++/9/bits/atomic_base.h:419
            419		return __atomic_load_n(&_M_i, int(__m));
            2: /x log_sys.lsn._M_i = 0xa6a64a
            (rr) bt
            #0  0x000055c213b3a125 in std::__atomic_base<unsigned long>::load (__m=std::memory_order_relaxed, this=0x64000007d1c0) at /usr/include/c++/9/bits/atomic_base.h:419
            #1  Atomic_counter<unsigned long>::operator unsigned long (this=0x64000007d1c0) at /data/Server/bb-10.5-marko/include/my_counter.h:45
            #2  0x000055c213b39ac2 in buf_page_t::oldest_modification (this=0x64000007d178) at /data/Server/bb-10.5-marko/storage/innobase/include/buf0buf.h:936
            #3  0x000055c213b537da in mtr_t::is_block_dirtied (block=0x64000007d178) at /data/Server/bb-10.5-marko/storage/innobase/include/mtr0mtr.ic:35
            #4  0x000055c213b5385c in mtr_t::memo_push (this=0x2089438ffd00, object=0x64000007d178, type=MTR_MEMO_PAGE_X_FIX) at /data/Server/bb-10.5-marko/storage/innobase/include/mtr0mtr.ic:57
            #5  0x000055c213e45123 in buf_page_mtr_lock (block=0x64000007d178, rw_latch=2, mtr=0x2089438ffd00, file=0x55c21482fc20 "/data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc", line=370)
                at /data/Server/bb-10.5-marko/storage/innobase/buf/buf0buf.cc:2866
            #6  0x000055c213e461de in buf_page_get_low (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=10, file=0x55c21482fc20 "/data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc", line=370, 
                mtr=0x2089438ffd00, err=0x2089438fe430, allow_ibuf_merge=false) at /data/Server/bb-10.5-marko/storage/innobase/buf/buf0buf.cc:3413
            #7  0x000055c213e4671a in buf_page_get_gen (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=10, file=0x55c21482fc20 "/data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc", line=370, 
                mtr=0x2089438ffd00, err=0x2089438fe430, allow_ibuf_merge=false) at /data/Server/bb-10.5-marko/storage/innobase/buf/buf0buf.cc:3482
            #8  0x000055c213ad6899 in btr_block_get_func (index=..., page=1160, mode=2, merge=true, file=0x55c21482fc20 "/data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc", line=370, mtr=0x2089438ffd00)
                at /data/Server/bb-10.5-marko/storage/innobase/include/btr0btr.h:237
            #9  0x000055c213dfd9ff in btr_cur_latch_leaves (block=0x64000007d178, latch_mode=36, cursor=0x2089438ffb50, mtr=0x2089438ffd00) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc:370
            #10 0x000055c213e0333a in btr_cur_search_to_nth_level_func (index=0x616000006ff0, level=0, tuple=0x619000cb93f0, mode=PAGE_CUR_LE, latch_mode=36, cursor=0x2089438ffb50, ahi_latch=0x0, 
                file=0x55c21472bfa0 "/data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc", line=3276, mtr=0x2089438ffd00, autoinc=0) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc:1848
            #11 0x000055c213b407c8 in btr_pcur_open_low (index=0x616000006ff0, level=0, tuple=0x619000cb93f0, mode=PAGE_CUR_LE, latch_mode=36, cursor=0x2089438ffb50, 
                file=0x55c21472bfa0 "/data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc", line=3276, autoinc=0, mtr=0x2089438ffd00) at /data/Server/bb-10.5-marko/storage/innobase/include/btr0pcur.ic:441
            #12 0x000055c213b4a525 in ibuf_insert_low (mode=36, op=IBUF_OP_INSERT, no_counter=0, entry=0x61a000844508, entry_size=77, index=0x6160000f03f0, page_id=..., zip_size=0, thr=0x621004e35de0)
                at /data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:3276
            (rr) when
            Current event: 623569
            

            It seems that the same mini-transaction is modifying the change buffer page two times. The recovery failure occurred for one of those 2 log records.

            Let us follow the trace in the forward direction from an earlier point in time:

            (rr) display
            2: /x log_sys.lsn._M_i = 0x890396
            (rr) cond 1 log_sys.lsn._M_i>=0xa2d701
            (rr) continue
            Thread 45 hit Hardware access (read/write) watchpoint 1: *(ulong*)0x64000007d1c0
             
            Value = 9683160
            std::__atomic_base<unsigned long>::load (__m=std::memory_order_relaxed, this=0x64000007d1c0) at /usr/include/c++/9/bits/atomic_base.h:419
            419		return __atomic_load_n(&_M_i, int(__m));
            2: /x log_sys.lsn._M_i = 0xa2d701
            (rr) when
            Current event: 575927
            (rr) frame 26
            #26 0x000055c2129b21fd in write_record (thd=thd@entry=0x62b000150218, table=table@entry=0x6190012de598, info=info@entry=0x44a50e105d40, sink=sink@entry=0x0) at /data/Server/bb-10.5-marko/sql/sql_insert.cc:2107
            2107	  else if (unlikely((error=table->file->ha_write_row(table->record[0]))))
            (rr) p thd.query_string
            $6 = {string = {str = 0x62b000157238 "INSERT IGNORE INTO `t1` ( `pk` ) VALUES ( 10090 ) /* E_R Thread6 QNO 1528 CON_ID 21 */", length = 86}, cs = 0x55c21527dc60 <my_charset_latin1>}
            

            This was the last change for which the change buffer page had been written out. From this point onwards, the actions of the recovery must have diverged from the modifications to the page.

            Because the LSN range is very wide here, we’d better ignore accesses of block->page.oldest_modification and instead check each change of PAGE_HEAP_TOP of the page.

            (rr) break buf_page_read_complete thread 1
            Breakpoint 3 at 0x55de45a541b3: file /data/Server/bb-10.5-marko/storage/innobase/buf/buf0buf.cc, line 4100.
            (rr) disable 2
            (rr) reverse-continue
            Continuing.
             
            Thread 1 hit Breakpoint 3, buf_page_read_complete (bpage=0x55de4575e8c8 <mtr_t::memo_push(void*, mtr_memo_type_t)+230>, node=...) at /data/Server/bb-10.5-marko/storage/innobase/buf/buf0buf.cc:4100
            4100	{
            (rr) n
            4101	  const page_id_t id(bpage->id());
            (rr) 
            4117	  const byte *frame= bpage->zip.data
            (rr) 
            4118	    ? bpage->zip.data
            (rr) 
            4117	  const byte *frame= bpage->zip.data
            (rr) 
            4123	  if (!buf_page_decrypt_after_read(bpage, node))
            (rr) p/x frame[40]@2
            $6 = {0x8, 0xf3}
            (rr) set $f=frame
            (rr) display/x $f[40]@2
            1: /x $f[40]@2 = {0x8, 0xf3}
            (rr) dump binary memory recovered-page.bin $f $f+srv_page_size
            (rr) enable
            (rr) continue
            

            At the time of the failed recovery, PAGE_HEAP_TOP was 0xf4a (3914). At the time of the page flush (as well as at the start of the recovery), the field was 0x8f3 (2291).
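
            (For reference, a note of my own: the watched offset 40 is PAGE_HEADER (38) + PAGE_HEAP_TOP (2), and the field is stored big-endian, so the bytes {0x8, 0xf3} decode to 0x8f3 = 2291 and {0xf, 0x4a} would be 0xf4a = 3914.)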

            The contents of the page differ only slightly between the start of the recovery and what ought to be the corresponding point of logical time before the SIGKILL: only in the last 8 bytes of the page. The least significant 32 bits of the LSN and the page checksum are not updated until the page is about to be written out from the buffer pool. I saved the srv_page_size=4096 bytes with dump binary memory and compared the files:

            diff -u <(od -Ax -t x1 recover_start.bin) <(od -Ax -t x1 before_kill.bin)
            

            --- /dev/fd/63	2021-05-31 06:10:33.779551414 +0000
            +++ /dev/fd/62	2021-05-31 06:10:33.779551414 +0000
            @@ -109,5 +109,5 @@
             0008f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
             *
             000fe0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 74
            -000ff0 05 b3 03 ff 02 4b 00 65 00 a2 d7 01 8e 48 0a 5d
            +000ff0 05 b3 03 ff 02 4b 00 65 00 00 00 00 00 00 00 00
             001000
            

            So, we should have a sound starting point for examining the divergence.
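
            For repeating that comparison mechanically, a throwaway masking compare can stand in for the od/diff pipeline (a sketch of my own, assuming the recover_start.bin and before_kill.bin dumps from above and srv_page_size=4096; it ignores FIL_PAGE_LSN at bytes 16..23 and the 8-byte page trailer, which are only finalized on write-out):

                #include <cstdio>
                #include <cstring>
                #include <vector>

                // Read one 4096-byte page image; returns an empty vector on failure.
                static std::vector<unsigned char> slurp(const char *name)
                {
                  std::vector<unsigned char> v(4096);
                  FILE *f = fopen(name, "rb");
                  if (!f || fread(v.data(), 1, v.size(), f) != v.size())
                    v.clear();
                  if (f)
                    fclose(f);
                  return v;
                }

                int main()
                {
                  std::vector<unsigned char> a = slurp("recover_start.bin");
                  std::vector<unsigned char> b = slurp("before_kill.bin");
                  if (a.empty() || b.empty())
                    return 1;
                  // Compare everything except FIL_PAGE_LSN (16..23) and the last 8 bytes.
                  bool same = !memcmp(a.data(), b.data(), 16) &&
                              !memcmp(a.data() + 24, b.data() + 24, 4096 - 24 - 8);
                  std::puts(same ? "equivalent" : "diverged");
                  return 0;
                }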

            Interestingly, near the start of the recovery, we have a page_create_low() that will reset the PAGE_HEAP_TOP to 0x7d (125). The PAGE_HEAP_TOP will grow to 0x8ba (2059) until it is reset in another page_create_low(). It will monotonically grow until we seemingly get a glitch: 3878, 3932, 4055, 3932, 4043, 2305, 2414, …, 3803, 3914, 4025, 3914, (failure).

            In the rr replay trace that was terminated by SIGKILL, we can observe the changes of PAGE_HEAP_TOP as follows:

            (rr) set $f=block->frame
            (rr) display/x $f[40]@2
            3: /x $f[40]@2 = {0x8, 0xf3}
            (rr) watch -l $f[40]@2
            Hardware watchpoint 3: -location $f[40]@2
            (rr) c
            Continuing.
            [Switching to Thread 314150.314741]
             
            Thread 18 hit Hardware watchpoint 3: -location $f[40]@2
             
            Old value = <incomplete sequence \363>
            New value = "\000"
            __memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:261
            261	../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such file or directory.
            2: /x log_sys.lsn._M_i = 0xa33520
            3: /x $f[40]@2 = {0x0, 0x0}
            (rr) command 3
            Type commands for breakpoint(s) 3, one per line.
            End with a line saying just "end".
            >when
            >c
            >end
            (rr) c
            Continuing.
            (rr) c
            

            Everything seems to be in sync, and also the first monotonicity glitch turns out to be benign. First, the PAGE_HEAP_TOP increases to 4025 because the change buffer tree shrinks due to a change buffer merge during purge (and the records from an adjacent change buffer page are merged to our page). The decrease to 3914 occurs because the very last record of the page is being merged to an index page that is being accessed by dict_stats_analyze_index() (updating persistent statistics).

            The real divergence occurs later. I saved some rr replay output of monitoring changes to PAGE_HEAP_TOP and compared them:

            diff -u <(sed -ne 's/.*n=\([1-9][0-9]*\)) at.*/\1/p' heap-top-recovered|uniq) <(sed -ne 's/.*n=\([1-9][0-9]*\)) at.*/\1/p' heap-top-write|uniq)
            

            @@ -57,6 +57,25 @@
             4055
             3932
             4043
            +234
            +…
            +2196
             2305
             2414
             2523
            @@ -75,3 +94,4 @@
             3914
             4025
             3914
            +4023
            

            For some reason, the recovery trace shows no updates of PAGE_HEAP_TOP from 234 to 2196. The last record must be the one that recovery was supposed to apply, but did not, due to the flagged corruption. Let us check whether redo logging was properly enabled during this page_create_low():

            Thread 45 hit Hardware watchpoint 3: -location $f[40]@2
             
            Old value = "\017\\"
            New value = <incomplete sequence \313>
            mach_write_to_2 (b=0x64000047b028 <incomplete sequence \313>, n=4043) at /data/Server/bb-10.5-marko/storage/innobase/include/mach0data.ic:62
            62	}
            2: /x log_sys.lsn._M_i = 0xa4f621
            3: /x $f[40]@2 = {0xf, 0xcb}
            Current event: 598722
            [Switching to Thread 314150.314325]
             
            Thread 8 hit Hardware watchpoint 3: -location $f[40]@2
             
            Old value = <incomplete sequence \313>
            New value = "\000"
            __memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:261
            261	../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such file or directory.
            2: /x log_sys.lsn._M_i = 0xa6230e
            3: /x $f[40]@2 = {0x0, 0x0}
            Current event: 614654
             
            Thread 8 hit Hardware watchpoint 3: -location $f[40]@2
             
            Old value = "\000"
            New value = "\000}"
            page_create_low (block=0x64000007d178, comp=false) at /data/Server/bb-10.5-marko/storage/innobase/page/page0page.cc:311
            311			memcpy(page + PAGE_DATA, infimum_supremum_redundant,
            2: /x log_sys.lsn._M_i = 0xa6230e
            3: /x $f[40]@2 = {0x0, 0x7d}
            Current event: 614654
            

            That call is within a change buffer page reorganize:

            #3  0x000055c213bff4a4 in page_create (block=0x64000007d178, mtr=0x1ff90f650b70, comp=false) at /data/Server/bb-10.5-marko/storage/innobase/page/page0page.cc:331
            #4  0x000055c213dd847e in btr_page_reorganize_low (cursor=0x1ff90f64f740, index=0x616000006ff0, mtr=0x1ff90f650b70) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0btr.cc:1371
            #5  0x000055c213dda7c2 in btr_page_reorganize_block (z_level=6, block=0x64000007d178, index=0x616000006ff0, mtr=0x1ff90f650b70) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0btr.cc:1606
            #6  0x000055c213dea411 in btr_can_merge_with_page (cursor=0x1ff90f650800, page_no=1160, merge_block=0x1ff90f64f940, mtr=0x1ff90f650b70) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0btr.cc:5167
            #7  0x000055c213de1e48 in btr_compress (cursor=0x1ff90f650800, adjust=0, mtr=0x1ff90f650b70) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0btr.cc:3483
            #8  0x000055c213e100d7 in btr_cur_compress_if_useful (cursor=0x1ff90f650800, adjust=0, mtr=0x1ff90f650b70) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc:5468
            #9  0x000055c213e11b57 in btr_cur_pessimistic_delete (err=0x1ff90f650690, has_reserved_extents=1, cursor=0x1ff90f650800, flags=0, rollback=false, mtr=0x1ff90f650b70) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc:5908
            #10 0x000055c213b4e8e9 in ibuf_delete_rec (page_id=..., pcur=0x1ff90f650800, search_tuple=0x616000874bf0, mtr=0x1ff90f650b70) at /data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:4125
            #11 0x000055c213b50024 in ibuf_merge_or_delete_for_page (block=0x640000043928, page_id=..., zip_size=0) at /data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:4471
            

            Starting with MDEV-21725, the page reorganize is being logged in an optimized way, instead of being logged as individual inserts. Before MDEV-12353, the operation was covered by a logical log record.

            Immediately after the page reorganize operation, the page contents between recovery and the original write differ as follows:

            diff -u <(od -Ax -t x1 /dev/shm/r-after-reorg.bin) <(od -Ax -t x1 /dev/shm/w-after-reorg.bin)
            

            --- /dev/fd/63	2021-05-31 08:20:19.561244871 +0000
            +++ /dev/fd/62	2021-05-31 08:20:19.561244871 +0000
            @@ -1,5 +1,5 @@
             000000 00 00 00 00 00 00 04 88 00 00 04 89 00 00 04 87
            -000010 00 00 00 00 00 a2 d7 01 45 bf 00 00 00 00 00 00
            +000010 00 00 00 00 00 a5 07 15 45 bf 00 00 00 00 00 00
             000020 00 00 00 00 00 00 00 05 08 94 00 15 00 00 00 00
             000030 08 33 00 02 00 12 00 13 00 00 00 00 00 00 21 5b
             000040 00 00 ff ff ff ff 00 00 00 00 ff ff ff ff 00 00
            @@ -112,5 +112,5 @@
             0008a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
             *
             000fe0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 74
            -000ff0 05 38 03 84 01 d0 00 65 00 a2 d7 01 8e 48 0a 5d
            +000ff0 05 38 03 84 01 d0 00 65 00 00 00 00 00 00 00 00
             001000
            

            On recovery, the FIL_PAGE_LSN is 0xa2d701, which was the initial FIL_PAGE_LSN of the page. The field might only be updated by recovery after all log has been applied to the page. We have the LSN and checksum difference at the end of the page as well. It is nothing to worry about.

            Similarly, the page contents immediately before the failed log apply are equivalent:

            @@ -1,5 +1,5 @@
             000000 00 00 00 00 00 00 04 88 00 00 04 89 ff ff ff ff
            -000010 00 00 00 00 00 a2 d7 01 45 bf 00 00 00 00 00 00
            +000010 00 00 00 00 00 a6 49 aa 45 bf 00 00 00 00 00 00
             000020 00 00 00 00 00 00 00 0a 0f 4a 00 25 0e ea 00 6f
             000030 00 00 00 02 00 23 00 22 00 00 00 00 00 00 22 d5
             000040 00 00 ff ff ff ff 00 00 00 00 ff ff ff ff 00 00
            @@ -195,5 +195,5 @@
             000ef0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
             *
             000fe0 00 00 00 00 00 74 0d bc 0c 08 0a 54 08 a0 06 ec
            -000ff0 05 38 03 84 01 d0 00 65 00 a2 d7 01 8e 48 0a 5d
            +000ff0 05 38 03 84 01 d0 00 65 00 00 00 00 00 00 00 00
             001000
            

            So, the problem is only with the last record. We must look closer at this:

            Current event: 630234
            (rr) bt
            #0  mach_write_to_2 (b=0x64000047b028 "\017\267", n=4023) at /data/Server/bb-10.5-marko/storage/innobase/include/mach0data.ic:62
            #1  0x000055c213bf9f52 in page_mem_alloc_heap<>(buf_block_t *, ulint, ulint *) (block=0x64000007d178, need=109, heap_no=0x1f9a2c2ca1d0) at /data/Server/bb-10.5-marko/storage/innobase/page/page0cur.cc:1071
            #2  0x000055c213bf1062 in page_cur_insert_rec_low (cur=0x1f9a2c2cab58, index=0x616000006ff0, rec=0x6110003ae9fc "", offsets=0x61600066e1f0, mtr=0x1f9a2c2cad00) at /data/Server/bb-10.5-marko/storage/innobase/page/page0cur.cc:1388
            #3  0x000055c213dfb53d in page_cur_tuple_insert (cursor=0x1f9a2c2cab58, tuple=0x61900228cbf0, index=0x616000006ff0, offsets=0x1f9a2c2caab0, heap=0x1f9a2c2caa90, n_ext=0, mtr=0x1f9a2c2cad00) at /data/Server/bb-10.5-marko/storage/innobase/include/page0cur.ic:285
            #4  0x000055c213e09bfd in btr_cur_optimistic_insert (flags=3, cursor=0x1f9a2c2cab50, offsets=0x1f9a2c2caab0, heap=0x1f9a2c2caa90, entry=0x61900228cbf0, rec=0x1f9a2c2caaf0, big_rec=0x1f9a2c2caa70, n_ext=0, thr=0x621004e35de0, mtr=0x1f9a2c2cad00) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc:3562
            #5  0x000055c213b4aad5 in ibuf_insert_low (mode=36, op=IBUF_OP_INSERT, no_counter=0, entry=0x61a000844508, entry_size=77, index=0x6160000f03f0, page_id=..., zip_size=0, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:3403
            #6  0x000055c213b4b2f7 in ibuf_insert (op=IBUF_OP_INSERT, entry=0x61a000844508, index=0x6160000f03f0, page_id=..., zip_size=0, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:3603
            

            A glaring difference is that reuse=false becomes reuse=true on recovery:

            (rr) step
            1595	      mtr->page_insert(*block, reuse,
            (rr) step
            mtr_t::page_insert (this=0x64000047bf52, block=..., reuse=false, prev_rec=1, info_bits=0 '\000', n_fields_s=94292043288653, hdr_c=5, data_c=8, hdr=0x64000047bf4a "aY\031\t\005\004", hdr_l=1, data=0x6110003aea04 "\200", data_l=89) at /data/Server/bb-10.5-marko/storage/innobase/page/page0cur.cc:1103
            

            The actual mystery is why the log that was written differs from what was buffered by recovery:

            {0x20, 0x55, 0x0, 0x84, 0x8, 0x4, 0x8c, 0xd7, 0x2c, 0x5, 0x8, 0x61, 0x80, 0x0, 0x0, 0x0, 0x1, 0x83, 0xfe, 0x0, 0x40, 0x80, 0x3f, 0x86, 0x8, 0x0, 0x8, 0x80, 0x0, 0x62, 0x0 <repeats 69 times>, 0xf8, 0xfd}
            {0x20, 0x53, 0x0, 0x84, 0x8, 0x5, 0x89, 0xdc, 0x2c, 0x5, 0xa, 0x61, 0x2, 0x0, 0x1, 0x83, 0xfe, 0x0, 0x40, 0x80, 0x3f, 0x86, 0x8, 0x0, 0x8, 0x80, 0x0, 0x66, 0x0 <repeats 69 times>, 0xf8, 0xfa}
            

            The least significant bit of the first differing byte encodes the reuse flag. Let us check which string was present in the ib_logfile0 before recovery was started. (Note: this will miss records that would span the block boundary.)

            od -Ax -t x1 -w512 ib_logfile0 |grep '20 55 00 84 08 04 8c d7 2c'|cut -d\  -f1
            od -Ax -t x1 -w512 ib_logfile0 |grep '20 53 00 84 08 05 89 dc 2c'|cut -d\  -f1
            

            Surprisingly, we find exactly one match for both. The first match occurs at the end of the 512-byte block that starts at file offset 0xa70800. The last 4 bytes are a log block checksum:

            20 55 00 84 08 04 8c d7 2c 05 08 61 80 00 00 15 84 96 bc
            

            The record continues after the 4-byte checksum and the 12-byte header of the following block:

            00 01 83 fe 00 40 80 3f 86 08 00 08 80 00 62 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f8 fd b2 27 e2 00
            

            This looks like an exact match of the record that we wrote. Before the NUL byte that marks the end of the mini-transaction, there is another record: a WRITE to the same page.
            Now, let us look at the other match, which looks like what recovery was reading. It is located in the 512-byte block starting at byte offset 0xa68c00 of ib_logfile0:

            od -Ax -t x1 -w512 -N 99 -j 0xa68c4a ib_logfile0 
            

            a68c4a 20 53 00 84 08 05 89 dc 2c 05 0a 61 02 00 01 83 fe 00 40 80 3f 86 08 00 08 80 00 66 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f8 fa
            

            It is an exact match of the record that was read by recovery! The problem appears to be on the write side after all. I must debug further to find out when those bytes were written, and for which mini-transaction.
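
            To repeat the boundary-spanning match above mechanically, the per-block framing can be stripped first. A rough sketch of my own (not server code), using only the layout described above, namely 512-byte log blocks with a 12-byte header and a 4-byte checksum trailer:

                #include <cstdio>
                #include <vector>

                // Concatenate the payloads of "blocks" consecutive 512-byte log blocks,
                // starting at file offset "start" (e.g. 0xa70800 above), skipping the
                // 12-byte header and the 4-byte checksum of each block.
                static std::vector<unsigned char> log_payload(FILE *f, long start, int blocks)
                {
                  std::vector<unsigned char> out;
                  for (int i = 0; i < blocks; i++)
                  {
                    unsigned char block[512];
                    if (fseek(f, start + 512L * i, SEEK_SET) || fread(block, 1, 512, f) != 512)
                      break;
                    out.insert(out.end(), block + 12, block + 512 - 4);
                  }
                  return out;
                }

                int main()
                {
                  FILE *f = fopen("ib_logfile0", "rb");
                  if (!f)
                    return 1;
                  // Two blocks are enough to see the record that straddles the boundary
                  // at the end of the block starting at 0xa70800.
                  std::vector<unsigned char> payload = log_payload(f, 0xa70800, 2);
                  for (size_t i = 0; i < payload.size(); i++)
                    std::printf("%02x%c", payload[i], (i & 15) == 15 ? '\n' : ' ');
                  std::printf("\n");
                  fclose(f);
                  return 0;
                }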

            marko Marko Mäkelä added a comment - Below, I am documenting my workflow at finding the root cause of this failure. Hopefully it will serve an educational purpose. I believe that with rr replay traces of the killed server and misbehaving recovered server, as well as a copy of the data directory before the recovery was started, any recovery bug can be diagnosed. When we fail to apply the log record, it is for page 1160 in the system tablespace. We can also determine the LSN: #2 0x000055de458007c8 in page_apply_insert_redundant (block=..., reuse=true, prev=2652, enc_hdr=44, hdr_c=11, data_c=10, data=0x25592084fe0d, data_len=88) at /data/Server/bb-10.5-marko/storage/innobase/page/page0cur.cc:2321 2321 ib::error() << (reuse (rr) p/x block.frame[16]@8 $1 = {0x0, 0x0, 0x0, 0x0, 0x0, 0xa2, 0xd7, 0x1} (rr) up #3 0x000055de457b21e1 in log_phys_t::apply (this=0x25592084fde8, block=..., last_offset=@0x606001178b6c: 0) at /data/Server/bb-10.5-marko/storage/innobase/log/log0recv.cc:397 397 if (page_apply_insert_redundant(block, subtype & 1, prev_rec, (rr) p/x *this $2 = {<log_rec_t> = {next = 0x25592084fe68, lsn = 0xa6a6b2}, start_lsn = 0xa6a64a} Based on this recovery failure, for the rr replay execution trace of the server that ended in SIGKILL we must check every modification to the page between LSN 0xa2d701 and 0xa6a6b2. The log record that we failed to apply had been written by the following (and I do not yet conclude that there was anything wrong with those writes): #7 0x000055c213bccd96 in mtr_t::commit (this=0x2089438ffd00) at /data/Server/bb-10.5-marko/storage/innobase/mtr/mtr0mtr.cc:442 #8 0x000055c213b3f878 in ibuf_mtr_commit (mtr=0x2089438ffd00) at /data/Server/bb-10.5-marko/storage/innobase/include/ibuf0ibuf.ic:64 #9 0x000055c213b4ae1b in ibuf_insert_low (mode=36, op=IBUF_OP_INSERT, no_counter=0, entry=0x61a000844508, entry_size=77, index=0x6160000f03f0, page_id=..., zip_size=0, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:3466 #10 0x000055c213b4b2f7 in ibuf_insert (op=IBUF_OP_INSERT, entry=0x61a000844508, index=0x6160000f03f0, page_id=..., zip_size=0, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:3603 #11 0x000055c213e02647 in btr_cur_search_to_nth_level_func (index=0x6160000f03f0, level=0, tuple=0x61a000844508, mode=PAGE_CUR_LE, latch_mode=2, cursor=0x208943901ab0, ahi_latch=0x0, file=0x55c214795e40 "/data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc", line=2974, mtr=0x208943901e70, autoinc=0) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc:1650 #12 0x000055c213c8b3e2 in row_ins_sec_index_entry_low (flags=0, mode=2, index=0x6160000f03f0, offsets_heap=0x619000cb9d80, heap=0x619000cb8e80, entry=0x61a000844508, trx_id=0, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc:2971 #13 0x000055c213c8c59f in row_ins_sec_index_entry (index=0x6160000f03f0, entry=0x61a000844508, thr=0x621004e35de0, check_foreign=true) at /data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc:3294 #14 0x000055c213c8c784 in row_ins_index_entry (index=0x6160000f03f0, entry=0x61a000844508, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc:3342 #15 0x000055c213c8d14d in row_ins_index_entry_step (node=0x6220002b30f0, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc:3509 #16 0x000055c213c8d6b6 in row_ins (node=0x6220002b30f0, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc:3668 #17 
0x000055c213c8de86 in row_ins_step (thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc:3818 #18 0x000055c213cb83ea in row_insert_for_mysql (mysql_rec=0x61f0001f80b8 "\377f", prebuilt=0x6220002b2170, ins_mode=ROW_INS_NORMAL) at /data/Server/bb-10.5-marko/storage/innobase/row/row0mysql.cc:1419 #19 0x000055c213a3b359 in ha_innobase::write_row (this=0x61d0008638b8, record=0x61f0001f80b8 "\377f") at /data/Server/bb-10.5-marko/storage/innobase/handler/ha_innodb.cc:7669 … (rr) frame 21 #21 0x000055c2129b21fd in write_record (thd=thd@entry=0x62b0000af218, table=table@entry=0x619000bad198, info=info@entry=0x208943903d40, sink=sink@entry=0x0) at /data/Server/bb-10.5-marko/sql/sql_insert.cc:2107 2107 else if (unlikely((error=table->file->ha_write_row(table->record[0])))) (rr) p thd.query_string $5 = {string = {str = 0x62b0000b6238 "INSERT IGNORE INTO `t1` ( `pk`, `col_char` ) VALUES ( NULL, 'f' ) /* E_R Thread1 QNO 1506 CON_ID 16 */", length = 102}, cs = 0x55c21527dc60 <my_charset_latin1>} (rr) display 2: /x log_sys.lsn._M_i = 0xa6a6b2 (rr) when Current event: 623570 (rr) reverse-continue Continuing.   Thread 18 hit Hardware access (read/write) watchpoint 1: *(ulong*)0x64000007d1c0   Value = 9683160 0x000055c213b3a125 in std::__atomic_base<unsigned long>::load (__m=std::memory_order_relaxed, this=0x64000007d1c0) at /usr/include/c++/9/bits/atomic_base.h:419 419 return __atomic_load_n(&_M_i, int(__m)); 2: /x log_sys.lsn._M_i = 0xa6a64a (rr) bt #0 0x000055c213b3a125 in std::__atomic_base<unsigned long>::load (__m=std::memory_order_relaxed, this=0x64000007d1c0) at /usr/include/c++/9/bits/atomic_base.h:419 #1 Atomic_counter<unsigned long>::operator unsigned long (this=0x64000007d1c0) at /data/Server/bb-10.5-marko/include/my_counter.h:45 #2 0x000055c213b39ac2 in buf_page_t::oldest_modification (this=0x64000007d178) at /data/Server/bb-10.5-marko/storage/innobase/include/buf0buf.h:936 #3 0x000055c213b537da in mtr_t::is_block_dirtied (block=0x64000007d178) at /data/Server/bb-10.5-marko/storage/innobase/include/mtr0mtr.ic:35 #4 0x000055c213b5385c in mtr_t::memo_push (this=0x2089438ffd00, object=0x64000007d178, type=MTR_MEMO_PAGE_X_FIX) at /data/Server/bb-10.5-marko/storage/innobase/include/mtr0mtr.ic:57 #5 0x000055c213e45123 in buf_page_mtr_lock (block=0x64000007d178, rw_latch=2, mtr=0x2089438ffd00, file=0x55c21482fc20 "/data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc", line=370) at /data/Server/bb-10.5-marko/storage/innobase/buf/buf0buf.cc:2866 #6 0x000055c213e461de in buf_page_get_low (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=10, file=0x55c21482fc20 "/data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc", line=370, mtr=0x2089438ffd00, err=0x2089438fe430, allow_ibuf_merge=false) at /data/Server/bb-10.5-marko/storage/innobase/buf/buf0buf.cc:3413 #7 0x000055c213e4671a in buf_page_get_gen (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=10, file=0x55c21482fc20 "/data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc", line=370, mtr=0x2089438ffd00, err=0x2089438fe430, allow_ibuf_merge=false) at /data/Server/bb-10.5-marko/storage/innobase/buf/buf0buf.cc:3482 #8 0x000055c213ad6899 in btr_block_get_func (index=..., page=1160, mode=2, merge=true, file=0x55c21482fc20 "/data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc", line=370, mtr=0x2089438ffd00) at /data/Server/bb-10.5-marko/storage/innobase/include/btr0btr.h:237 #9 0x000055c213dfd9ff in btr_cur_latch_leaves (block=0x64000007d178, latch_mode=36, cursor=0x2089438ffb50, mtr=0x2089438ffd00) at 
/data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc:370 #10 0x000055c213e0333a in btr_cur_search_to_nth_level_func (index=0x616000006ff0, level=0, tuple=0x619000cb93f0, mode=PAGE_CUR_LE, latch_mode=36, cursor=0x2089438ffb50, ahi_latch=0x0, file=0x55c21472bfa0 "/data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc", line=3276, mtr=0x2089438ffd00, autoinc=0) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc:1848 #11 0x000055c213b407c8 in btr_pcur_open_low (index=0x616000006ff0, level=0, tuple=0x619000cb93f0, mode=PAGE_CUR_LE, latch_mode=36, cursor=0x2089438ffb50, file=0x55c21472bfa0 "/data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc", line=3276, autoinc=0, mtr=0x2089438ffd00) at /data/Server/bb-10.5-marko/storage/innobase/include/btr0pcur.ic:441 #12 0x000055c213b4a525 in ibuf_insert_low (mode=36, op=IBUF_OP_INSERT, no_counter=0, entry=0x61a000844508, entry_size=77, index=0x6160000f03f0, page_id=..., zip_size=0, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:3276 … (rr) when Current event: 623569 It seems that the same mini-transaction is modifying the change buffer page two times. The recovery failure occurred for one of those 2 log records. Let us follow the trace in the forward direction from an earlier point of time: (rr) display 2: /x log_sys.lsn._M_i = 0x890396 (rr) cond 1 log_sys.lsn._M_i>=0xa2d701 (rr) continue Thread 45 hit Hardware access (read/write) watchpoint 1: *(ulong*)0x64000007d1c0   Value = 9683160 std::__atomic_base<unsigned long>::load (__m=std::memory_order_relaxed, this=0x64000007d1c0) at /usr/include/c++/9/bits/atomic_base.h:419 419 return __atomic_load_n(&_M_i, int(__m)); 2: /x log_sys.lsn._M_i = 0xa2d701 (rr) when Current event: 575927 (rr) frame 26 #26 0x000055c2129b21fd in write_record (thd=thd@entry=0x62b000150218, table=table@entry=0x6190012de598, info=info@entry=0x44a50e105d40, sink=sink@entry=0x0) at /data/Server/bb-10.5-marko/sql/sql_insert.cc:2107 2107 else if (unlikely((error=table->file->ha_write_row(table->record[0])))) (rr) p thd.query_string $6 = {string = {str = 0x62b000157238 "INSERT IGNORE INTO `t1` ( `pk` ) VALUES ( 10090 ) /* E_R Thread6 QNO 1528 CON_ID 21 */", length = 86}, cs = 0x55c21527dc60 <my_charset_latin1>} This was the last change for which the change buffer page had been written out. From this point onwards, the actions of the recovery must have diverged from the modifications to the page. Because the LSN range is very wide here, we’d better ignore accesses of block->page.oldest_modification and instead check each change of PAGE_HEAP_TOP of the page. (rr) break buf_page_read_complete thread 1 Breakpoint 3 at 0x55de45a541b3: file /data/Server/bb-10.5-marko/storage/innobase/buf/buf0buf.cc, line 4100. (rr) disable 2 (rr) reverse-continue Continuing.   Thread 1 hit Breakpoint 3, buf_page_read_complete (bpage=0x55de4575e8c8 <mtr_t::memo_push(void*, mtr_memo_type_t)+230>, node=...) at /data/Server/bb-10.5-marko/storage/innobase/buf/buf0buf.cc:4100 4100 { (rr) n 4101 const page_id_t id(bpage->id()); (rr) 4117 const byte *frame= bpage->zip.data (rr) 4118 ? bpage->zip.data (rr) 4117 const byte *frame= bpage->zip.data (rr) 4123 if (!buf_page_decrypt_after_read(bpage, node)) (rr) p/x frame[40]@2 $6 = {0x8, 0xf3} (rr) set $f=frame (rr) display/x $f[40]@2 1: /x $f[40]@2 = {0x8, 0xf3} (rr) dump binary memory recovered-page.bin $f $f+srv_page_size (rr) enable (rr) continue At the time of the failed recovery, PAGE_HEAP_TOP was 0xf4a (3914). 
At the time of the page flush (as well as at the start of the recovery), the field was 0x8f3 (2291). The contents of the page only slightly differs between the start of the recovery and what ought to be the corresponding point of logical time before the SIGKILL, in the last 8 bytes of the page. The least 32 bits of the LSN and the page checksum are not updated until the page is about to be written from the buffer pool. I saved the srv_page_size=4096 bytes of dump binary memory and compared the files: diff -u <(od -Ax -t x1 recover_start.bin) <(od -Ax -t x1 before_kill.bin) --- /dev/fd/63 2021-05-31 06:10:33.779551414 +0000 +++ /dev/fd/62 2021-05-31 06:10:33.779551414 +0000 @@ -109,5 +109,5 @@ 0008f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 000fe0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 74 -000ff0 05 b3 03 ff 02 4b 00 65 00 a2 d7 01 8e 48 0a 5d +000ff0 05 b3 03 ff 02 4b 00 65 00 00 00 00 00 00 00 00 001000 So, we should have a sound starting point for examining the divergence. Interestingly, near the start of the recovery, we have a page_create_low() that will reset the PAGE_HEAP_TOP to 0x7d (125). The PAGE_HEAP_TOP will grow to 0x8ba (2059) until it is reset in another page_create_low() . It will monotonically grow until we seemingly get a glitch: 3878, 3932, 4055, 3932, 4043, 2305 , 2414, …, 3803, 3914, 4025 , 3914, (failure). In the rr replay trace that was terminated by SIGKILL, we can observe the changes of PAGE_HEAP_TOP as follows: (rr) set $f=block->frame (rr) display/x $f[40]@2 3: /x $f[40]@2 = {0x8, 0xf3} (rr) watch -l $f[40]@2 Hardware watchpoint 3: -location $f[40]@2 (rr) c Continuing. [Switching to Thread 314150.314741]   Thread 18 hit Hardware watchpoint 3: -location $f[40]@2   Old value = <incomplete sequence \363> New value = "\000" __memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:261 261 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such file or directory. 2: /x log_sys.lsn._M_i = 0xa33520 3: /x $f[40]@2 = {0x0, 0x0} (rr) command 3 Type commands for breakpoint(s) 3, one per line. End with a line saying just "end". >when >c >end (rr) c Continuing. (rr) c Everything seems to be in sync, and also the first monotonicity glitch turns out to be benign. First, the PAGE_HEAP_TOP increases to 4025 because the change buffer tree shrinks due to a change buffer merge during purge (and the records from an adjacent change buffer page are merged to our page). The decerase to 3914 occurs because the very last record of the page is being merged to an index page that is being accessed by dict_stats_analyze_index() (updating persistent statistics). The real divergence occurs later. I saved some rr replay output of monitoring changes to PAGE_HEAP_TOP and compared them: diff -u <(sed -ne 's/.*n=\([1-9][0-9]*\)) at.*/\1/p' heap-top-recovered|uniq) <(sed -ne 's/.*n=\([1-9][0-9]*\)) at.*/\1/p' heap-top-write|uniq) @@ -57,6 +57,25 @@ 4055 3932 4043 +234 +… +2196 2305 2414 2523 @@ -75,3 +94,4 @@ 3914 4025 3914 +4023 For some reason, recovery did not log updates of PAGE_HEAP_TOP from 234 to 2196. The last record must be what recovery was supposed to apply, but did not due to the flagged corruption. 
Let us check whether redo logging was properly enabled during this page_create_low() : Thread 45 hit Hardware watchpoint 3: -location $f[40]@2   Old value = "\017\\" New value = <incomplete sequence \313> mach_write_to_2 (b=0x64000047b028 <incomplete sequence \313>, n=4043) at /data/Server/bb-10.5-marko/storage/innobase/include/mach0data.ic:62 62 } 2: /x log_sys.lsn._M_i = 0xa4f621 3: /x $f[40]@2 = {0xf, 0xcb} Current event: 598722 [Switching to Thread 314150.314325]   Thread 8 hit Hardware watchpoint 3: -location $f[40]@2   Old value = <incomplete sequence \313> New value = "\000" __memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:261 261 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such file or directory. 2: /x log_sys.lsn._M_i = 0xa6230e 3: /x $f[40]@2 = {0x0, 0x0} Current event: 614654   Thread 8 hit Hardware watchpoint 3: -location $f[40]@2   Old value = "\000" New value = "\000}" page_create_low (block=0x64000007d178, comp=false) at /data/Server/bb-10.5-marko/storage/innobase/page/page0page.cc:311 311 memcpy(page + PAGE_DATA, infimum_supremum_redundant, 2: /x log_sys.lsn._M_i = 0xa6230e 3: /x $f[40]@2 = {0x0, 0x7d} Current event: 614654 That call is within a change buffer page reorganize: #3 0x000055c213bff4a4 in page_create (block=0x64000007d178, mtr=0x1ff90f650b70, comp=false) at /data/Server/bb-10.5-marko/storage/innobase/page/page0page.cc:331 #4 0x000055c213dd847e in btr_page_reorganize_low (cursor=0x1ff90f64f740, index=0x616000006ff0, mtr=0x1ff90f650b70) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0btr.cc:1371 #5 0x000055c213dda7c2 in btr_page_reorganize_block (z_level=6, block=0x64000007d178, index=0x616000006ff0, mtr=0x1ff90f650b70) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0btr.cc:1606 #6 0x000055c213dea411 in btr_can_merge_with_page (cursor=0x1ff90f650800, page_no=1160, merge_block=0x1ff90f64f940, mtr=0x1ff90f650b70) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0btr.cc:5167 #7 0x000055c213de1e48 in btr_compress (cursor=0x1ff90f650800, adjust=0, mtr=0x1ff90f650b70) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0btr.cc:3483 #8 0x000055c213e100d7 in btr_cur_compress_if_useful (cursor=0x1ff90f650800, adjust=0, mtr=0x1ff90f650b70) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc:5468 #9 0x000055c213e11b57 in btr_cur_pessimistic_delete (err=0x1ff90f650690, has_reserved_extents=1, cursor=0x1ff90f650800, flags=0, rollback=false, mtr=0x1ff90f650b70) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc:5908 #10 0x000055c213b4e8e9 in ibuf_delete_rec (page_id=..., pcur=0x1ff90f650800, search_tuple=0x616000874bf0, mtr=0x1ff90f650b70) at /data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:4125 #11 0x000055c213b50024 in ibuf_merge_or_delete_for_page (block=0x640000043928, page_id=..., zip_size=0) at /data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:4471 Starting with MDEV-21725 , the page reorganize is being logged in an optimized way, instead of being logged as individual inserts. Before MDEV-12353 , the operation was covered by a logical log record. 
Immediately after the page reorganize operation, the page contents between recovery and the original write differ as follows: diff -u <(od -Ax -t x1 /dev/shm/r-after-reorg.bin) <(od -Ax -t x1 /dev/shm/w-after-reorg.bin) --- /dev/fd/63 2021-05-31 08:20:19.561244871 +0000 +++ /dev/fd/62 2021-05-31 08:20:19.561244871 +0000 @@ -1,5 +1,5 @@ 000000 00 00 00 00 00 00 04 88 00 00 04 89 00 00 04 87 -000010 00 00 00 00 00 a2 d7 01 45 bf 00 00 00 00 00 00 +000010 00 00 00 00 00 a5 07 15 45 bf 00 00 00 00 00 00 000020 00 00 00 00 00 00 00 05 08 94 00 15 00 00 00 00 000030 08 33 00 02 00 12 00 13 00 00 00 00 00 00 21 5b 000040 00 00 ff ff ff ff 00 00 00 00 ff ff ff ff 00 00 @@ -112,5 +112,5 @@ 0008a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 000fe0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 74 -000ff0 05 38 03 84 01 d0 00 65 00 a2 d7 01 8e 48 0a 5d +000ff0 05 38 03 84 01 d0 00 65 00 00 00 00 00 00 00 00 001000 On recovery, the FIL_PAGE_LSN is 0xa2d701, which was the initial FIL_PAGE_LSN of the page. The field might only be updated by recovery after all log has been applied to the page. We have the LSN and checksum difference at the end of the page as well. It is nothing to worry about. Similarly, the page contents immediately before the failed log apply are equivalent: @@ -1,5 +1,5 @@ 000000 00 00 00 00 00 00 04 88 00 00 04 89 ff ff ff ff -000010 00 00 00 00 00 a2 d7 01 45 bf 00 00 00 00 00 00 +000010 00 00 00 00 00 a6 49 aa 45 bf 00 00 00 00 00 00 000020 00 00 00 00 00 00 00 0a 0f 4a 00 25 0e ea 00 6f 000030 00 00 00 02 00 23 00 22 00 00 00 00 00 00 22 d5 000040 00 00 ff ff ff ff 00 00 00 00 ff ff ff ff 00 00 @@ -195,5 +195,5 @@ 000ef0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 000fe0 00 00 00 00 00 74 0d bc 0c 08 0a 54 08 a0 06 ec -000ff0 05 38 03 84 01 d0 00 65 00 a2 d7 01 8e 48 0a 5d +000ff0 05 38 03 84 01 d0 00 65 00 00 00 00 00 00 00 00 001000 So, the problem is only with the last record. 
We must look closer at this: Current event: 630234 (rr) bt #0 mach_write_to_2 (b=0x64000047b028 "\017\267", n=4023) at /data/Server/bb-10.5-marko/storage/innobase/include/mach0data.ic:62 #1 0x000055c213bf9f52 in page_mem_alloc_heap<>(buf_block_t *, ulint, ulint *) (block=0x64000007d178, need=109, heap_no=0x1f9a2c2ca1d0) at /data/Server/bb-10.5-marko/storage/innobase/page/page0cur.cc:1071 #2 0x000055c213bf1062 in page_cur_insert_rec_low (cur=0x1f9a2c2cab58, index=0x616000006ff0, rec=0x6110003ae9fc "", offsets=0x61600066e1f0, mtr=0x1f9a2c2cad00) at /data/Server/bb-10.5-marko/storage/innobase/page/page0cur.cc:1388 #3 0x000055c213dfb53d in page_cur_tuple_insert (cursor=0x1f9a2c2cab58, tuple=0x61900228cbf0, index=0x616000006ff0, offsets=0x1f9a2c2caab0, heap=0x1f9a2c2caa90, n_ext=0, mtr=0x1f9a2c2cad00) at /data/Server/bb-10.5-marko/storage/innobase/include/page0cur.ic:285 #4 0x000055c213e09bfd in btr_cur_optimistic_insert (flags=3, cursor=0x1f9a2c2cab50, offsets=0x1f9a2c2caab0, heap=0x1f9a2c2caa90, entry=0x61900228cbf0, rec=0x1f9a2c2caaf0, big_rec=0x1f9a2c2caa70, n_ext=0, thr=0x621004e35de0, mtr=0x1f9a2c2cad00) at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc:3562 #5 0x000055c213b4aad5 in ibuf_insert_low (mode=36, op=IBUF_OP_INSERT, no_counter=0, entry=0x61a000844508, entry_size=77, index=0x6160000f03f0, page_id=..., zip_size=0, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:3403 #6 0x000055c213b4b2f7 in ibuf_insert (op=IBUF_OP_INSERT, entry=0x61a000844508, index=0x6160000f03f0, page_id=..., zip_size=0, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:3603 A glaring difference is that reuse=false becomes reuse=true on recovery: … (rr) step 1595 mtr->page_insert(*block, reuse, (rr) step mtr_t::page_insert (this=0x64000047bf52, block=..., reuse=false, prev_rec=1, info_bits=0 '\000', n_fields_s=94292043288653, hdr_c=5, data_c=8, hdr=0x64000047bf4a "aY\031\t\005\004", hdr_l=1, data=0x6110003aea04 "\200", data_l=89) at /data/Server/bb-10.5-marko/storage/innobase/page/page0cur.cc:1103 The actual mystery is: why the log that was written differs from what was buffered by recovery: {0x20, 0x55, 0x0, 0x84, 0x8, 0x4, 0x8c, 0xd7, 0x2c, 0x5, 0x8, 0x61, 0x80, 0x0, 0x0, 0x0, 0x1, 0x83, 0xfe, 0x0, 0x40, 0x80, 0x3f, 0x86, 0x8, 0x0, 0x8, 0x80, 0x0, 0x62, 0x0 <repeats 69 times>, 0xf8, 0xfd} {0x20, 0x53, 0x0, 0x84, 0x8, 0x5, 0x89, 0xdc, 0x2c, 0x5, 0xa, 0x61, 0x2, 0x0, 0x1, 0x83, 0xfe, 0x0, 0x40, 0x80, 0x3f, 0x86, 0x8, 0x0, 0x8, 0x80, 0x0, 0x66, 0x0 <repeats 69 times>, 0xf8, 0xfa} The least significant bit of the first differing byte encodes the reuse flag. Let us check which string was present in the ib_logfile0 before recovery was started. (Note: this will miss records that would span the block boundary.) od -Ax -t x1 -w512 ib_logfile0 |grep '20 55 00 84 08 04 8c d7 2c'|cut -d\ -f1 od -Ax -t x1 -w512 ib_logfile0 |grep '20 53 00 84 08 05 89 dc 2c'|cut -d\ -f1 Surprisingly, we find exactly one match for both. The first match occurs at the end of the 512-byte block that starts at file offset 0xa70800. 
The last 4 bytes are a log block checksum:

20 55 00 84 08 04 8c d7 2c 05 08 61 80 00 00 15 84 96 bc

The record continues after the 4-byte checksum and the 12-byte header of the following block:

00 01 83 fe 00 40 80 3f 86 08 00 08 80 00 62 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f8 fd b2 27 e2 00

This looks like an exact match of the record that we wrote. Before the NUL byte that marks the end of the mini-transaction, there is another record: a WRITE to the same page.

Now, let us look at the other match, which looks like what recovery was reading. It is located in the 512-byte block starting at byte offset 0xa68c00 of ib_logfile0:

od -Ax -t x1 -w512 -N 99 -j 0xa68c4a ib_logfile0
a68c4a 20 53 00 84 08 05 89 dc 2c 05 0a 61 02 00 01 83 fe 00 40 80 3f 86 08 00 08 80 00 66 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f8 fa

It is an exact match of the record that was read by recovery! The problem appears to be on the write side after all. I must debug further to find out when those bytes were written, and for which mini-transaction.
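This framing also explains the caveat about records spanning a block boundary: of each 512-byte log block, only 512 - 12 - 4 bytes are record payload. A small sketch of reading record bytes across that framing, using only the header/trailer sizes mentioned above and the offsets from this trace (a debugging aid, not server code):

// Sketch: read `len` redo record payload bytes from ib_logfile0 starting at
// file offset `pos`, skipping the 4-byte checksum at the end of each 512-byte
// block and the 12-byte header of the next block.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <vector>

static constexpr uint64_t BLOCK = 512; // OS_FILE_LOG_BLOCK_SIZE
static constexpr uint64_t HDR   = 12;  // log block header
static constexpr uint64_t TRL   = 4;   // log block checksum trailer

static std::vector<uint8_t> read_payload(std::ifstream &f, uint64_t pos, uint64_t len)
{
  std::vector<uint8_t> out;
  while (out.size() < len)
  {
    uint64_t in_block = pos % BLOCK;
    if (in_block < HDR)               // inside a block header: skip it
      pos += HDR - in_block;
    else if (in_block >= BLOCK - TRL) // inside a checksum trailer: next block
      pos += (BLOCK - in_block) + HDR;
    in_block = pos % BLOCK;
    const uint64_t chunk = std::min(len - out.size(), BLOCK - TRL - in_block);
    std::vector<char> buf(chunk);
    f.seekg(std::streamoff(pos));
    f.read(buf.data(), std::streamsize(chunk));
    out.insert(out.end(), buf.begin(), buf.end());
    pos += chunk;
  }
  return out;
}

int main()
{
  std::ifstream f("ib_logfile0", std::ios::binary);
  // The 101-byte record that was written starts 19 bytes (15 payload bytes
  // plus the 4-byte checksum) before the end of the block at 0xa70800,
  // i.e. at file offset 0xa709ed.
  std::vector<uint8_t> rec = read_payload(f, 0xa70800 + BLOCK - 19, 101);
  for (uint8_t b : rec)
    std::printf("%02x ", b);
  std::putchar('\n');
}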

            Because no wrap-around of the redo log has taken place (the payload of the ib_logfile0 file ends at 0xa99000), it should be rather simple to map the LSN to byte offsets in the file. Right before the SIGKILL, we got the following writes to the log file:

            Thread 23 hit Breakpoint 2, log_write_buf (buf=0x6d6302ac9000 "\200", len=512, start_lsn=11115520, new_data_offset=376) at /data/Server/bb-10.5-marko/storage/innobase/log/log0log.cc:562
            562		ut_a(len % OS_FILE_LOG_BLOCK_SIZE == 0);
            Current event: 670464
            [Switching to Thread 314150.340446]
             
            Thread 1 hit Breakpoint 2, log_write_buf (buf=0x4b191636c000 "\200", len=2048, start_lsn=11115520, new_data_offset=430) at /data/Server/bb-10.5-marko/storage/innobase/log/log0log.cc:562
            562		ut_a(len % OS_FILE_LOG_BLOCK_SIZE == 0);
            Current event: 670870
            [Switching to Thread 314150.340452]
             
            Thread 31 hit Breakpoint 2, log_write_buf (buf=0x6d6302ac9000 "\200", len=512, start_lsn=11117056, new_data_offset=116) at /data/Server/bb-10.5-marko/storage/innobase/log/log0log.cc:562
            562		ut_a(len % OS_FILE_LOG_BLOCK_SIZE == 0);
            Current event: 671278
            [Switching to Thread 314150.340446]
             
            Thread 1 hit Breakpoint 2, log_write_buf (buf=0x4b191636c000 "\200", len=1024, start_lsn=11117056, new_data_offset=312) at /data/Server/bb-10.5-marko/storage/innobase/log/log0log.cc:562
            562		ut_a(len % OS_FILE_LOG_BLOCK_SIZE == 0);
            Current event: 671593
             
            Thread 18 received signal SIGKILL, Killed.
            

            At the time of the SIGKILL, the log_sys.lsn was 0xa9a526. The last start_lsn above is 0xa9a200, and the log_sys.write_lsn is 0xa9a526. It looks like the LSN is the file byte offset plus 0x1a00 (6656). Hence, the record for LSN 0xa723ed should be in the file at offset 0xa709ed. The unexpected record at 0xa68c4a must correspond to the LSN 0xa6a64a. Here we have it:

            1: /x log_sys.lsn._M_i = 0xa6a64a
            Current event: 623570
            (rr) bt
            #0  mtr_t::finish_write (this=0x55c213ae4877 <ilist<mtr_buf_t::block_t, void>::back() const+217>, len=35773916118888) at /data/Server/bb-10.5-marko/storage/innobase/mtr/mtr0mtr.cc:863
            #1  0x000055c213bcc798 in mtr_t::commit (this=0x2089438ffd00) at /data/Server/bb-10.5-marko/storage/innobase/mtr/mtr0mtr.cc:408
            #2  0x000055c213b3f878 in ibuf_mtr_commit (mtr=0x2089438ffd00) at /data/Server/bb-10.5-marko/storage/innobase/include/ibuf0ibuf.ic:64
            #3  0x000055c213b4ae1b in ibuf_insert_low (mode=36, op=IBUF_OP_INSERT, no_counter=0, entry=0x61a000844508, entry_size=77, index=0x6160000f03f0, page_id=..., zip_size=0, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:3466
            #4  0x000055c213b4b2f7 in ibuf_insert (op=IBUF_OP_INSERT, entry=0x61a000844508, index=0x6160000f03f0, page_id=..., zip_size=0, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:3603
            #5  0x000055c213e02647 in btr_cur_search_to_nth_level_func (index=0x6160000f03f0, level=0, tuple=0x61a000844508, mode=PAGE_CUR_LE, latch_mode=2, cursor=0x208943901ab0, ahi_latch=0x0, file=0x55c214795e40 "/data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc", line=2974, mtr=0x208943901e70, autoinc=0)
                at /data/Server/bb-10.5-marko/storage/innobase/btr/btr0cur.cc:1650
            #6  0x000055c213c8b3e2 in row_ins_sec_index_entry_low (flags=0, mode=2, index=0x6160000f03f0, offsets_heap=0x619000cb9d80, heap=0x619000cb8e80, entry=0x61a000844508, trx_id=0, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc:2971
            #7  0x000055c213c8c59f in row_ins_sec_index_entry (index=0x6160000f03f0, entry=0x61a000844508, thr=0x621004e35de0, check_foreign=true) at /data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc:3294
            #8  0x000055c213c8c784 in row_ins_index_entry (index=0x6160000f03f0, entry=0x61a000844508, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc:3342
            #9  0x000055c213c8d14d in row_ins_index_entry_step (node=0x6220002b30f0, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc:3509
            #10 0x000055c213c8d6b6 in row_ins (node=0x6220002b30f0, thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc:3668
            #11 0x000055c213c8de86 in row_ins_step (thr=0x621004e35de0) at /data/Server/bb-10.5-marko/storage/innobase/row/row0ins.cc:3818
            #12 0x000055c213cb83ea in row_insert_for_mysql (mysql_rec=0x61f0001f80b8 "\377f", prebuilt=0x6220002b2170, ins_mode=ROW_INS_NORMAL) at /data/Server/bb-10.5-marko/storage/innobase/row/row0mysql.cc:1419
            #13 0x000055c213a3b359 in ha_innobase::write_row (this=0x61d0008638b8, record=0x61f0001f80b8 "\377f") at /data/Server/bb-10.5-marko/storage/innobase/handler/ha_innodb.cc:7669
            

            This was not found during my initial search, because I only watched the changes of the header field PAGE_HEAP_TOP. A little before the mtr_t::commit(), we actually did write the record that recovery failed to apply (note: reuse=true):

            #0  mtr_t::page_insert (this=0x2089438ffd00, block=..., reuse=true, prev_rec=2652, info_bits=0 '\000', n_fields_s=13, hdr_c=5, data_c=10, hdr=0x64000047bedb "aY\031\t\005\004", hdr_l=1, data=0x611000383e46 "\002", data_l=87) at /data/Server/bb-10.5-marko/storage/innobase/page/page0cur.cc:1118
            #1  0x000055c213bf2303 in page_cur_insert_rec_low (cur=0x2089438ffb58, index=0x616000006ff0, rec=0x611000383e3c "", offsets=0x61600006a2f0, mtr=0x2089438ffd00) at /data/Server/bb-10.5-marko/storage/innobase/page/page0cur.cc:1595
            

            Now, let us compare the ‘before’ images of the page, before this mini-transaction was started, and before we failed to apply the log record:

            diff -u <(od -Ax -t x1 recovery-before-last.bin) <(od -Ax -t x1 write-before.bin)
            

            --- /dev/fd/63	2021-05-31 11:43:47.758752427 +0000
            +++ /dev/fd/62	2021-05-31 11:43:47.758752427 +0000
            @@ -1,5 +1,5 @@
             000000 00 00 00 00 00 00 04 88 00 00 04 89 ff ff ff ff
            -000010 00 00 00 00 00 a2 d7 01 45 bf 00 00 00 00 00 00
            +000010 00 00 00 00 00 a6 4e aa 45 bf 00 00 00 00 00 00
             000020 00 00 00 00 00 00 00 0a 0f 4a 00 25 0e ea 00 6f
             000030 00 00 00 02 00 23 00 22 00 00 00 00 00 00 22 d5
             000040 00 00 ff ff ff ff 00 00 00 00 ff ff ff ff 00 00
            @@ -195,5 +195,5 @@
             000ef0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
             *
             000fe0 00 00 00 00 00 74 0d bc 0c 08 0a 54 08 a0 06 ec
            -000ff0 05 38 03 84 01 d0 00 65 00 a2 d7 01 8e 48 0a 5d
            +000ff0 05 38 03 84 01 d0 00 65 00 00 00 00 00 00 00 00
             001000
            

            Again, only the LSN and checksum fields differ. Recovery failed because it concluded that reusing the record at the top of the PAGE_FREE stack would exceed the PAGE_HEAP_TOP 0xf4a. Let us check whether that actually was the case. The PAGE_FREE points to 0xeea, which is the only record in the PAGE_FREE stack. PAGE_GARBAGE is 0x6f (111) bytes. At the time the log record for this write was generated, we notably have a negative free_offset:

            0x000055c213bf22fe in page_cur_insert_rec_low (cur=0x2089438ffb58, index=0x616000006ff0, rec=0x611000383e3c "", offsets=0x61600006a2f0, mtr=0x2089438ffd00) at /data/Server/bb-10.5-marko/storage/innobase/page/page0cur.cc:1595
            1595	      mtr->page_insert(*block, reuse,
            1: /x log_sys.lsn._M_i = 0xa6a64a
            (rr) i lo
            r = 0x611000383e46 "\002"
            c = 0x64000047bacb "\001"
            c_end = 0x64000047bb22 "aY\031\t\005\004"
            data_common = 10
            block = 0x64000007d178
            rec_size = 109
            reuse = true
            free_offset = -3
            heap_no = 36
            insert_buf = 0x64000047bedb "aY\031\t\005\004"
            comp = false
            extra_size = 12
            next_rec = 0x64000047bc08 ""
            page_last_insert = 0x64000047b030 <incomplete sequence \347>
            last_insert = 0
            page_n_recs = 0x64000047b036 ""
            insert_rec = 0x64000047bee7 ""
            data_size = 97
            hdr_common = 5
            n_owned = 4
            info_status = 0 '\000'
            

            We are attempting to insert a record of 12+97 bytes, while the record in the PAGE_FREE stack was 15+96 bytes. The origin of the record will move by 3 bytes from 0xeea to 0xee7 accordingly (see insert_rec above). All seems well and good on the write side. It is the recovery side that appears to be broken. Note: this LSN 0xa6a64a matches the start_lsn in the recovery.
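
            To spell out that arithmetic, here is a small sketch using only the values visible in the page dump and the gdb locals above (an illustration, not the server code):

            // Sketch of the arithmetic above: reusing the freed 15+96-byte record at
            // PAGE_FREE=0xeea for a new 12+97-byte record moves the record origin by
            // free_offset = 12 - 15 = -3, from 0xeea to 0xee7, and stays within the page.
            #include <cassert>
            #include <cstdio>

            int main()
            {
              const unsigned heap_top    = 0xf4a; // PAGE_HEAP_TOP from the page header
              const unsigned free_rec    = 0xeea; // PAGE_FREE: origin of the freed record
              const unsigned fextra_size = 15;    // header bytes of the freed record
              const unsigned fdata_size  = 96;    // data bytes of the freed record
              const unsigned extra_size  = 12;    // header bytes of the new record
              const unsigned data_size   = 97;    // data bytes of the new record

              // PAGE_GARBAGE = 0x6f = 111 = 15 + 96: the freed record accounts for all of it.
              assert(fextra_size + fdata_size == 0x6f);
              // The new record fits into the freed slot (109 <= 111 bytes).
              assert(extra_size + data_size <= fextra_size + fdata_size);

              // The insert buffer starts where the freed record's header started,
              // so the origin shifts by the difference of the header sizes.
              const int free_offset = int(extra_size) - int(fextra_size);      // -3
              const unsigned insert_rec = free_rec + free_offset;              // 0xee7
              std::printf("free_offset=%d, insert_rec=0x%x\n", free_offset, insert_rec);

              // The freed record's data ends exactly at PAGE_HEAP_TOP: 0xeea + 96 = 0xf4a.
              std::printf("freed record end: 0x%x, heap_top: 0x%x\n",
                          free_rec + fdata_size, heap_top);
            }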


            After long debugging, I think that I found the bug. It is obvious when you look at the preceding and following lines. This affects the 10.5 and 10.6 series only.

            diff --git a/storage/innobase/page/page0cur.cc b/storage/innobase/page/page0cur.cc
            index 025d4b9b967..f75d1a8be77 100644
            --- a/storage/innobase/page/page0cur.cc
            +++ b/storage/innobase/page/page0cur.cc
            @@ -2443,7 +2443,7 @@ bool page_apply_insert_redundant(const buf_block_t &block, bool reuse,
                 if (UNIV_UNLIKELY(free_rec - fextra_size < heap_bot))
                   goto corrupted;
                 const ulint fdata_size= rec_get_data_size_old(free_rec);
            -    if (UNIV_UNLIKELY(free_rec + data_size > heap_top))
            +    if (UNIV_UNLIKELY(free_rec + fdata_size > heap_top))
                   goto corrupted;
                 if (UNIV_UNLIKELY(extra_size + data_size > fextra_size + fdata_size))
                   goto corrupted;
            
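
            With the concrete values from this page, the one-byte difference between those two comparisons is exactly what produced the bogus corruption report. A minimal illustration (again, just the two comparisons, not the actual server code):

            // Sketch: the recovery-side bounds check with the values from this page.
            // The freed record's own data size must be compared against PAGE_HEAP_TOP;
            // comparing the *new* record's data size overshoots by one byte here.
            #include <cstdio>

            int main()
            {
              const unsigned heap_top   = 0xf4a; // PAGE_HEAP_TOP
              const unsigned free_rec   = 0xeea; // record at the top of the PAGE_FREE stack
              const unsigned fdata_size = 96;    // data size of that freed record
              const unsigned data_size  = 97;    // data size of the record being inserted

              // Buggy check (pre-fix): 0xeea + 97 = 0xf4b > 0xf4a -> "corruption"
              std::printf("buggy: free_rec + data_size  = 0x%x > 0x%x ? %s\n",
                          free_rec + data_size, heap_top,
                          free_rec + data_size > heap_top ? "yes" : "no");
              // Fixed check: 0xeea + 96 = 0xf4a, not greater than PAGE_HEAP_TOP -> OK
              std::printf("fixed: free_rec + fdata_size = 0x%x > 0x%x ? %s\n",
                          free_rec + fdata_size, heap_top,
                          free_rec + fdata_size > heap_top ? "yes" : "no");
            }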


            People

              Assignee: marko Marko Mäkelä
              Reporter: tomitukiainen Tomi Tukiainen
              Votes: 1
              Watchers: 6

