Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-33137

Assertion `end_lsn == page_lsn' failed in recv_recover_page

Details

    Description

      11.1 50799752dc6306c35a47405c8f556d0b9639a1a9

      2023-12-28 23:21:30 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=159699380
      2023-12-28 23:21:31 0 [Note] InnoDB: End of log at LSN=173644026
      2023-12-28 23:21:31 0 [Note] InnoDB: To recover: 1098 pages
      mysqld: /data/MDEV-33137/11.1/storage/innobase/log/log0recv.cc:3139: buf_block_t* recv_recover_page(buf_block_t*, mtr_t&, page_recv_t&, fil_space_t*, lsn_t): Assertion `end_lsn == page_lsn' failed.
      (rr) bt
      #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
      #1  0x000014fdbb011859 in __GI_abort () at abort.c:79
      #2  0x000014fdbb011729 in __assert_fail_base (fmt=0x14fdbb1a7588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55fb4d1115df "end_lsn == page_lsn", 
          file=0x55fb4d10f8b8 "/data/MDEV-33137/11.1/storage/innobase/log/log0recv.cc", line=3139, function=<optimized out>) at assert.c:92
      #3  0x000014fdbb022fd6 in __GI___assert_fail (assertion=0x55fb4d1115df "end_lsn == page_lsn", file=0x55fb4d10f8b8 "/data/MDEV-33137/11.1/storage/innobase/log/log0recv.cc", line=3139, 
          function=0x55fb4d111410 "buf_block_t* recv_recover_page(buf_block_t*, mtr_t&, page_recv_t&, fil_space_t*, lsn_t)") at assert.c:101
      #4  0x000055fb4c86451c in recv_recover_page (block=0x68201ad0, mtr=..., recs=..., space=0x55fb4ff8be70, init_lsn=0) at /data/MDEV-33137/11.1/storage/innobase/log/log0recv.cc:3139
      #5  0x000055fb4c865308 in recv_recover_page (space=0x55fb4ff8be70, bpage=0x68201ad0) at /data/MDEV-33137/11.1/storage/innobase/log/log0recv.cc:3360
      #6  0x000055fb4ca72dd5 in buf_page_t::read_complete (this=0x68201ad0, node=...) at /data/MDEV-33137/11.1/storage/innobase/buf/buf0buf.cc:3532
      #7  0x000055fb4cafbccd in IORequest::read_complete (this=0x612b26e7c2b0, io_error=0) at /data/MDEV-33137/11.1/storage/innobase/fil/fil0fil.cc:2980
      #8  0x000055fb4c8a9521 in read_io_callback (c=0x612b26e7c208) at /data/MDEV-33137/11.1/storage/innobase/os/os0file.cc:3416
      #9  0x000055fb4cb8df7c in tpool::task_group::execute (this=0x55fb4ff5c6c0, t=0x612b26e7c290) at /data/MDEV-33137/11.1/tpool/task_group.cc:70
      #10 0x000055fb4cb8e2c8 in tpool::task::execute (this=0x612b26e7b228) at /data/MDEV-33137/11.1/tpool/task.cc:32
      #11 0x000055fb4cb877c1 in tpool::thread_pool_generic::worker_main (this=0x55fb4ff18370, thread_var=0x55fb4ff18730) at /data/MDEV-33137/11.1/tpool/tpool_generic.cc:583
      #12 0x000055fb4cb8dd70 in std::__invoke_impl<void, void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__f=
          @0x55fb4ffb1238: (void (tpool::thread_pool_generic::*)(class tpool::thread_pool_generic * const, struct tpool::worker_data *)) 0x55fb4cb876f2 <tpool::thread_pool_generic::worker_main(tpool::worker_data*)>, __t=@0x55fb4ffb1230: 0x55fb4ff18370) at /usr/include/c++/9/bits/invoke.h:73
      #13 0x000055fb4cb8dc52 in std::__invoke<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__fn=
          @0x55fb4ffb1238: (void (tpool::thread_pool_generic::*)(class tpool::thread_pool_generic * const, struct tpool::worker_data *)) 0x55fb4cb876f2 <tpool::thread_pool_generic::worker_main(tpool::worker_data*)>) at /usr/include/c++/9/bits/invoke.h:95
      #14 0x000055fb4cb8db71 in std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::_M_invoke<0ul, 1ul, 2ul> (
          this=0x55fb4ffb1228) at /usr/include/c++/9/thread:244
      #15 0x000055fb4cb8db0c in std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::operator() (this=0x55fb4ffb1228)
          at /usr/include/c++/9/thread:251
      #16 0x000055fb4cb8daec in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::_M_run
          (this=0x55fb4ffb1220) at /usr/include/c++/9/thread:195
      #17 0x000055fb4e428de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
      #18 0x00000c5f66746609 in start_thread (arg=<optimized out>) at pthread_create.c:477
      #19 0x000014fdbb10e133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      

      It looks like the failure appeared after this merge in 10.11

      commit 583a7452992acbb3c9bd74122c4f0086b9e16997
      Merge: 45fadb64c1c 9c5600adde6
      Author: Marko Mäkelä
      Date:   Tue Nov 21 10:23:11 2023 +0200
       
          Merge 10.6 into 10.11
      

      However I couldn't reproduce it on 10.6.

      A couple of test runs which reproduce the failure currently fairly well:

      randgen 3468ff84518decaa5752778562bfe79444b201c6

      # on 11.1, 11.2
      perl ./run.pl --compatibility=110299 --engine=InnoDB --mysqld=--innodb-encrypt-tables --mysqld=--innodb-encrypt-log --mysqld=--innodb-encryption-threads=4 --mysqld=--plugin-load-add=file_key_management --mysqld=--file-key-management-filename=`pwd`/util/file_key_management_keys.txt  --grammar=conf/yy/desc_indexes.yy --scenario=AtomicDDL --filter=conf/ff/restrict_dynamic_vars.ff  --gendata=conf/zz/trx_stress.zz  --queries=1000000   --threads=6 --duration=200 --mysqld=--max-statement-time=30 --mysqld=--lock-wait-timeout=15 --mysqld=--innodb-lock-wait-timeout=10  --base-port=14000 --basedir=/data/bld/10.11-asan --vardir=/dev/shm/var12 --seed=1703604571 --nometadata-reload
       
      # on 10.11
      perl ./run.pl --compatibility=101199 --duration=600 --mysqld=--max-statement-time=60 --mysqld=--lock-wait-timeout=30 --mysqld=--innodb-lock-wait-timeout=15 --threads=6 --mysqld=--in_predicate_conversion_threshold=0 --mysqld=--session_track_transaction_info=STATE --mysqld=--strict_password_validation=OFF --mysqld=--expensive_subquery_limit=1000  --queries=1000000 --reporters=Backtrace,Deadlock,MemoryUsage,FeatureUsage --mysqld=--plugin-maturity=experimental --mysqld=--loose-debug_assert_on_not_freed_memory=0 --gendata=advanced --gendata=simple --mysqld=--innodb-encrypt-tables --mysqld=--innodb-encrypt-log --mysqld=--innodb-encryption-threads=4 --mysqld=--plugin-load-add=file_key_management --mysqld=--file-key-management-filename=`pwd`/util/file_key_management_keys.txt --mysqld=--character-set-server=latin1 --mysqld=--collation-server=latin1_swedish_ci --engine=InnoDB  --gendata=simple --grammar=conf/yy/replication.yy --grammar=conf/yy/current_timestamp.yy --gendata=conf/zz/current_timestamp.zz   --grammar=conf/yy/alter_table.yy --grammar=conf/yy/all_selects.yy:0.0001 --scenario=AtomicDDL --filter=conf/ff/restrict_dynamic_vars.ff --variator=ExecuteAsUnion --base-port=14000 --basedir=/data/bld/10.11-bug-lsn --vardir=/dev/shm/var5 --seed=1701900615
      

      Attachments

        Issue Links

          Activity

            The assertion failure is directly connected with something that we have been observing on many non-debug builders, related to MDEV-22110 and MDEV-24705:

            			ut_ad(end_lsn == page_lsn);
            			if (end_lsn != page_lsn) {
            				sql_print_information(
            					"InnoDB: The last skipped log record"
            					" LSN " LSN_PF
            					" is not equal to page LSN " LSN_PF,
            					end_lsn, page_lsn);
            			}
            

            marko Marko Mäkelä added a comment - The assertion failure is directly connected with something that we have been observing on many non-debug builders, related to MDEV-22110 and MDEV-24705 : ut_ad(end_lsn == page_lsn); if (end_lsn != page_lsn) { sql_print_information( "InnoDB: The last skipped log record" " LSN " LSN_PF " is not equal to page LSN " LSN_PF, end_lsn, page_lsn); }

            Here is some data from the recovery run:

            11.1 50799752dc6306c35a47405c8f556d0b9639a1a9

            (rr) print *l
            $3 = {<log_rec_t> = {next = 0x22e056ae9400, lsn = 167115528}, start_lsn = 167115222}
            (rr) i lo lsn
            page_lsn = 159994270
            start_lsn = 0
            end_lsn = 159993940
            recv_start_lsn = 167115222
            (rr) print recv_sys.lsn
            $5 = 173644026
            (rr) print log_sys.next_checkpoint_lsn
            $6 = 159699380
            

            That is, we have a FIL_PAGE_LSN that is somewhere between the recovery start LSN (the checkpoint LSN) and the end of the recovered log. These do correspond to the log output:

            11.1 50799752dc6306c35a47405c8f556d0b9639a1a9

            2023-12-28 23:21:30 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=159699380
            2023-12-28 23:21:31 0 [Note] InnoDB: End of log at LSN=173644026
            2023-12-28 23:21:31 0 [Note] InnoDB: To recover: 1098 pages
            mysqld: /data/MDEV-33137/11.1/storage/innobase/log/log0recv.cc:3139: buf_block_t* recv_recover_page(buf_block_t*, mtr_t&, page_recv_t&, fil_space_t*, lsn_t): Assertion `end_lsn == page_lsn' failed.
            

            The last checkpoint had been completed rather shortly before the server had been killed:

            11.1 50799752dc6306c35a47405c8f556d0b9639a1a9

            Thread 4 hit Hardware watchpoint 3: -location log_sys.next_checkpoint_lsn
             
            Old value = 154809477
            New value = 159699380
            log_checkpoint_low (oldest_lsn=159699380, end_lsn=167367010) at /data/MDEV-33137/11.1/storage/innobase/buf/buf0flu.cc:1971
            1971	  log_sys.write_checkpoint(end_lsn);
            Current event: 1159849
            

            At that point the oldest_modification of the page 0x100000005 was 167115222, which is exactly the same as the local debug variable recv_start_lsn in the recovery trace. The write of the page (with FIL_PAGE_LSN=159994270) had taken place when log_sys.lsn and log_sys.flushed_to_disk_lsn were both 161771086, at when 1120317 of the server process. At the time of the write completion, the buf_flush_page_cleaner was in a timed wait for the start of its next 1-second slice.

            marko Marko Mäkelä added a comment - Here is some data from the recovery run: 11.1 50799752dc6306c35a47405c8f556d0b9639a1a9 (rr) print *l $3 = {<log_rec_t> = {next = 0x22e056ae9400, lsn = 167115528}, start_lsn = 167115222} (rr) i lo lsn page_lsn = 159994270 start_lsn = 0 end_lsn = 159993940 recv_start_lsn = 167115222 (rr) print recv_sys.lsn $5 = 173644026 (rr) print log_sys.next_checkpoint_lsn $6 = 159699380 That is, we have a FIL_PAGE_LSN that is somewhere between the recovery start LSN (the checkpoint LSN) and the end of the recovered log. These do correspond to the log output: 11.1 50799752dc6306c35a47405c8f556d0b9639a1a9 2023-12-28 23:21:30 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=159699380 2023-12-28 23:21:31 0 [Note] InnoDB: End of log at LSN=173644026 2023-12-28 23:21:31 0 [Note] InnoDB: To recover: 1098 pages mysqld: /data/MDEV-33137/11.1/storage/innobase/log/log0recv.cc:3139: buf_block_t* recv_recover_page(buf_block_t*, mtr_t&, page_recv_t&, fil_space_t*, lsn_t): Assertion `end_lsn == page_lsn' failed. The last checkpoint had been completed rather shortly before the server had been killed: 11.1 50799752dc6306c35a47405c8f556d0b9639a1a9 Thread 4 hit Hardware watchpoint 3: -location log_sys.next_checkpoint_lsn   Old value = 154809477 New value = 159699380 log_checkpoint_low (oldest_lsn=159699380, end_lsn=167367010) at /data/MDEV-33137/11.1/storage/innobase/buf/buf0flu.cc:1971 1971 log_sys.write_checkpoint(end_lsn); Current event: 1159849 At that point the oldest_modification of the page 0x100000005 was 167115222, which is exactly the same as the local debug variable recv_start_lsn in the recovery trace. The write of the page (with FIL_PAGE_LSN =159994270) had taken place when log_sys.lsn and log_sys.flushed_to_disk_lsn were both 161771086, at when 1120317 of the server process. At the time of the write completion, the buf_flush_page_cleaner was in a timed wait for the start of its next 1-second slice.

            After the last write of the undo page 0x100000005 and the server kill, we got a number of mini-transactions that had modified the page:

            LSN when operation
            167115528 1155381 trx_undo_report_row_operation
            167241876 1155458 trx_t::commit in dict_stats_save during ANALYZE TABLE
            167753178 1160455 trx_purge_free_segment in purge_truncation_callback
            167753244 1160455 trx_purge_free_segment in purge_truncation_callback
            167753310 1160455 trx_purge_free_segment in purge_truncation_callback
            167753376 1160455 trx_purge_free_segment in purge_truncation_callback
            167753508 1160455 trx_purge_truncate_rseg_history in purge_truncation_callback

            The purge_truncation_callback is something that was refactored into a separate task in MDEV-32050. Previously, this code was invoked much less frequently. The purpose of this code is to mark undo pages as freed after all history that they may contain has been purged. The rollback segment header page 0x100000005 itself was never marked as freed, and this execution is with innodb_undo_log_truncate=OFF.

            But, we need to answer the question: Where did the LSN 159993940 come from, and what is causing that mismatch? This should be our mini-transaction of interest:

            11.1 50799752dc6306c35a47405c8f556d0b9639a1a9

            #1  0x00005644c8b6d694 in mach_write_to_8 (b=0x68258010, n=159993940)
                at /data/MDEV-33137/11.1/storage/innobase/include/mach0data.inl:342
            #2  0x00005644c8b6f82a in mtr_t::commit (this=0x105506e366b0)
                at /data/MDEV-33137/11.1/storage/innobase/mtr/mtr0mtr.cc:439
            #3  0x00005644c8ca3a90 in trx_purge_free_segment (rseg_hdr=0x68201040, …)
            …
            (rr) when
            Current event: 1117280
            (rr) frame 2
            #2  0x00005644c8b6f82a in mtr_t::commit (this=0x105506e366b0)
                at /data/MDEV-33137/11.1/storage/innobase/mtr/mtr0mtr.cc:439
            439	            mach_write_to_8(bpage->frame + FIL_PAGE_LSN, m_commit_lsn);
            (rr) p/x *m_log.m_first_block.m_data@m_log.m_size
            $17 = {0x39, 0x1, 0x5, 0x17, 0xf0, 0xc4, 0x69, 0xaf, 0xf0, 0x1c, 0xb7, 0x25, 
              0xf0, 0xc4, 0x69, 0xaf, 0xf0, 0x1c, 0x34, 0x1, 0x5, 0x14, 0xf, 0x34, 0x1, 
              0x5, 0x8, 0xf, 0x46, 0x1, 0x2, 0xcf, 0x64, 0x29, 0xc6, 0x34, 0x1, 0x0, 0x93, 
              0xa, 0x2, 0x1, 0x3b, 0x35, 0x1, 0x0, 0x2e, 0xb, 0x20, 0x77, 0x1, 0x5, 0x25, 
              0x34, 0x0, 0x32, 0xb, 0x77, 0x1, 0x2, 0x6e, 0xbc, 0xf8, 0x6, 0xeb, 0x77, 
              0x1, 0x0, 0xae, 0x6e, 0xe4, 0x57, 0x8d}
            

            The 73 bytes of log records of this mini-transaction are as follows (one per line):

            bytes explanation
            0x39, 0x1, 0x5, 0x17, 0xf0, 0xc4, 0x69, 0xaf, 0xf0, 0x1c, WRITE, page 1:5
            0xb7, 0x25, 0xf0, 0xc4, 0x69, 0xaf, 0xf0, 0x1c, WRITE, same page
            0x34, 0x1, 0x5, 0x14, 0xf, WRITE, page 1:5
            0x34, 0x1, 0x5, 0x8, 0xf, WRITE, page 1:5
            0x46, 0x1, 0x2, 0xcf, 0x64, 0x29, 0xc6, MEMSET, page 1:2
            0x34, 0x1, 0x0, 0x93, 0xa, WRITE, page 1:0
            0x2, 0x1, 0x3b, FREE_PAGE, page 1:0x3b
            0x35, 0x1, 0x0, 0x2e, 0xb, 0x20, WRITE, page 1:0
            0x77, 0x1, 0x5, 0x25, 0x34, 0x0, 0x32, 0xb, OPTION, page 1:5 (MDEV-18976 checksum)
            0x77, 0x1, 0x2, 0x6e, 0xbc, 0xf8, 0x6, 0xeb, OPTION, page 1:2 (MDEV-18976 checksum)
            0x77, 0x1, 0x0, 0xae, 0x6e, 0xe4, 0x57, 0x8d OPTION, page 1:0 (MDEV-18976 checksum)

            Because we have innodb_encrypt_log=ON, the log record payload (after the record type and page identifier) is encrypted. That is why the OPTION records do not have the expected subtype byte 0 after the page number. We can observe that some WRITE logging for page 1:5 could be optimized to make use of the MDEV-12353 ‘same page’ flag.

            These log records were copied to the memory-mapped ib_logfile0 at log_sys.buf + 59342846. The circular log file had wrapped around exactly 1 time. We have log_sys.file_size = 100663296 and log_sys.first_lsn = 12288. Therefore, the start offset of the mini-transaction 59342846 should correspond to the LSN 100663296-12288+59342846 = 159993854. If this calculation is correct, the 159993940 that we saw during recovery should correspond to byte 86, which would be 13 bytes after the end of the above records. In the log_sys.buf the ‘extra’ bytes include the mini-transaction end marker. Because we have innodb_encrypt_log=ON, after the first byte (in this case 0 because the log has wrapped around an odd number of times) we have an 8-byte cryptographic nonce that is roughly the LSN at that time. Finally there is a CRC-32C checksum of the mini-transaction:
            0x0, 0x0, 0x0, 0x0, 0x0, 0x9, 0x89, 0x4f, 0xfe, 0x60, 0x9d, 0x49, 0x5d
            At this point of execution I see log_sys.lsn=159993940=0x9895054, and the nonce is a little smaller, 0x9894ffe.

            So, the end_lsn=159993940 during recovery seems to correspond to the end LSN of this mini-transaction in the log. It is also equal to the m_commit_lsn.

            After this we have further mini-transactions that modify the page:

            LSN when operation
            159994006 1117280 trx_purge_free_segment
            159994072 1117280 trx_purge_free_segment
            159994138 1117280 trx_purge_free_segment
            159994204 1117280 trx_purge_free_segment
            159994270 1117280 trx_purge_truncate_rseg_history
            167115528 1155381 trx_undo_report_row_operation during dict_stats_save for ANALYZE TABLE

            At recovery, the interesting LSN are 159993940 and 159994270.

            This only identifies the place that I need to investigate deeper, tomorrow.

            marko Marko Mäkelä added a comment - After the last write of the undo page 0x100000005 and the server kill, we got a number of mini-transactions that had modified the page: LSN when operation 167115528 1155381 trx_undo_report_row_operation 167241876 1155458 trx_t::commit in dict_stats_save during ANALYZE TABLE 167753178 1160455 trx_purge_free_segment in purge_truncation_callback 167753244 1160455 trx_purge_free_segment in purge_truncation_callback 167753310 1160455 trx_purge_free_segment in purge_truncation_callback 167753376 1160455 trx_purge_free_segment in purge_truncation_callback 167753508 1160455 trx_purge_truncate_rseg_history in purge_truncation_callback The purge_truncation_callback is something that was refactored into a separate task in MDEV-32050 . Previously, this code was invoked much less frequently. The purpose of this code is to mark undo pages as freed after all history that they may contain has been purged. The rollback segment header page 0x100000005 itself was never marked as freed, and this execution is with innodb_undo_log_truncate=OFF . But, we need to answer the question: Where did the LSN 159993940 come from, and what is causing that mismatch? This should be our mini-transaction of interest: 11.1 50799752dc6306c35a47405c8f556d0b9639a1a9 #1 0x00005644c8b6d694 in mach_write_to_8 (b=0x68258010, n=159993940) at /data/MDEV-33137/11.1/storage/innobase/include/mach0data.inl:342 #2 0x00005644c8b6f82a in mtr_t::commit (this=0x105506e366b0) at /data/MDEV-33137/11.1/storage/innobase/mtr/mtr0mtr.cc:439 #3 0x00005644c8ca3a90 in trx_purge_free_segment (rseg_hdr=0x68201040, …) … (rr) when Current event: 1117280 (rr) frame 2 #2 0x00005644c8b6f82a in mtr_t::commit (this=0x105506e366b0) at /data/MDEV-33137/11.1/storage/innobase/mtr/mtr0mtr.cc:439 439 mach_write_to_8(bpage->frame + FIL_PAGE_LSN, m_commit_lsn); (rr) p/x *m_log.m_first_block.m_data@m_log.m_size $17 = {0x39, 0x1, 0x5, 0x17, 0xf0, 0xc4, 0x69, 0xaf, 0xf0, 0x1c, 0xb7, 0x25, 0xf0, 0xc4, 0x69, 0xaf, 0xf0, 0x1c, 0x34, 0x1, 0x5, 0x14, 0xf, 0x34, 0x1, 0x5, 0x8, 0xf, 0x46, 0x1, 0x2, 0xcf, 0x64, 0x29, 0xc6, 0x34, 0x1, 0x0, 0x93, 0xa, 0x2, 0x1, 0x3b, 0x35, 0x1, 0x0, 0x2e, 0xb, 0x20, 0x77, 0x1, 0x5, 0x25, 0x34, 0x0, 0x32, 0xb, 0x77, 0x1, 0x2, 0x6e, 0xbc, 0xf8, 0x6, 0xeb, 0x77, 0x1, 0x0, 0xae, 0x6e, 0xe4, 0x57, 0x8d} The 73 bytes of log records of this mini-transaction are as follows (one per line): bytes explanation 0x39, 0x1, 0x5, 0x17, 0xf0, 0xc4, 0x69, 0xaf, 0xf0, 0x1c, WRITE, page 1:5 0xb7, 0x25, 0xf0, 0xc4, 0x69, 0xaf, 0xf0, 0x1c, WRITE, same page 0x34, 0x1, 0x5, 0x14, 0xf, WRITE, page 1:5 0x34, 0x1, 0x5, 0x8, 0xf, WRITE, page 1:5 0x46, 0x1, 0x2, 0xcf, 0x64, 0x29, 0xc6, MEMSET, page 1:2 0x34, 0x1, 0x0, 0x93, 0xa, WRITE, page 1:0 0x2, 0x1, 0x3b, FREE_PAGE, page 1:0x3b 0x35, 0x1, 0x0, 0x2e, 0xb, 0x20, WRITE, page 1:0 0x77, 0x1, 0x5, 0x25, 0x34, 0x0, 0x32, 0xb, OPTION, page 1:5 ( MDEV-18976 checksum) 0x77, 0x1, 0x2, 0x6e, 0xbc, 0xf8, 0x6, 0xeb, OPTION, page 1:2 ( MDEV-18976 checksum) 0x77, 0x1, 0x0, 0xae, 0x6e, 0xe4, 0x57, 0x8d OPTION, page 1:0 ( MDEV-18976 checksum) Because we have innodb_encrypt_log=ON , the log record payload (after the record type and page identifier) is encrypted. That is why the OPTION records do not have the expected subtype byte 0 after the page number. We can observe that some WRITE logging for page 1:5 could be optimized to make use of the MDEV-12353 ‘same page’ flag. These log records were copied to the memory-mapped ib_logfile0 at log_sys.buf + 59342846 . The circular log file had wrapped around exactly 1 time. We have log_sys.file_size = 100663296 and log_sys.first_lsn = 12288 . Therefore, the start offset of the mini-transaction 59342846 should correspond to the LSN 100663296-12288+59342846 = 159993854. If this calculation is correct, the 159993940 that we saw during recovery should correspond to byte 86, which would be 13 bytes after the end of the above records. In the log_sys.buf the ‘extra’ bytes include the mini-transaction end marker. Because we have innodb_encrypt_log=ON , after the first byte (in this case 0 because the log has wrapped around an odd number of times) we have an 8-byte cryptographic nonce that is roughly the LSN at that time. Finally there is a CRC-32C checksum of the mini-transaction: 0x0, 0x0, 0x0, 0x0, 0x0, 0x9, 0x89, 0x4f, 0xfe, 0x60, 0x9d, 0x49, 0x5d At this point of execution I see log_sys.lsn=159993940=0x9895054 , and the nonce is a little smaller, 0x9894ffe. So, the end_lsn=159993940 during recovery seems to correspond to the end LSN of this mini-transaction in the log. It is also equal to the m_commit_lsn . After this we have further mini-transactions that modify the page: LSN when operation 159994006 1117280 trx_purge_free_segment 159994072 1117280 trx_purge_free_segment 159994138 1117280 trx_purge_free_segment 159994204 1117280 trx_purge_free_segment 159994270 1117280 trx_purge_truncate_rseg_history 167115528 1155381 trx_undo_report_row_operation during dict_stats_save for ANALYZE TABLE At recovery, the interesting LSN are 159993940 and 159994270. This only identifies the place that I need to investigate deeper, tomorrow.

            A peculiarity about the mini-transaction that ends at LSN 159994006 is that the only log record for the page 1:5 is the OPTION record written by mtr_t::page_checksum() (MDEV-18976). The record was written because the block was marked as modified, even though there was no modification, here:

            11.1 50799752dc6306c35a47405c8f556d0b9639a1a9

            #0  0x00005644c8b75145 in mtr_t::set_modified (this=0x105506e366b0, block=...) at /data/MDEV-33137/11.1/storage/innobase/mtr/mtr0mtr.cc:1537
            #1  0x00005644c8a9bcea in mtr_t::memcpy_low (this=0x105506e366b0, block=..., offset=50, data=0x68258032, len=6) at /data/MDEV-33137/11.1/storage/innobase/include/mtr0log.h:322
            #2  0x00005644c8b98340 in mtr_t::memcpy (this=0x105506e366b0, b=..., offset=50, len=6) at /data/MDEV-33137/11.1/storage/innobase/include/mtr0log.h:310
            #3  0x00005644c8b9c713 in mtr_t::memcpy<(mtr_t::write_type)0> (this=0x105506e366b0, b=..., dest=0x68258032, str=0x105506e365b0, len=6) at /data/MDEV-33137/11.1/storage/innobase/include/mtr0log.h:513
            #4  0x00005644c8df7e89 in flst_write_addr (block=..., faddr=0x68258032 "\377\377\377\377", page=4294967295, boffset=0, mtr=0x105506e366b0) at /data/MDEV-33137/11.1/storage/innobase/fut/fut0lst.cc:65
            #5  0x00005644c8df9de0 in flst_remove (base=0x68201040, boffset=46, cur=0x127749001f70, coffset=120, mtr=0x105506e366b0) at /data/MDEV-33137/11.1/storage/innobase/fut/fut0lst.cc:333
            #6  0x00005644c8ca3ef4 in trx_purge_truncate_rseg_history (rseg=..., limit=..., all=true) at /data/MDEV-33137/11.1/storage/innobase/trx/trx0purge.cc:445
            #7  0x00005644c8ca47c8 in purge_sys_t::iterator::free_history (this=0x105506e36ad0) at /data/MDEV-33137/11.1/storage/innobase/trx/trx0purge.cc:555
            #8  0x00005644c8c90a1f in purge_truncation_callback () at /data/MDEV-33137/11.1/storage/innobase/srv/srv0srv.cc:1104
            

            The MDEV-18976 records have only ever been written in debug builds. But, in non-debug builds we occasionally see test failures due to unexpected messages InnoDB: The last skipped log record LSN …. So, the problem is not with mtr_t::page_checksum() but with the incorrect flagging of blocks as modified. In other words, MDEV-22110 had not been fully fixed.

            marko Marko Mäkelä added a comment - A peculiarity about the mini-transaction that ends at LSN 159994006 is that the only log record for the page 1:5 is the OPTION record written by mtr_t::page_checksum() ( MDEV-18976 ). The record was written because the block was marked as modified, even though there was no modification, here: 11.1 50799752dc6306c35a47405c8f556d0b9639a1a9 #0 0x00005644c8b75145 in mtr_t::set_modified (this=0x105506e366b0, block=...) at /data/MDEV-33137/11.1/storage/innobase/mtr/mtr0mtr.cc:1537 #1 0x00005644c8a9bcea in mtr_t::memcpy_low (this=0x105506e366b0, block=..., offset=50, data=0x68258032, len=6) at /data/MDEV-33137/11.1/storage/innobase/include/mtr0log.h:322 #2 0x00005644c8b98340 in mtr_t::memcpy (this=0x105506e366b0, b=..., offset=50, len=6) at /data/MDEV-33137/11.1/storage/innobase/include/mtr0log.h:310 #3 0x00005644c8b9c713 in mtr_t::memcpy<(mtr_t::write_type)0> (this=0x105506e366b0, b=..., dest=0x68258032, str=0x105506e365b0, len=6) at /data/MDEV-33137/11.1/storage/innobase/include/mtr0log.h:513 #4 0x00005644c8df7e89 in flst_write_addr (block=..., faddr=0x68258032 "\377\377\377\377", page=4294967295, boffset=0, mtr=0x105506e366b0) at /data/MDEV-33137/11.1/storage/innobase/fut/fut0lst.cc:65 #5 0x00005644c8df9de0 in flst_remove (base=0x68201040, boffset=46, cur=0x127749001f70, coffset=120, mtr=0x105506e366b0) at /data/MDEV-33137/11.1/storage/innobase/fut/fut0lst.cc:333 #6 0x00005644c8ca3ef4 in trx_purge_truncate_rseg_history (rseg=..., limit=..., all=true) at /data/MDEV-33137/11.1/storage/innobase/trx/trx0purge.cc:445 #7 0x00005644c8ca47c8 in purge_sys_t::iterator::free_history (this=0x105506e36ad0) at /data/MDEV-33137/11.1/storage/innobase/trx/trx0purge.cc:555 #8 0x00005644c8c90a1f in purge_truncation_callback () at /data/MDEV-33137/11.1/storage/innobase/srv/srv0srv.cc:1104 The MDEV-18976 records have only ever been written in debug builds. But, in non-debug builds we occasionally see test failures due to unexpected messages InnoDB: The last skipped log record LSN … . So, the problem is not with mtr_t::page_checksum() but with the incorrect flagging of blocks as modified. In other words, MDEV-22110 had not been fully fixed.

            I was misled by a data watchpoint. The rollback segment header is actually falsely claimed as modified in a later mini-transaction that reuses the same memory buffer:

            11.1 50799752dc6306c35a47405c8f556d0b9639a1a9

            #2  0x00005644c8b48708 in mtr_t::memo_push (this=0x105506e366b0, block=0x68201040, type=MTR_MEMO_PAGE_X_MODIFY) at /data/MDEV-33137/11.1/storage/innobase/include/mtr0mtr.h:415
            #3  0x00005644c8ca3c1a in trx_purge_free_segment (rseg_hdr=0x68201040, block=0x127749001f70, mtr=...) at /data/MDEV-33137/11.1/storage/innobase/trx/trx0purge.cc:379
            #4  0x00005644c8ca412b in trx_purge_truncate_rseg_history (rseg=..., limit=..., all=true) at /data/MDEV-33137/11.1/storage/innobase/trx/trx0purge.cc:469
            #5  0x00005644c8ca47c8 in purge_sys_t::iterator::free_history (this=0x105506e36ad0) at /data/MDEV-33137/11.1/storage/innobase/trx/trx0purge.cc:555
            #6  0x00005644c8c90a1f in purge_truncation_callback () at /data/MDEV-33137/11.1/storage/innobase/srv/srv0srv.cc:1104
            

            At the time of the next mtr_t::commit() call, the mtr_t::m_memo refers to 5 buffer pool pages, all claimed to be modified. In fact, there are log records for only 3 of the pages: 2, 0, 0x3a. The pages 5 and 0x36 were not modified. They were registered as such at the start of the mini-transaction, at the end of the first loop body of trx_purge_free_segment(). I checked for other calls of mtr_t::memo_push() that would set the MODIFY flag. In buf_page_free() it is correct, because the page will actually be modified (or marked as free in the file). All other calls look incorrect to me:

            diff --git a/storage/innobase/trx/trx0purge.cc b/storage/innobase/trx/trx0purge.cc
            index b500be548db..efbf2396991 100644
            --- a/storage/innobase/trx/trx0purge.cc
            +++ b/storage/innobase/trx/trx0purge.cc
            @@ -378,8 +378,8 @@ static void trx_purge_free_segment(buf_block_t *rseg_hdr, buf_block_t *block,
                 ut_ad(rseg_hdr->page.id() == rseg_hdr_id);
                 block->page.lock.x_lock();
                 ut_ad(block->page.id() == id);
            -    mtr.memo_push(rseg_hdr, MTR_MEMO_PAGE_X_MODIFY);
            -    mtr.memo_push(block, MTR_MEMO_PAGE_X_MODIFY);
            +    mtr.memo_push(rseg_hdr, MTR_MEMO_PAGE_X_FIX);
            +    mtr.memo_push(block, MTR_MEMO_PAGE_X_FIX);
               }
             
               while (!fseg_free_step(TRX_UNDO_SEG_HDR + TRX_UNDO_FSEG_HEADER +
            @@ -502,7 +502,7 @@ trx_purge_truncate_rseg_history(trx_rseg_t &rseg,
               mtr.start();
               rseg_hdr->page.lock.x_lock();
               ut_ad(rseg_hdr->page.id() == rseg.page_id());
            -  mtr.memo_push(rseg_hdr, MTR_MEMO_PAGE_X_MODIFY);
            +  mtr.memo_push(rseg_hdr, MTR_MEMO_PAGE_X_FIX);
             
               goto loop;
             }
            

            These calls had been added to MariaDB Server 10.6 in MDEV-32820 and MDEV-29593. Usually, a call to mtr_t::set_modified() will take care of flagging blocks as modified. I checked all calls to that; they will write log records if the page is persistent.

            marko Marko Mäkelä added a comment - I was misled by a data watchpoint. The rollback segment header is actually falsely claimed as modified in a later mini-transaction that reuses the same memory buffer: 11.1 50799752dc6306c35a47405c8f556d0b9639a1a9 #2 0x00005644c8b48708 in mtr_t::memo_push (this=0x105506e366b0, block=0x68201040, type=MTR_MEMO_PAGE_X_MODIFY) at /data/MDEV-33137/11.1/storage/innobase/include/mtr0mtr.h:415 #3 0x00005644c8ca3c1a in trx_purge_free_segment (rseg_hdr=0x68201040, block=0x127749001f70, mtr=...) at /data/MDEV-33137/11.1/storage/innobase/trx/trx0purge.cc:379 #4 0x00005644c8ca412b in trx_purge_truncate_rseg_history (rseg=..., limit=..., all=true) at /data/MDEV-33137/11.1/storage/innobase/trx/trx0purge.cc:469 #5 0x00005644c8ca47c8 in purge_sys_t::iterator::free_history (this=0x105506e36ad0) at /data/MDEV-33137/11.1/storage/innobase/trx/trx0purge.cc:555 #6 0x00005644c8c90a1f in purge_truncation_callback () at /data/MDEV-33137/11.1/storage/innobase/srv/srv0srv.cc:1104 At the time of the next mtr_t::commit() call, the mtr_t::m_memo refers to 5 buffer pool pages, all claimed to be modified. In fact, there are log records for only 3 of the pages: 2, 0, 0x3a. The pages 5 and 0x36 were not modified. They were registered as such at the start of the mini-transaction, at the end of the first loop body of trx_purge_free_segment() . I checked for other calls of mtr_t::memo_push() that would set the MODIFY flag. In buf_page_free() it is correct, because the page will actually be modified (or marked as free in the file). All other calls look incorrect to me: diff --git a/storage/innobase/trx/trx0purge.cc b/storage/innobase/trx/trx0purge.cc index b500be548db..efbf2396991 100644 --- a/storage/innobase/trx/trx0purge.cc +++ b/storage/innobase/trx/trx0purge.cc @@ -378,8 +378,8 @@ static void trx_purge_free_segment(buf_block_t *rseg_hdr, buf_block_t *block, ut_ad(rseg_hdr->page.id() == rseg_hdr_id); block->page.lock.x_lock(); ut_ad(block->page.id() == id); - mtr.memo_push(rseg_hdr, MTR_MEMO_PAGE_X_MODIFY); - mtr.memo_push(block, MTR_MEMO_PAGE_X_MODIFY); + mtr.memo_push(rseg_hdr, MTR_MEMO_PAGE_X_FIX); + mtr.memo_push(block, MTR_MEMO_PAGE_X_FIX); } while (!fseg_free_step(TRX_UNDO_SEG_HDR + TRX_UNDO_FSEG_HEADER + @@ -502,7 +502,7 @@ trx_purge_truncate_rseg_history(trx_rseg_t &rseg, mtr.start(); rseg_hdr->page.lock.x_lock(); ut_ad(rseg_hdr->page.id() == rseg.page_id()); - mtr.memo_push(rseg_hdr, MTR_MEMO_PAGE_X_MODIFY); + mtr.memo_push(rseg_hdr, MTR_MEMO_PAGE_X_FIX); goto loop; } These calls had been added to MariaDB Server 10.6 in MDEV-32820 and MDEV-29593 . Usually, a call to mtr_t::set_modified() will take care of flagging blocks as modified. I checked all calls to that; they will write log records if the page is persistent.

            elenst, can you please test the above patch on a branch where you can easily reproduce this?

            marko Marko Mäkelä added a comment - elenst , can you please test the above patch on a branch where you can easily reproduce this?
            elenst Elena Stepanova added a comment - - edited

            I couldn't reproduce the failure on 10.11 or 11.2 with the patch above in a reasonable amount of attempts, so it apparently fixes at least the reported failure.

            For the record, this is the test I used this time, both for 10.11 and 11.2:

            randgen ff3b5a8b62b5735821092c043e718930f8c116ca

            perl ./run.pl --compatibility=101199 --duration=600 --mysqld=--max-statement-time=60 --mysqld=--lock-wait-timeout=30 --mysqld=--innodb-lock-wait-timeout=15 --threads=6 --queries=1000000 --reporters=Backtrace  --mysqld=--loose-debug_assert_on_not_freed_memory=0  --gendata=simple --mysqld=--innodb-encrypt-tables --mysqld=--innodb-encrypt-log --mysqld=--innodb-encryption-threads=4 --mysqld=--plugin-load-add=file_key_management --mysqld=--file-key-management-filename=`pwd`/util/file_key_management_keys.txt  --engine=InnoDB  --gendata=simple --grammar=conf/yy/dml.yy --grammar=conf/yy/current_timestamp.yy --gendata=conf/zz/current_timestamp.zz    --grammar=conf/yy/all_selects.yy:0.0001 --scenario=AtomicDDL --filter=conf/ff/restrict_dynamic_vars.ff --variator=ExecuteAsUnion --base-port=10700 --basedir=/data/bld/11.2-patch-MDEV-33137/ --vardir=/dev/shm/var-MDEV33137-5 --seed=1701900615 --trials=15 --mysqld=--enforce-storage-engine=InnoDB
            

            Comparing to the tests in the description, it enforces InnoDB storage engine and is somewhat simpler (less DDL), to avoid irrelevant failures. This way, without the patch I got the failure in 1-2 attempts on each branch, and with the patch got a clean run of 15 attempts on each branch.

            However, it only checks that there are no crashes, assertion failures or corruption-like errors upon/after crash recovery. It doesn't verify the functionality or performance.

            elenst Elena Stepanova added a comment - - edited I couldn't reproduce the failure on 10.11 or 11.2 with the patch above in a reasonable amount of attempts, so it apparently fixes at least the reported failure. For the record, this is the test I used this time, both for 10.11 and 11.2: randgen ff3b5a8b62b5735821092c043e718930f8c116ca perl ./run.pl --compatibility=101199 --duration=600 --mysqld=--max-statement-time=60 --mysqld=--lock-wait-timeout=30 --mysqld=--innodb-lock-wait-timeout=15 --threads=6 --queries=1000000 --reporters=Backtrace --mysqld=--loose-debug_assert_on_not_freed_memory=0 --gendata=simple --mysqld=--innodb-encrypt-tables --mysqld=--innodb-encrypt-log --mysqld=--innodb-encryption-threads=4 --mysqld=--plugin-load-add=file_key_management --mysqld=--file-key-management-filename=`pwd`/util/file_key_management_keys.txt --engine=InnoDB --gendata=simple --grammar=conf/yy/dml.yy --grammar=conf/yy/current_timestamp.yy --gendata=conf/zz/current_timestamp.zz --grammar=conf/yy/all_selects.yy:0.0001 --scenario=AtomicDDL --filter=conf/ff/restrict_dynamic_vars.ff --variator=ExecuteAsUnion --base-port=10700 --basedir=/data/bld/11.2-patch-MDEV-33137/ --vardir=/dev/shm/var-MDEV33137-5 --seed=1701900615 --trials=15 --mysqld=--enforce-storage-engine=InnoDB Comparing to the tests in the description, it enforces InnoDB storage engine and is somewhat simpler (less DDL), to avoid irrelevant failures. This way, without the patch I got the failure in 1-2 attempts on each branch, and with the patch got a clean run of 15 attempts on each branch. However, it only checks that there are no crashes, assertion failures or corruption-like errors upon/after crash recovery. It doesn't verify the functionality or performance.

            Thank you, elenst. I think that the MDEV-18976 page checksum records should be rather good at catching recovery errors for DML workloads.

            marko Marko Mäkelä added a comment - Thank you, elenst . I think that the MDEV-18976 page checksum records should be rather good at catching recovery errors for DML workloads.

            1. Setting MTR_MEMO_PAGE_X_MODIFY without actually modifying the page could result in the problem. So, I agree here but we should work on a mtr test too.

            • It seems apparently harmless though for release builds

            2. There is a second issue with setting MTR_MEMO_PAGE_X_MODIFY directly without the page being modified later.

            • This could result in mtr commit to add page to flush list without taking the flush order mutex. Which could in turn result in wrong checkpoint calculation and eventual data loss / corruption .
            • So, we should absolutely remove all direct setting of MTR_MEMO_PAGE_X_MODIFY other than mtr_t::set_modified(). Perhaps somehow enforce it or at least document well.
            • Perhaps assert in mtr commit
            • Possibly add a test

            3. In purge code perhaps better/cleaner to reacquire the pages as opposed to pinning it. Usually pinning page to keep it alive is a special case and is better to be used when absolutely required. I suspect it is not going to cause any harm during purge if we get the required pages again.

            debarun Debarun Banerjee added a comment - 1. Setting MTR_MEMO_PAGE_X_MODIFY without actually modifying the page could result in the problem. So, I agree here but we should work on a mtr test too. It seems apparently harmless though for release builds 2. There is a second issue with setting MTR_MEMO_PAGE_X_MODIFY directly without the page being modified later. This could result in mtr commit to add page to flush list without taking the flush order mutex. Which could in turn result in wrong checkpoint calculation and eventual data loss / corruption . So, we should absolutely remove all direct setting of MTR_MEMO_PAGE_X_MODIFY other than mtr_t::set_modified(). Perhaps somehow enforce it or at least document well. Perhaps assert in mtr commit Possibly add a test 3. In purge code perhaps better/cleaner to reacquire the pages as opposed to pinning it. Usually pinning page to keep it alive is a special case and is better to be used when absolutely required. I suspect it is not going to cause any harm during purge if we get the required pages again.

            debarun, I mostly agree with you. For test coverage, we occasionally had some crash recovery tests failing due to this:

            10.6 aff5ed3988a3add17d607e6184288f82b567dbee

            innodb_zip.bug56680 '4k,innodb'          w9 [ fail ]  Found warnings/errors in server log file!
                    Test ended at 2023-12-17 11:57:59
            line
            2023-12-17 11:57:58 0 [Warning] InnoDB: The last skipped log record LSN 1315593 is not equal to page LSN 1315626
            2023-12-17 11:57:58 0 [Warning] InnoDB: The last skipped log record LSN 1315984 is not equal to page LSN 1316017
            2023-12-17 11:57:58 0 [Warning] InnoDB: The last skipped log record LSN 1329658 is not equal to page LSN 1329707
            ^ Found warnings in /home/buildbot/amd64-ubuntu-2204/build/mysql-test/var/9/log/mysqld.1.err
            

            I do not have any ideas how to improve the regression test suite with respect to this.

            I did think of adding an assertion to mtr_t::memo_push() that would disallow parameters that include the MODIFY flag, but I decided against it, because then buf_page_free() would have had to be changed to less optimal code.

            I prefer to buffer-fix some blocks across mini-transaction boundaries in operations that need to be split into multiple small mini-transactions. The buffer-fix will not block page writes or log checkpoints, but it will prevent page eviction. I agree that there are some dangers around buffer-fixing, like MDEV-31767 taught me. If a block may have just been read into the buffer pool, we must not only acquire a page latch, but we must also verify that the page was not evicted due to corruption (MDEV-13542). In this case, this validation was already performed when the page was first looked up by buf_page_get_gen(). The buffer-fix will ensure that an already validated block can be safely accessed in a subsequent mini-transaction, after reacquiring the page latch (which will block file I/O again).

            The benefit of the buffer-fix trick is twofold:

            • Correctness: Preventing a hard-to-test scenario that across the mini-transaction boundary, the block was written out, evicted, and read back as corrupted.
            • Performance: Being able to remove some code for attempting to handle the above scenario. Avoiding expensive operations like Block_hint::buffer_fix_block_if_still_valid() or buf_pool.page_hash lookups.

            Note: In MDEV-13542, I tried to write the corruption handling of page-freeing operations in such a way that when corruption is detected, we would stop the efforts to free further linked pages, but carry on with the higher-level operation. In other words, we’d prefer to ‘leak’ some pages (not marking them as free). This was extensively tested with some error injection (--debug-dbug=+d,intermittent_read_failure in fil_space_t::io()).

            marko Marko Mäkelä added a comment - debarun , I mostly agree with you. For test coverage, we occasionally had some crash recovery tests failing due to this: 10.6 aff5ed3988a3add17d607e6184288f82b567dbee innodb_zip.bug56680 '4k,innodb' w9 [ fail ] Found warnings/errors in server log file! Test ended at 2023-12-17 11:57:59 line 2023-12-17 11:57:58 0 [Warning] InnoDB: The last skipped log record LSN 1315593 is not equal to page LSN 1315626 2023-12-17 11:57:58 0 [Warning] InnoDB: The last skipped log record LSN 1315984 is not equal to page LSN 1316017 2023-12-17 11:57:58 0 [Warning] InnoDB: The last skipped log record LSN 1329658 is not equal to page LSN 1329707 ^ Found warnings in /home/buildbot/amd64-ubuntu-2204/build/mysql-test/var/9/log/mysqld.1.err I do not have any ideas how to improve the regression test suite with respect to this. I did think of adding an assertion to mtr_t::memo_push() that would disallow parameters that include the MODIFY flag, but I decided against it, because then buf_page_free() would have had to be changed to less optimal code. I prefer to buffer-fix some blocks across mini-transaction boundaries in operations that need to be split into multiple small mini-transactions. The buffer-fix will not block page writes or log checkpoints, but it will prevent page eviction. I agree that there are some dangers around buffer-fixing, like MDEV-31767 taught me. If a block may have just been read into the buffer pool, we must not only acquire a page latch, but we must also verify that the page was not evicted due to corruption ( MDEV-13542 ). In this case, this validation was already performed when the page was first looked up by buf_page_get_gen() . The buffer-fix will ensure that an already validated block can be safely accessed in a subsequent mini-transaction, after reacquiring the page latch (which will block file I/O again). The benefit of the buffer-fix trick is twofold: Correctness: Preventing a hard-to-test scenario that across the mini-transaction boundary, the block was written out, evicted, and read back as corrupted. Performance: Being able to remove some code for attempting to handle the above scenario. Avoiding expensive operations like Block_hint::buffer_fix_block_if_still_valid() or buf_pool.page_hash lookups. Note: In MDEV-13542 , I tried to write the corruption handling of page-freeing operations in such a way that when corruption is detected, we would stop the efforts to free further linked pages, but carry on with the higher-level operation. In other words, we’d prefer to ‘leak’ some pages (not marking them as free). This was extensively tested with some error injection ( --debug-dbug=+d,intermittent_read_failure in fil_space_t::io() ).

            marko, debarun, thank you for such informative discussion. After all, I think that page fixing is reasonable solution for this case. As for testing, we could use the same technique for setting debug sync points in purge threads as it was used in MDEV-31355 for "cursor-restore-locking.test" after I find the reason of the test instability.

            vlad.lesin Vladislav Lesin added a comment - marko , debarun , thank you for such informative discussion. After all, I think that page fixing is reasonable solution for this case. As for testing, we could use the same technique for setting debug sync points in purge threads as it was used in MDEV-31355 for "cursor-restore-locking.test" after I find the reason of the test instability.

            People

              marko Marko Mäkelä
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.