Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-30404

Inconsistent updates of PAGE_MAX_TRX_ID on ROW_FORMAT=COMPRESSED pages

Details

    Description

      origin/bb-10.6-MDEV-29835 d6e7778287be2f0be9828587d113d5a3df58369d 2023-01-11T17:34:21+02:00
      Per Marko also valid for official trees.
       
      Scenario:
      1. Start the server with --mysqld=--innodb-encrypt-log --mysqld=--innodb-encrypt-tables
           and generate some initial data.
      2. One session runs a DDL/DML mix.
      3. During 2. is ongoing kill the server
      4. Try a restart
      # 2023-01-11T15:50:00 [41620] | [rr 148592 9069]2023-01-11 15:48:45 0 [Note] InnoDB: Starting final batch to recover 1684 pages from redo log.
      # 2023-01-11T15:50:00 [41620] | [rr 148592 11804]2023-01-11 15:48:45 0 [ERROR] InnoDB: OPT_PAGE_CHECKSUM mismatch on [page id: space=46, page number=839]
      # 2023-01-11T15:50:00 [41620] | [rr 148592 11808]2023-01-11 15:48:45 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
      # 2023-01-11T15:50:00 [41620] | [rr 148592 11812]2023-01-11 15:48:45 0 [ERROR] InnoDB: Unable to apply log to corrupted page [page id: space=46, page number=839]; set innodb_force_recovery to ignore
      # 2023-01-11T15:50:00 [41620] | [rr 148592 23584]2023-01-11 15:48:49 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1496] with error Data structure corruption
       
      pluto:/data/results/1673451681/TBR-1283--TBR-1646
      _RR_TRACE_DIR=./1/rr/ rr replay --mark-stdio mysqld-1
         # Fate of the server till the intentional kill
      1/fbackup/
         # File backup of the server data directory (made direct after the kill)
      _RR_TRACE_DIR=./1/rr/ rr replay --mark-stdio
         # Fate of the server during the restart
       
      RQG
      ===
      # git clone https://github.com/mleich1/rqg --branch <pick the right branch> RQG
      #
      # GIT_SHOW: HEAD -> master, origin/master, origin/HEAD 2d66adf87b35c65c844dc849daae068d5d15c34b 2023-01-09T19:20:34+01:00
      # rqg.pl  : Version 4.2.1 (2022-12)
      #
      # $RQG_HOME/rqg.pl \
      # --grammar=conf/mariadb/innodb_compression_encryption.yy \
      # --gendata=conf/mariadb/innodb_compression_encryption.zz \
      # --max_gd_duration=1800 \
      # --mysqld=--innodb-encrypt-log \
      # --mysqld=--innodb-encrypt-tables \
      # --reporters=CrashRecovery \
      # --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
      # --mysqld=--loose-idle_write_transaction_timeout=0 \
      # --mysqld=--loose-idle_transaction_timeout=0 \
      # --mysqld=--loose-idle_readonly_transaction_timeout=0 \
      # --mysqld=--connect_timeout=60 \
      # --mysqld=--interactive_timeout=28800 \
      # --mysqld=--slave_net_timeout=60 \
      # --mysqld=--net_read_timeout=30 \
      # --mysqld=--net_write_timeout=60 \
      # --mysqld=--loose-table_lock_wait_timeout=50 \
      # --mysqld=--wait_timeout=28800 \
      # --mysqld=--lock-wait-timeout=86400 \
      # --mysqld=--innodb-lock-wait-timeout=50 \
      # --no-mask \
      # --queries=10000000 \
      # --seed=random \
      # --reporters=Backtrace \
      # --reporters=ErrorLog \
      # --reporters=Deadlock \
      # --validators=None \
      # --mysqld=--log_output=none \
      # --mysqld=--log_bin_trust_function_creators=1 \
      # --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
      # --engine=InnoDB \
      # --restart_timeout=240 \
      # --mysqld=--plugin-load-add=file_key_management.so \
      # --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
      # --mysqld=--plugin-load-add=provider_lzo.so \
      # --mysqld=--plugin-load-add=provider_bzip2.so \
      # --mysqld=--plugin-load-add=provider_lzma.so \
      # --mysqld=--plugin-load-add=provider_snappy.so \
      # --mysqld=--plugin-load-add=provider_lz4.so \
      # --duration=300 \
      # --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
      # --mysqld=--loose-innodb_read_only_compressed=OFF \
      # --mysqld=--loose-innodb-sync-debug \
      # --mysqld=--innodb_stats_persistent=off \
      # --mysqld=--innodb_adaptive_hash_index=off \
      # --mysqld=--log-bin \
      # --mysqld=--sync-binlog=1 \
      # --mysqld=--loose-innodb_evict_tables_on_commit_debug=off \
      # --mysqld=--loose-max-statement-time=30 \
      # --threads=1 \
      # --mysqld=--innodb-use-native-aio=0 \
      # --mysqld=--loose-gdb \
      # --mysqld=--loose-debug-gdb \
      # --rr=Extended \
      # --rr_options=--chaos --wait \
      # --mysqld=--innodb_undo_tablespaces=3 \
      # --mysqld=--innodb_undo_log_truncate=ON \
      # --mysqld=--loose_innodb_change_buffering=inserts \
      # --mysqld=--innodb_rollback_on_timeout=OFF \
      # --mysqld=--innodb_page_size=16K \
      # --mysqld=--innodb-buffer-pool-size=256M \
      # <local settings>
      

      Attachments

        Issue Links

          Activity

            A write of PAGE_MAX_TRX_ID=0x2e0 is wrongly being recovered as 0x1e0 for a ROW_FORMAT=COMPRESSED page.

            ssh pluto
            rr replay /data/results/1673451681/TBR-1283--TBR-1646/1/rr/mysqld-1
            

            run 1272803
            watch -l log_sys.lsn._M_i
            break mtr_t::page_checksum
            continue
            

            The OPT_PAGE_CHECKSUM record was written by mtr_t::page_checksum() at log_sys.lsn=96928241 and when 1272804. With the following commands, we can see the last time the page was modified:

            watch -l page[0x38]@8
            reverse-continue
            backtrace
            

            #2  page_set_max_trx_id (block=block@entry=0x507405322e10, 
                page_zip=page_zip@entry=0x507405322e58, trx_id=trx_id@entry=736, 
                mtr=mtr@entry=0x44212d0b3130)
                at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/page/page0page.cc:191
            #3  0x0000564610346c44 in page_update_max_trx_id (
                block=block@entry=0x507405322e10, page_zip=0x507405322e58, trx_id=736, 
                mtr=mtr@entry=0x44212d0b3130)
                at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/page0page.inl:52
            #4  0x000056461036c429 in lock_sec_rec_modify_check_and_lock (
                flags=flags@entry=0, block=<optimized out>, 
                rec=rec@entry=0x507406d0478c "S,", index=index@entry=0x61600603a608, 
                thr=thr@entry=0x62500572b3c8, mtr=mtr@entry=0x44212d0b3130)
                at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/lock/lock0lock.cc:5460 
            

            It remains to be seen whether this only affects ROW_FORMAT=COMPRESSED tables.

            marko Marko Mäkelä added a comment - A write of PAGE_MAX_TRX_ID=0x2e0 is wrongly being recovered as 0x1e0 for a ROW_FORMAT=COMPRESSED page. ssh pluto rr replay /data/results/1673451681/TBR-1283--TBR-1646/1/rr/mysqld-1 run 1272803 watch -l log_sys.lsn._M_i break mtr_t::page_checksum continue The OPT_PAGE_CHECKSUM record was written by mtr_t::page_checksum() at log_sys.lsn=96928241 and when 1272804. With the following commands, we can see the last time the page was modified: watch -l page[0x38]@8 reverse-continue backtrace #2 page_set_max_trx_id (block=block@entry=0x507405322e10, page_zip=page_zip@entry=0x507405322e58, trx_id=trx_id@entry=736, mtr=mtr@entry=0x44212d0b3130) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/page/page0page.cc:191 #3 0x0000564610346c44 in page_update_max_trx_id ( block=block@entry=0x507405322e10, page_zip=0x507405322e58, trx_id=736, mtr=mtr@entry=0x44212d0b3130) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/page0page.inl:52 #4 0x000056461036c429 in lock_sec_rec_modify_check_and_lock ( flags=flags@entry=0, block=<optimized out>, rec=rec@entry=0x507406d0478c "S,", index=index@entry=0x61600603a608, thr=thr@entry=0x62500572b3c8, mtr=mtr@entry=0x44212d0b3130) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/lock/lock0lock.cc:5460 It remains to be seen whether this only affects ROW_FORMAT=COMPRESSED tables.

            The recovery of the byte 0x1 (instead of 0x2) happens here:

            #2  0x00005563d4fc9c81 in log_phys_t::apply (this=this@entry=0x640000446df0, 
                block=..., last_offset=@0x6060000719ec: 8)
                at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/log/log0recv.cc:498
            498	          memcpy(frame + last_offset, l, llen);
            (rr) p *this
            $5 = {<log_rec_t> = {next = 0x640000447960, lsn = 74021668}, 
              start_lsn = 74017249}
            

            The mini-transaction is a WRITE record with 2888 (0xb48) bytes of data payload, to be written to page offset 8 (right after the FIL_PAGE_OFFSET):

            0x30, 0x8a, 0xbf, 0x2e, 0x82, 0xc7, 0x8,
            0x0, 0x0, 0x3, 0x46, 0x0, 0x0, 0x3, 0x49, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x45, 0xbf, …
            

            After this write, the PAGE_MAX_TRX_ID became 0x129. The write of 0x2e0 happened much later, around LSN=96928241. That is the next write to the location on recovery:

            #2  0x00005563d4fc9c81 in log_phys_t::apply (this=this@entry=0x6400020717d8, block=..., last_offset=@0x6060000719ec: 63) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/log/log0recv.cc:498
            498	          memcpy(frame + last_offset, l, llen);
            (rr) p *this
            $15 = {<log_rec_t> = {next = 0x640002071800, lsn = 96928281}, start_lsn = 96928241}
            

            The records of the mini-transaction are as follows:
            0x35, 0x2e, 0x82, 0xc7, 0x3f, 0xe0: WRITE the byte 0xe0 to page 0x2e:0x347, at offset 0x3f (FIL_PAGE_DATA+PAGE_MAX_TRX_ID+7)
            0xb3, 0x9e, 0x28, 0xc7: WRITE the byte 0xc7 to the same page, at byte offset 0x40+0x1ea8
            The first record appears to have been written incorrectly. It turns out that the PAGE_MAX_TRX_ID had been previously updated to 0x2be, but no log had been written for it. The update to
            0x2e0 was logged as 1 byte, because the 7 most significant bytes would not change at that time. We need to check when the PAGE_MAX_TRX_ID was changed from 0x1e0 and why no log was written for it. That happened during the execution of

            ALTER TABLE `t8` ROW_FORMAT=COMPRESSED ENCRYPTED=YES ENCRYPTION_KEY_ID=5 /* E_R Thread1 QNO 147 CON_ID 19 */;
            

            with the following stack trace:

            #0  page_set_max_trx_id (block=block@entry=0x507405322e10, 
                page_zip=page_zip@entry=0x507405322e58, trx_id=trx_id@entry=297, 
                mtr=mtr@entry=0x619000d4c690)
                at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/page/page0page.cc:182
            #1  0x00005646107d6024 in page_update_max_trx_id (mtr=0x619000d4c690, 
                trx_id=297, page_zip=0x507405322e58, block=0x507405322e10)
                at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/page0page.inl:52
            #2  PageBulk::init (this=this@entry=0x619000d4c680)
                at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/btr/btr0bulk.cc:126
            #3  0x00005646107d902a in BtrBulk::insert (this=this@entry=0x44212d0aef70, 
                tuple=tuple@entry=0x6310031b08b0, level=level@entry=0)
                at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/btr/btr0bulk.cc:1077
            #4  0x000056461057fd53 in BtrBulk::insert (tuple=0x6310031b08b0, 
                this=0x44212d0aef70)
                at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/btr0bulk.h:301
            #5  row_merge_insert_index_tuples (index=0x61600603a608, 
            

            This mini-transaction was committed at around log_sys.lsn=74017249, with the following log:

            (rr) p/x *m_log.m_first_block.m_data@134
            $16 = {0x13, 0x2e, 0x82, 0xc7, 0x13, 0x2e, 0x82, 0xc7, 0x30, 0x5c, 0x2e, 0x82, 
              0xc7, 0x8, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 
              0x0, 0x0, 0x0, 0x0, 0x0, 0x45, 0xbf, 0x0 <repeats 11 times>, 0x2e, 0x0, 0x2, 
              0x0, 0x78, 0x80, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x5, 
              0x0 <repeats 21 times>, 0x5f, 0x0 <repeats 20 times>, 0x68, 0x81, 0x12, 
              0x64, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0x3, 0x0, 0x0, 0x24, 0x0, 0x12, 0x36, 
              0x2e, 0x82, 0xc7, 0x3e, 0x1, 0x29, 0x13, 0x2e, 0x82, 0xc7, 0x30, 0x8a, 0xbf, 
              0x2e, 0x82, 0xc7, 0x8}
            

            The first record is an INIT_PAGE of the page 0x2e:0x347. At this point of time, log_sys.last_checkpoint_lsn=log_sys.next_checkpoint_lsn=72135592. This looks like a page flushing bug:

            2023-01-11 15:48:41 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=72135592,72136847
            

            Finally, I found the race condition:

            #0  buf_page_t::clear_oldest_modification (this=0x507405322e10) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/buf0buf.h:2017
            #1  buf_pool_t::delete_from_flush_list (this=this@entry=0x5646121ca700 <buf_pool>, bpage=bpage@entry=0x507405322e10, clear=clear@entry=true) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/buf/buf0flu.cc:207
            #2  0x00005646108cc2c5 in buf_pool_t::delete_from_flush_list (bpage=0x507405322e10, this=<optimized out>) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/buf0buf.h:1873
            #3  buf_flush_discard_page (bpage=bpage@entry=0x507405322e10) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/buf/buf0flu.cc:1206
            #4  0x00005646108d4c1b in buf_do_flush_list_batch (max_n=max_n@entry=2000, lsn=lsn@entry=70815685) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/buf/buf0flu.cc:1447
            #5  0x00005646108d56eb in buf_flush_list (max_n=2000, lsn=lsn@entry=70815685) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/buf/buf0flu.cc:1532
            

            This thread is marking the page clean, even though the ALTER TABLE thread at the same time is holding an exclusive page latch:

            #3  0x00005646108c9a85 in inline_mysql_mutex_lock (src_line=179, src_file=0x564611673de0 "/data/Server/bb-10.6-MDEV-29835F/storage/innobase/buf/buf0flu.cc", that=0x5646121cea80 <buf_pool+17280>) at /data/Server/bb-10.6-MDEV-29835F/include/mysql/psi/mysql_thread.h:750
            #4  buf_pool_t::insert_into_flush_list (this=0x5646121ca700 <buf_pool>, block=block@entry=0x5074053c46a0, lsn=lsn@entry=71662004) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/buf/buf0flu.cc:179
            #5  0x00005646103dd8bf in buf_flush_note_modification (b=0x5074053c46a0, start=71662004, end=<optimized out>) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/buf0flu.h:129
            #6  0x000056461041d8b3 in mtr_t::commit (this=this@entry=0x619001dd2690) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/mtr/mtr0mtr.cc:194
            #7  0x00005646107d75ea in PageBulk::commit (this=this@entry=0x619001dd2680, succ--Type <RET> for more, q to quit, c to continue without paging--
            ess=success@entry=true) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/btr/btr0bulk.cc:569
            #8  0x00005646107db710 in BtrBulk::pageCommit (this=this@entry=0x44212d0aef70, page_bulk=page_bulk@entry=0x619001dd2680, next_page_bulk=next_page_bulk@entry=0x619000d62480, insert_father=insert_father@entry=true) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/btr/btr0bulk.cc:964
            #9  0x00005646107d9101 in BtrBulk::insert (this=this@entry=0x44212d0aef70, tuple=tuple@entry=0x6310002f88b0, level=level@entry=0) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/btr/btr0bulk.cc:1084
            #10 0x000056461057fd53 in BtrBulk::insert (…) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/row/row0merge.cc:3672
            #12 0x0000564610592457 in row_merge_read_clustered_index…
            

            This recovery bug ought to affect 10.5 already, due to MDEV-15528. I think that this could be reproduced with a DML-only workload (say, INSERT and ROLLBACK of lots of BLOB data).

            marko Marko Mäkelä added a comment - The recovery of the byte 0x1 (instead of 0x2) happens here: #2 0x00005563d4fc9c81 in log_phys_t::apply (this=this@entry=0x640000446df0, block=..., last_offset=@0x6060000719ec: 8) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/log/log0recv.cc:498 498 memcpy(frame + last_offset, l, llen); (rr) p *this $5 = {<log_rec_t> = {next = 0x640000447960, lsn = 74021668}, start_lsn = 74017249} The mini-transaction is a WRITE record with 2888 (0xb48) bytes of data payload, to be written to page offset 8 (right after the FIL_PAGE_OFFSET ): 0x30, 0x8a, 0xbf, 0x2e, 0x82, 0xc7, 0x8, 0x0, 0x0, 0x3, 0x46, 0x0, 0x0, 0x3, 0x49, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x45, 0xbf, … After this write, the PAGE_MAX_TRX_ID became 0x129. The write of 0x2e0 happened much later, around LSN=96928241. That is the next write to the location on recovery: #2 0x00005563d4fc9c81 in log_phys_t::apply (this=this@entry=0x6400020717d8, block=..., last_offset=@0x6060000719ec: 63) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/log/log0recv.cc:498 498 memcpy(frame + last_offset, l, llen); (rr) p *this $15 = {<log_rec_t> = {next = 0x640002071800, lsn = 96928281}, start_lsn = 96928241} The records of the mini-transaction are as follows: 0x35, 0x2e, 0x82, 0xc7, 0x3f, 0xe0 : WRITE the byte 0xe0 to page 0x2e:0x347, at offset 0x3f ( FIL_PAGE_DATA+PAGE_MAX_TRX_ID+7 ) 0xb3, 0x9e, 0x28, 0xc7 : WRITE the byte 0xc7 to the same page, at byte offset 0x40+ 0x1ea8 The first record appears to have been written incorrectly. It turns out that the PAGE_MAX_TRX_ID had been previously updated to 0x2be , but no log had been written for it. The update to 0x2e0 was logged as 1 byte, because the 7 most significant bytes would not change at that time. We need to check when the PAGE_MAX_TRX_ID was changed from 0x1e0 and why no log was written for it. That happened during the execution of ALTER TABLE `t8` ROW_FORMAT=COMPRESSED ENCRYPTED=YES ENCRYPTION_KEY_ID=5 /* E_R Thread1 QNO 147 CON_ID 19 */ ; with the following stack trace: #0 page_set_max_trx_id (block=block@entry=0x507405322e10, page_zip=page_zip@entry=0x507405322e58, trx_id=trx_id@entry=297, mtr=mtr@entry=0x619000d4c690) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/page/page0page.cc:182 #1 0x00005646107d6024 in page_update_max_trx_id (mtr=0x619000d4c690, trx_id=297, page_zip=0x507405322e58, block=0x507405322e10) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/page0page.inl:52 #2 PageBulk::init (this=this@entry=0x619000d4c680) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/btr/btr0bulk.cc:126 #3 0x00005646107d902a in BtrBulk::insert (this=this@entry=0x44212d0aef70, tuple=tuple@entry=0x6310031b08b0, level=level@entry=0) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/btr/btr0bulk.cc:1077 #4 0x000056461057fd53 in BtrBulk::insert (tuple=0x6310031b08b0, this=0x44212d0aef70) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/btr0bulk.h:301 #5 row_merge_insert_index_tuples (index=0x61600603a608, This mini-transaction was committed at around log_sys.lsn =74017249, with the following log: (rr) p/x *m_log.m_first_block.m_data@134 $16 = {0x13, 0x2e, 0x82, 0xc7, 0x13, 0x2e, 0x82, 0xc7, 0x30, 0x5c, 0x2e, 0x82, 0xc7, 0x8, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x45, 0xbf, 0x0 <repeats 11 times>, 0x2e, 0x0, 0x2, 0x0, 0x78, 0x80, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x5, 0x0 <repeats 21 times>, 0x5f, 0x0 <repeats 20 times>, 0x68, 0x81, 0x12, 0x64, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0x3, 0x0, 0x0, 0x24, 0x0, 0x12, 0x36, 0x2e, 0x82, 0xc7, 0x3e, 0x1, 0x29, 0x13, 0x2e, 0x82, 0xc7, 0x30, 0x8a, 0xbf, 0x2e, 0x82, 0xc7, 0x8} The first record is an INIT_PAGE of the page 0x2e:0x347 . At this point of time, log_sys.last_checkpoint_lsn=log_sys.next_checkpoint_lsn=72135592 . This looks like a page flushing bug: 2023-01-11 15:48:41 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=72135592,72136847 Finally, I found the race condition: #0 buf_page_t::clear_oldest_modification (this=0x507405322e10) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/buf0buf.h:2017 #1 buf_pool_t::delete_from_flush_list (this=this@entry=0x5646121ca700 <buf_pool>, bpage=bpage@entry=0x507405322e10, clear=clear@entry=true) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/buf/buf0flu.cc:207 #2 0x00005646108cc2c5 in buf_pool_t::delete_from_flush_list (bpage=0x507405322e10, this=<optimized out>) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/buf0buf.h:1873 #3 buf_flush_discard_page (bpage=bpage@entry=0x507405322e10) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/buf/buf0flu.cc:1206 #4 0x00005646108d4c1b in buf_do_flush_list_batch (max_n=max_n@entry=2000, lsn=lsn@entry=70815685) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/buf/buf0flu.cc:1447 #5 0x00005646108d56eb in buf_flush_list (max_n=2000, lsn=lsn@entry=70815685) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/buf/buf0flu.cc:1532 This thread is marking the page clean, even though the ALTER TABLE thread at the same time is holding an exclusive page latch: #3 0x00005646108c9a85 in inline_mysql_mutex_lock (src_line=179, src_file=0x564611673de0 "/data/Server/bb-10.6-MDEV-29835F/storage/innobase/buf/buf0flu.cc", that=0x5646121cea80 <buf_pool+17280>) at /data/Server/bb-10.6-MDEV-29835F/include/mysql/psi/mysql_thread.h:750 #4 buf_pool_t::insert_into_flush_list (this=0x5646121ca700 <buf_pool>, block=block@entry=0x5074053c46a0, lsn=lsn@entry=71662004) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/buf/buf0flu.cc:179 #5 0x00005646103dd8bf in buf_flush_note_modification (b=0x5074053c46a0, start=71662004, end=<optimized out>) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/buf0flu.h:129 #6 0x000056461041d8b3 in mtr_t::commit (this=this@entry=0x619001dd2690) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/mtr/mtr0mtr.cc:194 #7 0x00005646107d75ea in PageBulk::commit (this=this@entry=0x619001dd2680, succ--Type <RET> for more, q to quit, c to continue without paging-- ess=success@entry=true) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/btr/btr0bulk.cc:569 #8 0x00005646107db710 in BtrBulk::pageCommit (this=this@entry=0x44212d0aef70, page_bulk=page_bulk@entry=0x619001dd2680, next_page_bulk=next_page_bulk@entry=0x619000d62480, insert_father=insert_father@entry=true) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/btr/btr0bulk.cc:964 #9 0x00005646107d9101 in BtrBulk::insert (this=this@entry=0x44212d0aef70, tuple=tuple@entry=0x6310002f88b0, level=level@entry=0) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/btr/btr0bulk.cc:1084 #10 0x000056461057fd53 in BtrBulk::insert (…) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/row/row0merge.cc:3672 #12 0x0000564610592457 in row_merge_read_clustered_index… This recovery bug ought to affect 10.5 already, due to MDEV-15528 . I think that this could be reproduced with a DML-only workload (say, INSERT and ROLLBACK of lots of BLOB data).

            A little more detail:

            set $b=(buf_block_t*)0x507405322e10
            watch -l $b.page.oldest_modification_
            run 903820
            continue
            watch -l $b.page.lock.recursive
            reverse-continue
            continue
            continue
            …
            

            With this, we can see that buf_flush_discard_page() is properly protected by a page U-latch. The PageBulk::commit() arrives much later, and it is holding a page X latch, as expected. The problem might be on the recovery side, and not on the write side at all. I need to diagnose this further to be able to fix this.

            marko Marko Mäkelä added a comment - A little more detail: set $b=(buf_block_t*)0x507405322e10 watch -l $b.page.oldest_modification_ run 903820 continue watch -l $b.page.lock.recursive reverse-continue continue continue … With this, we can see that buf_flush_discard_page() is properly protected by a page U-latch. The PageBulk::commit() arrives much later, and it is holding a page X latch, as expected. The problem might be on the recovery side, and not on the write side at all. I need to diagnose this further to be able to fix this.

            This failure is specific to ROW_FORMAT=COMPRESSED. Here is some more context of the non-logged update of PAGE_MAX_TRX_ID:

            #0  mtr_t::set_modified (block=..., this=0x36330b335af0) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/mtr0mtr.h:338
            #1  mtr_t::memcpy_low (this=0x36330b335af0, block=..., offset=<optimized out>, data=<optimized out>, len=<optimized out>) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/mtr0log.h:321
            #2  0x00005646104855ce in mtr_t::write<8u, (mtr_t::write_type)0, unsigned long> (this=this@entry=0x36330b335af0, block=..., ptr=ptr@entry=0x507406d04038, val=val@entry=702)
                at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/mtr0log.h:202
            #3  0x000056461047013c in page_set_max_trx_id (block=block@entry=0x507405322e10, page_zip=page_zip@entry=0x507405322e58, trx_id=trx_id@entry=702, mtr=mtr@entry=0x36330b335af0)
                at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/page/page0page.cc:189
            #4  0x00005646104705b9 in page_update_max_trx_id (block=block@entry=0x507405322e10, page_zip=page_zip@entry=0x507405322e58, trx_id=702, mtr=mtr@entry=0x36330b335af0)
                at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/page0page.inl:52
            #5  0x0000564610483754 in page_copy_rec_list_start (new_block=0x507405322e10, block=block@entry=0x507405322ba0, rec=<optimized out>, index=index@entry=0x61600603a608, mtr=mtr@entry=0x36330b335af0, 
                err=err@entry=0x36330b334b20) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/page/page0page.cc:791
            #6  0x00005646107bd920 in btr_compress (cursor=cursor@entry=0x36330b335810, adjust=adjust@entry=false, mtr=mtr@entry=0x36330b335af0) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/btr/btr0btr.cc:3756
            #7  0x00005646107fee69 in btr_cur_compress_if_useful (cursor=cursor@entry=0x36330b335810, adjust=adjust@entry=false, mtr=mtr@entry=0x36330b335af0)
                at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/btr0cur.h:781
            #8  0x000056461081f699 in btr_cur_pessimistic_delete (err=err@entry=0x36330b335760, has_reserved_extents=has_reserved_extents@entry=0, cursor=cursor@entry=0x36330b335810, flags=flags@entry=0, 
                rollback=rollback@entry=false, mtr=mtr@entry=0x36330b335af0) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/btr/btr0cur.cc:5507
            #9  0x00005646105fc682 in row_purge_remove_sec_if_poss_tree (node=node@entry=0x61a00000a908, index=index@entry=0x61600603a608, entry=entry@entry=0x61900012c608)
                at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/row/row0purge.cc:410
            

            In mtr_t::set_modified() we will not mark the page as modified, because m_log_mode==MTR_LOG_NONE. The purpose of that mode is to suppress log writes for ROW_FORMAT=COMPRESSED pages, in case we would run out of space and the change would have to be rolled back. Apparently, exactly that happens, because there will be a failing call to page_zip_reorganize() during page_copy_rec_list_start(), which was called by btr_compress(). In the end, at the time of mtr_t::commit(), the mini-transaction will have flagged only the page 0x2e:0x349 as modified, and only written log records for it. The intention was to revert any changes to our page 0x2e:0x347. But, the change to PAGE_MAX_TRX_ID was neither reverted nor logged!

            As far as I can tell, this recovery and backup bug can break locking and MVCC on secondary indexes of ROW_FORMAT=COMPRESSED tables, and it should affect all MySQL and MariaDB versions that support ROW_FORMAT=COMPRESSED for InnoDB tables. It may take some effort to fix this and to test the fix.

            marko Marko Mäkelä added a comment - This failure is specific to ROW_FORMAT=COMPRESSED . Here is some more context of the non-logged update of PAGE_MAX_TRX_ID : #0 mtr_t::set_modified (block=..., this=0x36330b335af0) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/mtr0mtr.h:338 #1 mtr_t::memcpy_low (this=0x36330b335af0, block=..., offset=<optimized out>, data=<optimized out>, len=<optimized out>) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/mtr0log.h:321 #2 0x00005646104855ce in mtr_t::write<8u, (mtr_t::write_type)0, unsigned long> (this=this@entry=0x36330b335af0, block=..., ptr=ptr@entry=0x507406d04038, val=val@entry=702) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/mtr0log.h:202 #3 0x000056461047013c in page_set_max_trx_id (block=block@entry=0x507405322e10, page_zip=page_zip@entry=0x507405322e58, trx_id=trx_id@entry=702, mtr=mtr@entry=0x36330b335af0) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/page/page0page.cc:189 #4 0x00005646104705b9 in page_update_max_trx_id (block=block@entry=0x507405322e10, page_zip=page_zip@entry=0x507405322e58, trx_id=702, mtr=mtr@entry=0x36330b335af0) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/page0page.inl:52 #5 0x0000564610483754 in page_copy_rec_list_start (new_block=0x507405322e10, block=block@entry=0x507405322ba0, rec=<optimized out>, index=index@entry=0x61600603a608, mtr=mtr@entry=0x36330b335af0, err=err@entry=0x36330b334b20) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/page/page0page.cc:791 #6 0x00005646107bd920 in btr_compress (cursor=cursor@entry=0x36330b335810, adjust=adjust@entry=false, mtr=mtr@entry=0x36330b335af0) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/btr/btr0btr.cc:3756 #7 0x00005646107fee69 in btr_cur_compress_if_useful (cursor=cursor@entry=0x36330b335810, adjust=adjust@entry=false, mtr=mtr@entry=0x36330b335af0) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/include/btr0cur.h:781 #8 0x000056461081f699 in btr_cur_pessimistic_delete (err=err@entry=0x36330b335760, has_reserved_extents=has_reserved_extents@entry=0, cursor=cursor@entry=0x36330b335810, flags=flags@entry=0, rollback=rollback@entry=false, mtr=mtr@entry=0x36330b335af0) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/btr/btr0cur.cc:5507 #9 0x00005646105fc682 in row_purge_remove_sec_if_poss_tree (node=node@entry=0x61a00000a908, index=index@entry=0x61600603a608, entry=entry@entry=0x61900012c608) at /data/Server/bb-10.6-MDEV-29835F/storage/innobase/row/row0purge.cc:410 In mtr_t::set_modified() we will not mark the page as modified, because m_log_mode==MTR_LOG_NONE . The purpose of that mode is to suppress log writes for ROW_FORMAT=COMPRESSED pages, in case we would run out of space and the change would have to be rolled back. Apparently, exactly that happens, because there will be a failing call to page_zip_reorganize() during page_copy_rec_list_start() , which was called by btr_compress() . In the end, at the time of mtr_t::commit() , the mini-transaction will have flagged only the page 0x2e:0x349 as modified, and only written log records for it. The intention was to revert any changes to our page 0x2e:0x347. But, the change to PAGE_MAX_TRX_ID was neither reverted nor logged! As far as I can tell, this recovery and backup bug can break locking and MVCC on secondary indexes of ROW_FORMAT=COMPRESSED tables, and it should affect all MySQL and MariaDB versions that support ROW_FORMAT=COMPRESSED for InnoDB tables. It may take some effort to fix this and to test the fix.

            We were able to catch this bug thanks to extra checksums (MDEV-18976) being written in debug builds.

            Before the log record format was changed in MDEV-12353 for MariaDB Server 10.5, updates of PAGE_MAX_TRX_ID should always have been logged entirely, and thus the impact of this bug could be slightly different: there would never be a completely bogus PAGE_MAX_TRX_ID present in data files. But, we could entirely omit updates of PAGE_MAX_TRX_ID if an unlogged update of the field is present in the buffer pool. Here is the relevant piece of code from page_update_max_trx_id():

            	if (page_get_max_trx_id(buf_block_get_frame(block)) < trx_id) {
             
            		page_set_max_trx_id(block, page_zip, trx_id, mtr);
            	}
            

            marko Marko Mäkelä added a comment - We were able to catch this bug thanks to extra checksums ( MDEV-18976 ) being written in debug builds. Before the log record format was changed in MDEV-12353 for MariaDB Server 10.5, updates of PAGE_MAX_TRX_ID should always have been logged entirely, and thus the impact of this bug could be slightly different: there would never be a completely bogus PAGE_MAX_TRX_ID present in data files. But, we could entirely omit updates of PAGE_MAX_TRX_ID if an unlogged update of the field is present in the buffer pool. Here is the relevant piece of code from page_update_max_trx_id() : if (page_get_max_trx_id(buf_block_get_frame(block)) < trx_id) {   page_set_max_trx_id(block, page_zip, trx_id, mtr); }

            As far as I can tell, this bug was introduced in MDEV-21174, which was preparation for MDEV-12353. Possible fix:

            diff --git a/storage/innobase/page/page0page.cc b/storage/innobase/page/page0page.cc
            index 1aab3f432da..10eba51cb2a 100644
            --- a/storage/innobase/page/page0page.cc
            +++ b/storage/innobase/page/page0page.cc
            @@ -789,7 +789,7 @@ page_copy_rec_list_start(
             	if (n_core && dict_index_is_sec_or_ibuf(index)
             	    && !index->table->is_temporary()) {
             		page_update_max_trx_id(new_block,
            -				       new_page_zip,
            +				       nullptr,
             				       page_get_max_trx_id(block->page.frame),
             				       mtr);
             	}
            

            Any other user of MTR_LOG_NONE is not updating PAGE_MAX_TRX_ID. There is a call to page_zip_compress() right after this. That would take care of propagating the provisional change from the uncompressed page to the compressed page and the log. In case compression fails, page_zip_decompress() will restore the uncompressed page.

            marko Marko Mäkelä added a comment - As far as I can tell, this bug was introduced in MDEV-21174 , which was preparation for MDEV-12353 . Possible fix: diff --git a/storage/innobase/page/page0page.cc b/storage/innobase/page/page0page.cc index 1aab3f432da..10eba51cb2a 100644 --- a/storage/innobase/page/page0page.cc +++ b/storage/innobase/page/page0page.cc @@ -789,7 +789,7 @@ page_copy_rec_list_start( if (n_core && dict_index_is_sec_or_ibuf(index) && !index->table->is_temporary()) { page_update_max_trx_id(new_block, - new_page_zip, + nullptr, page_get_max_trx_id(block->page.frame), mtr); } Any other user of MTR_LOG_NONE is not updating PAGE_MAX_TRX_ID . There is a call to page_zip_compress() right after this. That would take care of propagating the provisional change from the uncompressed page to the compressed page and the log. In case compression fails, page_zip_decompress() will restore the uncompressed page.
            marko Marko Mäkelä added a comment - - edited

            There may exist other reasons for OPT_PAGE_CHECKSUM mismatch. In MDEV-30479, it is related to innodb_undo_log_truncate=ON.

            marko Marko Mäkelä added a comment - - edited There may exist other reasons for OPT_PAGE_CHECKSUM mismatch. In MDEV-30479 , it is related to innodb_undo_log_truncate=ON .

            People

              marko Marko Mäkelä
              mleich Matthias Leich
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.