Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-26001

InnoDB: Cannot apply log to [page id: space=152, page number=0] of corrupted file

Details

    Description

      We got a recovery failure due to MDEV-24626:

      10.6 3a566de22db6990faf213f483a722645c1ad8b9e

      2021-06-23 17:19:30 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=161810155,161810155
      2021-06-23 17:19:30 0 [ERROR] InnoDB: Cannot apply log to [page id: space=152, page number=0] of corrupted file './test/H.ibd'
      2021-06-23 17:19:30 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1447] with error Data structure corruption
      

      To retrieve the data (due to large redo log file, it is too large to be attached here):

      ssh sdp 'tar cjf - -C /data/Results/1624453151/Marko1/dev/shm/vardir/1624453151/163/1 data' | tar xjf -
      

      The problem is around here:

      diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
      index 16250928845..3e602297363 100644
      --- a/storage/innobase/log/log0recv.cc
      +++ b/storage/innobase/log/log0recv.cc
      @@ -806,8 +806,10 @@ bool recv_sys_t::recover_deferred(recv_sys_t::map::iterator &p,
         recv_spaces_t::iterator it{recv_spaces.find(first.space())};
         ut_ad(it != recv_spaces.end());
       
      -  if (!first.page_no() && p->second.state == page_recv_t::RECV_WILL_NOT_READ)
      +  if (!first.page_no())
         {
      +    ut_ad(p->second.state == page_recv_t::RECV_WILL_NOT_READ ||
      +          p->second.state == page_recv_t::RECV_NOT_PROCESSED);
           mtr_t mtr;
           buf_block_t *block= recover_low(first, p, mtr, free_block);
           ut_ad(block == free_block);
      

      The state is not RECV_WILL_NOT_READ, as expected. The RECV_NOT_PROCESSED was inserted here:

      #0  recv_sys_t::add (this=0x564e77653838 <recv_sys>, page_id={m_id = 652835028992}, start_lsn=161999632, lsn=161999940, l=0x7f90f08b2d05 "8\200\030", len=9)
          at /mariadb/10.6-merge/storage/innobase/log/log0recv.cc:1974
      #1  0x0000564e7684a73e in recv_sys_t::parse (this=0x564e77653838 <recv_sys>, checkpoint_lsn=161810155, store=0x7ffe6fbb46fc, apply=true) at /mariadb/10.6-merge/storage/innobase/log/log0recv.cc:2393
      #2  0x0000564e7685b186 in recv_scan_log_recs (store=0x7ffe6fbb46fc, log_block=0x7f90f0a96000 '\245' <repeats 200 times>..., checkpoint_lsn=161810155, start_lsn=161940992, end_lsn=162006528, 
          contiguous_lsn=0x7ffe6fbb4a28, group_scanned_lsn=0x564e77f23208 <log_sys+520>) at /mariadb/10.6-merge/storage/innobase/log/log0recv.cc:3460
      #3  0x0000564e76853608 in recv_group_scan_log_recs (checkpoint_lsn=161810155, contiguous_lsn=0x7ffe6fbb4a28, last_phase=false) at /mariadb/10.6-merge/storage/innobase/log/log0recv.cc:3543
      #4  0x0000564e768520ac in recv_recovery_from_checkpoint_start (flush_lsn=41238) at /mariadb/10.6-merge/storage/innobase/log/log0recv.cc:3946
      

      Attachments

        Issue Links

          Activity

            There was an incorrect change to os_aio_wait_until_no_pending_writes() in a development version of MDEV-25954, which caused the test innodb.instant_alter_import to fail. We will still have to ensure that there is no doublewrite batch pending; otherwise os_aio_wait_until_no_pending_writes() would not guarantee that all submitted page writes have completed. After I fixed that, the test innodb.instant_alter_import was rock solid also on my system when I used NVMe for mysql-test/var. It never failed on /dev/shm.

            I am unable to replay the trace that led to the SIGKILL before the failed startup, so I cannot confirm whether this failure was due to the above reason. I think that it is very plausible. Therefore, it would be good to test the latest 10.5 or 10.6 (with the final pushed version of MDEV-25954) to see if there are any recovery failures.

            marko Marko Mäkelä added a comment - There was an incorrect change to os_aio_wait_until_no_pending_writes() in a development version of MDEV-25954 , which caused the test innodb.instant_alter_import to fail. We will still have to ensure that there is no doublewrite batch pending; otherwise os_aio_wait_until_no_pending_writes() would not guarantee that all submitted page writes have completed. After I fixed that, the test innodb.instant_alter_import was rock solid also on my system when I used NVMe for mysql-test/var . It never failed on /dev/shm . I am unable to replay the trace that led to the SIGKILL before the failed startup, so I cannot confirm whether this failure was due to the above reason. I think that it is very plausible. Therefore, it would be good to test the latest 10.5 or 10.6 (with the final pushed version of MDEV-25954 ) to see if there are any recovery failures.

            perl rqg.pl \
            --grammar=conf/mariadb/innodb_compression_encryption.yy \
            --gendata=conf/mariadb/innodb_compression_encryption.zz \
            --max_gd_duration=1800 \
            --reporters=CrashRecovery1 \
            --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
            --mysqld=--loose-idle_write_transaction_timeout=0 \
            --mysqld=--loose-idle_transaction_timeout=0 \
            --mysqld=--loose-idle_readonly_transaction_timeout=0 \
            --mysqld=--connect_timeout=60 \
            --mysqld=--interactive_timeout=28800 \
            --mysqld=--slave_net_timeout=60 \
            --mysqld=--net_read_timeout=30 \
            --mysqld=--net_write_timeout=60 \
            --mysqld=--loose-table_lock_wait_timeout=50 \
            --mysqld=--wait_timeout=28800 \
            --mysqld=--lock-wait-timeout=86400 \
            --mysqld=--innodb-lock-wait-timeout=50 \
            --no-mask \
            --queries=10000000 \
            --seed=random \
            --reporters=Backtrace \
            --reporters=ErrorLog \
            --reporters=Deadlock1 \
            --validators=None \
            --mysqld=--log_output=none \
            --mysqld=--log-bin \
            --mysqld=--log_bin_trust_function_creators=1 \
            --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
            --engine=InnoDB \
            --restart_timeout=240 \
            --mysqld=--plugin-load-add=file_key_management.so \
            --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
            --duration=300 \
            --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
            --mysqld=--loose-innodb-sync-debug \
            --mysqld=--innodb_stats_persistent=off \
            --mysqld=--innodb_adaptive_hash_index=on \
            --mysqld=--loose-innodb_evict_tables_on_commit_debug=off \
            --mysqld=--loose-max-statement-time=30 \
            --threads=9 \
            --mysqld=--innodb-use-native-aio=0 \
            --rr=Extended \
            --rr_options="--chaos --wait" \
            --mysqld=--innodb_page_size=16K \
            --mysqld=--innodb-buffer-pool-size=8M \
            --no_mask 
            ....
            --script_debug=_nix_
             
            Buildscript:  bld_asan_O2_O0.sh
             
            RQG error pattern:
            [ 'TBR-1080', 'INFO: Reporter \'CrashRecovery1\': Attempting database recovery using the server.+\[Note\] InnoDB: Starting crash recovery from checkpoint LSN.+\[ERROR\] InnoDB: Cannot apply log to .{1,100}of corrupted file .{1,100}\.ibd.+\[ERROR\] Plugin \'InnoDB\' registration as a STORAGE ENGINE failed.+RESULT: The RQG run ended with status STATUS_RECOVERY_FAILURE' ],
            

            mleich Matthias Leich added a comment - perl rqg.pl \ --grammar=conf/mariadb/innodb_compression_encryption.yy \ --gendata=conf/mariadb/innodb_compression_encryption.zz \ --max_gd_duration=1800 \ --reporters=CrashRecovery1 \ --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \ --mysqld=--loose-idle_write_transaction_timeout=0 \ --mysqld=--loose-idle_transaction_timeout=0 \ --mysqld=--loose-idle_readonly_transaction_timeout=0 \ --mysqld=--connect_timeout=60 \ --mysqld=--interactive_timeout=28800 \ --mysqld=--slave_net_timeout=60 \ --mysqld=--net_read_timeout=30 \ --mysqld=--net_write_timeout=60 \ --mysqld=--loose-table_lock_wait_timeout=50 \ --mysqld=--wait_timeout=28800 \ --mysqld=--lock-wait-timeout=86400 \ --mysqld=--innodb-lock-wait-timeout=50 \ --no-mask \ --queries=10000000 \ --seed=random \ --reporters=Backtrace \ --reporters=ErrorLog \ --reporters=Deadlock1 \ --validators=None \ --mysqld=--log_output=none \ --mysqld=--log-bin \ --mysqld=--log_bin_trust_function_creators=1 \ --mysqld=--loose-debug_assert_on_not_freed_memory=0 \ --engine=InnoDB \ --restart_timeout=240 \ --mysqld=--plugin-load-add=file_key_management.so \ --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \ --duration=300 \ --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \ --mysqld=--loose-innodb-sync-debug \ --mysqld=--innodb_stats_persistent=off \ --mysqld=--innodb_adaptive_hash_index=on \ --mysqld=--loose-innodb_evict_tables_on_commit_debug=off \ --mysqld=--loose-max-statement-time=30 \ --threads=9 \ --mysqld=--innodb-use-native-aio=0 \ --rr=Extended \ --rr_options="--chaos --wait" \ --mysqld=--innodb_page_size=16K \ --mysqld=--innodb-buffer-pool-size=8M \ --no_mask .... --script_debug=_nix_   Buildscript: bld_asan_O2_O0.sh   RQG error pattern: [ 'TBR-1080', 'INFO: Reporter \'CrashRecovery1\': Attempting database recovery using the server.+\[Note\] InnoDB: Starting crash recovery from checkpoint LSN.+\[ERROR\] InnoDB: Cannot apply log to .{1,100}of corrupted file .{1,100}\.ibd.+\[ERROR\] Plugin \'InnoDB\' registration as a STORAGE ENGINE failed.+RESULT: The RQG run ended with status STATUS_RECOVERY_FAILURE' ],

            mleich, please reopen this if you see it occurring in a main branch.

            In the Description, I wrote that it occurred in 10.6 3a566de22db6990faf213f483a722645c1ad8b9e, but in reality that was a development branch that included an early version of MDEV-25954 that was incorrect and caused us to wrongly assume that some page write had been completed, while in reality it was still in progress.

            marko Marko Mäkelä added a comment - mleich , please reopen this if you see it occurring in a main branch. In the Description, I wrote that it occurred in 10.6 3a566de22db6990faf213f483a722645c1ad8b9e, but in reality that was a development branch that included an early version of MDEV-25954 that was incorrect and caused us to wrongly assume that some page write had been completed, while in reality it was still in progress.

            People

              mleich Matthias Leich
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.