[MDEV-26001] InnoDB: Cannot apply log to [page id: space=152, page number=0] of corrupted file - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Cannot Reproduce
Affects Version/s: N/A
Fix Version/s: N/A
Component/s: Storage Engine - InnoDB
Labels:
- recovery
- rr-profile-analyzed

Description

We got a recovery failure due to ~~MDEV-24626~~:

10.6 3a566de22db6990faf213f483a722645c1ad8b9e
2021-06-23 17:19:30 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=161810155,161810155
2021-06-23 17:19:30 0 [ERROR] InnoDB: Cannot apply log to [page id: space=152, page number=0] of corrupted file './test/H.ibd'
2021-06-23 17:19:30 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1447] with error Data structure corruption

To retrieve the data (due to large redo log file, it is too large to be attached here):

ssh sdp 'tar cjf - -C /data/Results/1624453151/Marko1/dev/shm/vardir/1624453151/163/1 data' | tar xjf -

The problem is around here:

diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc

index 16250928845..3e602297363 100644

--- a/storage/innobase/log/log0recv.cc

+++ b/storage/innobase/log/log0recv.cc

@@ -806,8 +806,10 @@ bool recv_sys_t::recover_deferred(recv_sys_t::map::iterator &p,

   recv_spaces_t::iterator it{recv_spaces.find(first.space())};

   ut_ad(it != recv_spaces.end());

-  if (!first.page_no() && p->second.state == page_recv_t::RECV_WILL_NOT_READ)

+  if (!first.page_no())

+    ut_ad(p->second.state == page_recv_t::RECV_WILL_NOT_READ ||

+          p->second.state == page_recv_t::RECV_NOT_PROCESSED);

     mtr_t mtr;

     buf_block_t *block= recover_low(first, p, mtr, free_block);

     ut_ad(block == free_block);

The state is not RECV_WILL_NOT_READ, as expected. The RECV_NOT_PROCESSED was inserted here:

#0  recv_sys_t::add (this=0x564e77653838 <recv_sys>, page_id={m_id = 652835028992}, start_lsn=161999632, lsn=161999940, l=0x7f90f08b2d05 "8\200\030", len=9)

    at /mariadb/10.6-merge/storage/innobase/log/log0recv.cc:1974

#1  0x0000564e7684a73e in recv_sys_t::parse (this=0x564e77653838 <recv_sys>, checkpoint_lsn=161810155, store=0x7ffe6fbb46fc, apply=true) at /mariadb/10.6-merge/storage/innobase/log/log0recv.cc:2393

#2  0x0000564e7685b186 in recv_scan_log_recs (store=0x7ffe6fbb46fc, log_block=0x7f90f0a96000 '\245' <repeats 200 times>..., checkpoint_lsn=161810155, start_lsn=161940992, end_lsn=162006528,

    contiguous_lsn=0x7ffe6fbb4a28, group_scanned_lsn=0x564e77f23208 <log_sys+520>) at /mariadb/10.6-merge/storage/innobase/log/log0recv.cc:3460

#3  0x0000564e76853608 in recv_group_scan_log_recs (checkpoint_lsn=161810155, contiguous_lsn=0x7ffe6fbb4a28, last_phase=false) at /mariadb/10.6-merge/storage/innobase/log/log0recv.cc:3543

#4  0x0000564e768520ac in recv_recovery_from_checkpoint_start (flush_lsn=41238) at /mariadb/10.6-merge/storage/innobase/log/log0recv.cc:3946

Attachments

Issue Links

relates to

MDEV-24626 Remove synchronous write of page0 and flushing file during file creation

Closed

MDEV-25954 Remove superfluous os_aio_wait_until_no_pending_writes() from Innodb

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2021-06-23 18:01

There was an incorrect change to os_aio_wait_until_no_pending_writes() in a development version of ~~MDEV-25954~~, which caused the test innodb.instant_alter_import to fail. We will still have to ensure that there is no doublewrite batch pending; otherwise os_aio_wait_until_no_pending_writes() would not guarantee that all submitted page writes have completed. After I fixed that, the test innodb.instant_alter_import was rock solid also on my system when I used NVMe for mysql-test/var. It never failed on /dev/shm.

I am unable to replay the trace that led to the SIGKILL before the failed startup, so I cannot confirm whether this failure was due to the above reason. I think that it is very plausible. Therefore, it would be good to test the latest 10.5 or 10.6 (with the final pushed version of ~~MDEV-25954~~) to see if there are any recovery failures.

Marko Mäkelä added a comment - 2021-06-23 18:01 There was an incorrect change to os_aio_wait_until_no_pending_writes() in a development version of MDEV-25954 , which caused the test innodb.instant_alter_import to fail. We will still have to ensure that there is no doublewrite batch pending; otherwise os_aio_wait_until_no_pending_writes() would not guarantee that all submitted page writes have completed. After I fixed that, the test innodb.instant_alter_import was rock solid also on my system when I used NVMe for mysql-test/var . It never failed on /dev/shm . I am unable to replay the trace that led to the SIGKILL before the failed startup, so I cannot confirm whether this failure was due to the above reason. I think that it is very plausible. Therefore, it would be good to test the latest 10.5 or 10.6 (with the final pushed version of MDEV-25954 ) to see if there are any recovery failures.

Matthias Leich added a comment - 2021-06-23 18:02

perl rqg.pl \

--grammar=conf/mariadb/innodb_compression_encryption.yy \

--gendata=conf/mariadb/innodb_compression_encryption.zz \

--max_gd_duration=1800 \

--reporters=CrashRecovery1 \

--mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \

--mysqld=--loose-idle_write_transaction_timeout=0 \

--mysqld=--loose-idle_transaction_timeout=0 \

--mysqld=--loose-idle_readonly_transaction_timeout=0 \

--mysqld=--connect_timeout=60 \

--mysqld=--interactive_timeout=28800 \

--mysqld=--slave_net_timeout=60 \

--mysqld=--net_read_timeout=30 \

--mysqld=--net_write_timeout=60 \

--mysqld=--loose-table_lock_wait_timeout=50 \

--mysqld=--wait_timeout=28800 \

--mysqld=--lock-wait-timeout=86400 \

--mysqld=--innodb-lock-wait-timeout=50 \

--no-mask \

--queries=10000000 \

--seed=random \

--reporters=Backtrace \

--reporters=ErrorLog \

--reporters=Deadlock1 \

--validators=None \

--mysqld=--log_output=none \

--mysqld=--log-bin \

--mysqld=--log_bin_trust_function_creators=1 \

--mysqld=--loose-debug_assert_on_not_freed_memory=0 \

--engine=InnoDB \

--restart_timeout=240 \

--mysqld=--plugin-load-add=file_key_management.so \

--mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \

--duration=300 \

--mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \

--mysqld=--loose-innodb-sync-debug \

--mysqld=--innodb_stats_persistent=off \

--mysqld=--innodb_adaptive_hash_index=on \

--mysqld=--loose-innodb_evict_tables_on_commit_debug=off \

--mysqld=--loose-max-statement-time=30 \

--threads=9 \

--mysqld=--innodb-use-native-aio=0 \

--rr=Extended \

--rr_options="--chaos --wait" \

--mysqld=--innodb_page_size=16K \

--mysqld=--innodb-buffer-pool-size=8M \

--no_mask

....

--script_debug=_nix_

Buildscript:  bld_asan_O2_O0.sh

RQG error pattern:

[ 'TBR-1080', 'INFO: Reporter \'CrashRecovery1\': Attempting database recovery using the server.+\[Note\] InnoDB: Starting crash recovery from checkpoint LSN.+\[ERROR\] InnoDB: Cannot apply log to .{1,100}of corrupted file .{1,100}\.ibd.+\[ERROR\] Plugin \'InnoDB\' registration as a STORAGE ENGINE failed.+RESULT: The RQG run ended with status STATUS_RECOVERY_FAILURE' ],

Matthias Leich added a comment - 2021-06-23 18:02 perl rqg.pl \ --grammar=conf/mariadb/innodb_compression_encryption.yy \ --gendata=conf/mariadb/innodb_compression_encryption.zz \ --max_gd_duration=1800 \ --reporters=CrashRecovery1 \ --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \ --mysqld=--loose-idle_write_transaction_timeout=0 \ --mysqld=--loose-idle_transaction_timeout=0 \ --mysqld=--loose-idle_readonly_transaction_timeout=0 \ --mysqld=--connect_timeout=60 \ --mysqld=--interactive_timeout=28800 \ --mysqld=--slave_net_timeout=60 \ --mysqld=--net_read_timeout=30 \ --mysqld=--net_write_timeout=60 \ --mysqld=--loose-table_lock_wait_timeout=50 \ --mysqld=--wait_timeout=28800 \ --mysqld=--lock-wait-timeout=86400 \ --mysqld=--innodb-lock-wait-timeout=50 \ --no-mask \ --queries=10000000 \ --seed=random \ --reporters=Backtrace \ --reporters=ErrorLog \ --reporters=Deadlock1 \ --validators=None \ --mysqld=--log_output=none \ --mysqld=--log-bin \ --mysqld=--log_bin_trust_function_creators=1 \ --mysqld=--loose-debug_assert_on_not_freed_memory=0 \ --engine=InnoDB \ --restart_timeout=240 \ --mysqld=--plugin-load-add=file_key_management.so \ --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \ --duration=300 \ --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \ --mysqld=--loose-innodb-sync-debug \ --mysqld=--innodb_stats_persistent=off \ --mysqld=--innodb_adaptive_hash_index=on \ --mysqld=--loose-innodb_evict_tables_on_commit_debug=off \ --mysqld=--loose-max-statement-time=30 \ --threads=9 \ --mysqld=--innodb-use-native-aio=0 \ --rr=Extended \ --rr_options="--chaos --wait" \ --mysqld=--innodb_page_size=16K \ --mysqld=--innodb-buffer-pool-size=8M \ --no_mask .... --script_debug=_nix_ Buildscript: bld_asan_O2_O0.sh RQG error pattern: [ 'TBR-1080', 'INFO: Reporter \'CrashRecovery1\': Attempting database recovery using the server.+\[Note\] InnoDB: Starting crash recovery from checkpoint LSN.+\[ERROR\] InnoDB: Cannot apply log to .{1,100}of corrupted file .{1,100}\.ibd.+\[ERROR\] Plugin \'InnoDB\' registration as a STORAGE ENGINE failed.+RESULT: The RQG run ended with status STATUS_RECOVERY_FAILURE' ],

Marko Mäkelä added a comment - 2021-06-28 12:18

mleich, please reopen this if you see it occurring in a main branch.

In the Description, I wrote that it occurred in 10.6 3a566de22db6990faf213f483a722645c1ad8b9e, but in reality that was a development branch that included an early version of ~~MDEV-25954~~ that was incorrect and caused us to wrongly assume that some page write had been completed, while in reality it was still in progress.

Marko Mäkelä added a comment - 2021-06-28 12:18 mleich , please reopen this if you see it occurring in a main branch. In the Description, I wrote that it occurred in 10.6 3a566de22db6990faf213f483a722645c1ad8b9e, but in reality that was a development branch that included an early version of MDEV-25954 that was incorrect and caused us to wrongly assume that some page write had been completed, while in reality it was still in progress.

MariaDB Server

InnoDB: Cannot apply log to [page id: space=152, page number=0] of corrupted file

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration