[MDEV-31487] Recovery or backup failure after innodb_undo_log_truncate=ON Created: 2023-06-16  Updated: 2023-07-10  Resolved: 2023-06-27

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0, 11.1, 11.2
Fix Version/s: 10.5.22, 10.6.15, 10.9.8, 10.10.6, 10.11.5, 11.0.3, 11.1.2, 11.2.1

Type: Bug Priority: Critical
Reporter: Matthias Leich Assignee: Marko Mäkelä
Resolution: Fixed Votes: 1
Labels: rr-profile-analyzed

Attachments: File data.tar.xz    
Issue Links:
Duplicate
is duplicated by MDEV-31488 Restart on backupped data fails with ... Closed
Problem/Incident
is caused by MDEV-30479 OPT_PAGE_CHECKSUM mismatch during rec... Closed

 Description   

origin/preview-11.2-preview 79f9e94195dc8a007debd27ec4e8d2c15f094a97 2023-06-13T17:23:05+10:00
Scenario:
1. Start source server, generate some initial data
2. Several sessions run concurrent SQL
3. During 2. is ongoing
    mariabackup --backup data from source server to some other location
    mariabackup --prepare on that other location
    This fails with
    # 2023-06-13T20:43:22 [73003] | [rr 136876 18773]2023-06-13 20:41:29 0 [Note] InnoDB: Last binlog file './mysql-bin.000001', position 11543118
   # 2023-06-13T20:43:22 [73003] | [rr 136876 18786]2023-06-13 20:41:29 0 [ERROR] InnoDB: Space id and page no stored in the page, read in are [page id: space=50528773, page number=1165950976], should be [page id: space=3, page number=0]
   # 2023-06-13T20:43:22 [73003] | [rr 136876 18810][00] 2023-06-13 20:41:29 Last binlog file ./mysql-bin.000001, position 11543118
   # 2023-06-13T20:43:22 [73003] | [rr 136876 18849]2023-06-13 20:41:29 0 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './undo001' page [page id: space=1, page number=4]. You may have to recover from a backup.
 
pluto:/data/results/1686674959/TBR-1929--TBR-1928$ _RR_TRACE_DIR=./1_clone/rr/ rr replay --mark-stdio
 
RQG
====
# git clone https://github.com/mleich1/rqg --branch <pick the right branch> RQG
#
# GIT_SHOW: HEAD -> master, origin/master, origin/HEAD a8c8cd1021480e64ef979d326a2543c577d1068a 2023-06-09T18:22:34+02:00
# rqg.pl  : Version 4.2.1 (2022-12)
#
# $RQG_HOME/rqg.pl \
# --grammar=conf/engines/many_indexes.yy \
# --gendata=conf/engines/many_indexes.zz \
# --max_gd_duration=900 \
# --reporters=Mariabackup_linux \
# --mysqld=--loose-innodb-log-file-size=200M \
# --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
# --mysqld=--loose-idle_write_transaction_timeout=0 \
# --mysqld=--loose-idle_transaction_timeout=0 \
# --mysqld=--loose-idle_readonly_transaction_timeout=0 \
# --mysqld=--connect_timeout=60 \
# --mysqld=--interactive_timeout=28800 \
# --mysqld=--slave_net_timeout=60 \
# --mysqld=--net_read_timeout=30 \
# --mysqld=--net_write_timeout=60 \
# --mysqld=--loose-table_lock_wait_timeout=50 \
# --mysqld=--wait_timeout=28800 \
# --no_mask \
# --queries=10000000 \
# --seed=random \
# --reporters=None \
# --reporters=ErrorLog \
# --reporters=Deadlock \
# --validators=None \
# --mysqld=--log_output=none \
# --mysqld=--log_bin_trust_function_creators=1 \
# --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
# --engine=InnoDB \
# --restart_timeout=240 \
# --mysqld=--plugin-load-add=file_key_management.so \
# --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
# --mysqld=--plugin-load-add=provider_lzo.so \
# --mysqld=--plugin-load-add=provider_bzip2.so \
# --mysqld=--plugin-load-add=provider_lzma.so \
# --mysqld=--plugin-load-add=provider_snappy.so \
# --mysqld=--plugin-load-add=provider_lz4.so \
# --mysqld=--loose-innodb_compression_level=1 \
# --duration=300 \
# --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
# --mysqld=--lock-wait-timeout=86400 \
# --mysqld=--innodb-lock-wait-timeout=50 \
# --mysqld=--loose-innodb_read_only_compressed=OFF \
# --mysqld=--innodb_stats_persistent=on \
# --mysqld=--innodb_adaptive_hash_index=on \
# --mysqld=--innodb_random_read_ahead=OFF \
# --mysqld=--log-bin \
# --mysqld=--sync-binlog=1 \
# --mysqld=--loose-innodb_evict_tables_on_commit_debug=off \
# --mysqld=--loose-max-statement-time=30 \
# --threads=9 \
# --mysqld=--innodb_use_native_aio=1 \
# --mysqld=--innodb_undo_log_truncate=ON \
# --mysqld=--innodb_rollback_on_timeout=OFF \
# --mysqld=--innodb_page_size=8K \
# --mysqld=--innodb-buffer-pool-size=256M \
# <local settings>



 Comments   
Comment by Marko Mäkelä [ 2023-06-26 ]

I can also reproduce this locally with the data.tar.xz that I copied from the test location. The following patch fixes this:

diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
index 82d4ba3afde..9643e029f74 100644
--- a/storage/innobase/log/log0recv.cc
+++ b/storage/innobase/log/log0recv.cc
@@ -1841,7 +1841,7 @@ inline bool page_recv_t::trim(lsn_t start_lsn)
 {
   while (log.head)
   {
-    if (log.head->lsn > start_lsn) return false;
+    if (log.head->lsn >= start_lsn) return false;
     last_offset= 1; /* the next record must not be same_page */
     log_rec_t *next= log.head->next;
     recv_sys.free(log.head);

This would fully revert an earlier fix of MDEV-30479.

I think that instead of applying the above change, we must ensure the correct recovery of the truncation LSN.

Comment by Marko Mäkelä [ 2023-06-26 ]

Fortunately, the rr replay traces of MDEV-30479 are still accessible. I will reanalyze that and think of a better fix.

Comment by Marko Mäkelä [ 2023-06-27 ]

thiru’s observations from debugging a different case led me to the following patch:

diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
index 82d4ba3afde..b4a5957bb5e 100644
--- a/storage/innobase/log/log0recv.cc
+++ b/storage/innobase/log/log0recv.cc
@@ -2673,9 +2673,9 @@ recv_sys_t::parse_mtr_result recv_sys_t::parse(source &l, bool if_exists)
                         TRX_SYS_MAX_UNDO_SPACES, "compatibility");
           /* The entire undo tablespace will be reinitialized by
           innodb_undo_log_truncate=ON. Discard old log for all pages. */
-          trim({space_id, 0}, lsn);
+          trim({space_id, 0}, start_lsn);
           truncated_undo_spaces[space_id - srv_undo_space_id_start]=
-            { lsn, page_no };
+            { start_lsn, page_no };
           if (!store && undo_space_trunc)
             undo_space_trunc(space_id);
 #endif

This follows the idea of the MDEV-30479 fix. We must keep track of the start LSN of the mini-transaction that truncated the undo tablespace. The start LSN is also being used for handling mlog_init (MDEV-12699).

Comment by Matthias Leich [ 2023-06-29 ]

Hitting that problem was not rare. The corresponding error patterns
'ERROR: Reporter Mariabackup : First prepare returned.+ Executing backup: .+ Executing first prepare: .+\[ERROR\] InnoDB: Database page corruption on disk or a failed read of file \'\.\/undo.+RESULT: The RQG run ended with status STATUS_BACKUP_FAILURE' ],
 'ERROR: Reporter Mariabackup : First prepare returned.+ Executing backup: .+ Executing first prepare: .+\[ERROR\] InnoDB: Space id and page no stored in the page,.+RESULT: The RQG run ended with status STATUS_BACKUP_FAILURE' ],
 showed usually up together.
This happened last time on origin/11.2, 11.2 dbc342959233abd0ee70dace85409d042684162d 2023-06-26T16:22:27+02:00 and
preview-11.2-preview.
The excessive RQG test runs on
- 11.2 2867894ac6ca23fc2f82cfad2dd510351f597325 2023-06-28T09:39:28+03:00
- origin/bb-11.2-MDEV-14795 f4ae37836c3a6632a4284be853675bcbe3f565d5 2023-06-28T15:48:11+05:30
did not replay these error patterns again.
So we can assume that the reason for the problem is most probably fixed.

Comment by Marko Mäkelä [ 2023-06-29 ]

Thank you, mleich. I marked the bug MDEV-31488 that thiru analyzed as a duplicate of this one.

Generated at Thu Feb 08 10:24:15 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.