[MDEV-26326] MDEV-24626 (remove synchronous page0 write) seems to cause mariabackup to skip valid ibd file. Created: 2021-08-09  Updated: 2022-07-25  Resolved: 2022-02-01

Status: Closed
Project: MariaDB Server
Component/s: Backup, Storage Engine - InnoDB
Affects Version/s: 10.6, 10.7, 10.8
Fix Version/s: 10.6.6, 10.7.2, 10.8.1

Type: Bug
Priority: Critical
Reporter: Vladislav Vaintroub
Assignee: Thirunarayanan Balathandayuthapani
Resolution: Fixed
Votes: 0
Labels: regression-10.6

Issue Links:
Blocks
blocks MDEV-26154 mariabackup.xb_compressed_encrypted f... Open
Problem/Incident
causes MDEV-29137 mariabackup excessive logging of ddl ... Closed
is caused by MDEV-24626 Remove synchronous write of page0 and... Closed
Relates
relates to MDEV-14481 Execute InnoDB crash recovery in the ... Closed
relates to MDEV-25909 Unnecessary calls to fil_ibd_load() s... Open
relates to MDEV-27424 mariabackup ignores physically corrup... Closed
relates to MDEV-24626 Remove synchronous write of page0 and... Closed

 Description   

The log at http://buildbot.askmonty.org/buildbot/builders/winx64-packages/builds/26370/steps/test/logs/stdio shows that the file t.ibd is not copied by copy-back, hence it is not in the backup.

A theory:
An empty .ibd file is not copied into the backup because it is not loaded by xb_load_single_table_tablespace. There is a test, small_ibd.test, that actually checks that.
Perhaps that no longer works well, and the file should be loaded, or copied, or handled in some similar way.



 Comments   
Comment by Marko Mäkelä [ 2021-09-14 ]

If an empty .ibd file is not being copied into a backup, then the backup will have to be adjusted somehow. Maybe during mariabackup --prepare, we should actually create data files in recv_sys_t::recover_deferred() if that is not already the case.

Somewhat related to this, MDEV-25909 has been filed for avoiding unnecessary repeated reads of the first page of a data file. That could affect mariabackup --prepare as well.

Comment by Marko Mäkelä [ 2021-11-11 ]

I just analyzed an rr replay trace of a failed mariabackup --prepare. An assertion would fail in my development branch because buf_flush_discard_page() was invoked while deferred_spaces had not yet been converted to tablespaces.

I was unable to reproduce the race even when restoring the same backup many times. I think that a simple fix is as follows (possibly to be revised in MDEV-14481):

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
index 1f720989f6f..30f46e6024f 100644
--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -1301,7 +1301,7 @@ static void buf_flush_LRU_list_batch(ulint max, flush_counters_t *n)
         space= nullptr;
       }
 
-      if (!space)
+      if (!space && !recv_recovery_is_on())
         buf_flush_discard_page(bpage);
       else if (neighbors && space->is_rotational())
       {
 

A cleaner fix could be to retain an additional X-latch on the page, instead of only retaining a buffer-fix. Then, no change to the page cleaner should be needed. However, this requires the page latch to always exist, also for ROW_FORMAT=COMPRESSED blocks. (That is possible in my development branch.)
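
To illustrate only the condition changed by the patch above, here is a minimal, self-contained C++ sketch. It is not InnoDB code: LruAction, lru_action_before() and lru_action_after() are hypothetical names, the recovery_on flag stands in for recv_recovery_is_on(), and everything that is not a discard is collapsed into a single Other outcome, because the quoted hunk does not show what the remaining branches do.

```cpp
// Simplified model of the branch touched by the patch.
// All names are hypothetical; this is not the InnoDB implementation.
enum class LruAction { Discard, Other };

// Before the patch: a page whose tablespace object is missing is
// always discarded, whether or not recovery is in progress.
LruAction lru_action_before(bool has_space, bool /*recovery_on*/) {
  return !has_space ? LruAction::Discard : LruAction::Other;
}

// After the patch: the page is discarded only if the tablespace object
// is missing AND recovery is not in progress (recovery_on stands in
// for recv_recovery_is_on()).
LruAction lru_action_after(bool has_space, bool recovery_on) {
  return (!has_space && !recovery_on) ? LruAction::Discard
                                      : LruAction::Other;
}
```

In this model, the two functions differ only for a page with no tablespace object while recovery is on, which is the situation described above where deferred_spaces has not yet been converted to tablespaces.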

Comment by Marko Mäkelä [ 2022-01-12 ]

The link in the description is stale, and no summary of the failure is available. On a quick search of the cross-reference, I found some failures of the test mariabackup.xb_compressed_encrypted that fail like this:

10.6 48bbc447335a0b3ec698e975d48ad491

 
mariabackup.xb_compressed_encrypted '16k,innodb' w3 [ fail ]
        Test ended at 2021-09-17 07:00:03
 
CURRENT_TEST: mariabackup.xb_compressed_encrypted
/mnt/buildbot/build/mariadb-10.6.5/extra/mariabackup/mariabackup based on MariaDB server 10.6.5-MariaDB Linux (x86_64)
[00] 2021-09-17 07:00:02 completed OK!
mysqltest: At line 29: query 'select sum(c1) from t1' failed: ER_GET_ERRNO (1030): Got error 194 "Tablespace is missing for a table" from storage engine InnoDB

In fact, mariabackup.xb_compressed_encrypted is the only mariabackup.* test that has "Tablespace is missing for a table" in the "Failure Output" section. There were a couple more tests failing like this for bb-10.6-MDEV-24626_1.

In MDEV-27424, I posted a rough mtr test case that could explain this scenario, which is indeed caused by MDEV-24626.

Comment by Thirunarayanan Balathandayuthapani [ 2022-01-18 ]

This issue should be fixed by the patch for MDEV-27424 (bb-10.6-thiru).

Comment by Marko Mäkelä [ 2022-02-01 ]

The fix looks OK to push. I would suggest referring to this bug in the commit and letting the later report MDEV-27424 remain open, because it is unclear which version that report was filed for.

Generated at Thu Feb 08 09:44:27 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.