[MDEV-18194] Incremental prepare tries to access page which is out of tablespace bounds. Created: 2019-01-09  Updated: 2020-08-25  Resolved: 2019-02-01

Status: Closed
Project: MariaDB Server
Component/s: mariabackup, Storage Engine - InnoDB
Affects Version/s: 10.2.21, 10.3, 10.4
Fix Version/s: 10.4.3, 10.2.22, 10.3.13

Type: Bug Priority: Blocker
Reporter: Elena Stepanova Assignee: Thirunarayanan Balathandayuthapani
Resolution: Fixed Votes: 2
Labels: None

Attachments: File mbackup_backup_0.log     File mbackup_backup_1.log     File mbackup_backup_2.log     File mbackup_prepare_0.log     File mbackup_prepare_1.log     File mysql.err_orig    
Issue Links:
Relates
relates to MDEV-11556 InnoDB redo log apply fails to adjust... Closed
relates to MDEV-18336 Remove backup_fix_ddl() during backup Open
relates to MDEV-20059 mariabackup-10.3.22 restore normal, b... Open
relates to MDEV-23344 InnoDB: Fatal: Trying to access page ... Closed

 Description   

10.4 30da40bb8c3

2019-01-09 22:15:09 0 [Note] InnoDB: Starting final batch to recover 2011 pages from redo log.
2019-01-09 22:15:09 0 [ERROR] [FATAL] InnoDB: Trying to read page number 96 in space 1239, space name test/t8#P#ver_pn, which is outside the tablespace bounds. Byte offset 0, len 16384
190109 22:15:09 [ERROR] mysqld got signal 6 ;
 
#4  0x00007fdd4b54d3fa in abort () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x000055da0a61be66 in ib::fatal::~fatal (this=0x7fff327b27d0, __in_chrg=<optimized out>) at /data/src/10.4/storage/innobase/ut/ut0ut.cc:765
#6  0x000055da0a39c232 in fil_report_invalid_page_access (block_offset=96, space_id=1239, space_name=0x55da0d79b120 "test/t8#P#ver_pn", byte_offset=0, len=16384, is_read=true) at /data/src/10.4/storage/innobase/fil/fil0fil.cc:4141
#7  0x000055da0a39ca5d in fil_io (type=..., sync=false, page_id=..., page_size=..., byte_offset=0, len=16384, buf=0x7fdd4695c000, message=0x7fdd4467aa98, ignore_missing_space=false) at /data/src/10.4/storage/innobase/fil/fil0fil.cc:4287
#8  0x000055da0a34fbb5 in buf_read_page_low (err=0x7fff327b2f28, sync=false, type=64, mode=132, page_id=..., page_size=..., unzip=true, ignore_missing_space=false) at /data/src/10.4/storage/innobase/buf/buf0rea.cc:181
#9  0x000055da0a351871 in buf_read_recv_pages (sync=false, space_id=1239, page_nos=0x7fff327b3110, n_stored=32) at /data/src/10.4/storage/innobase/buf/buf0rea.cc:924
#10 0x000055da0a4adc6b in recv_read_in_area (page_id=...) at /data/src/10.4/storage/innobase/log/log0recv.cc:2001
#11 0x000055da0a4ae2d4 in recv_apply_hashed_log_recs (last_batch=true) at /data/src/10.4/storage/innobase/log/log0recv.cc:2106
#12 0x000055da0a5c4393 in srv_start (create_new_db=false) at /data/src/10.4/storage/innobase/srv/srv0start.cc:1861
#13 0x000055da09b62845 in innodb_init () at /data/src/10.4/extra/mariabackup/xtrabackup.cc:1988
#14 0x000055da09b6c9d8 in xtrabackup_prepare_func (argv=0x55da0cfd6a88) at /data/src/10.4/extra/mariabackup/xtrabackup.cc:5594
#15 0x000055da09b6ebda in main_low (argv=0x55da0cfd6a88) at /data/src/10.4/extra/mariabackup/xtrabackup.cc:6418
#16 0x000055da09b6e3ee in main (argc=7, argv=0x7fff327ba398) at /data/src/10.4/extra/mariabackup/xtrabackup.cc:6211

The log is attached as mbackup_prepare_1.log

Backup dirs before any attempts to prepare are stored here:
ftp://ftp.askmonty.org/public/mdev18194/

backup_0 is the initial full backup.
backup_1 is the incremental backup.

To reproduce, unpack the backups and run

bin/mariabackup --prepare --apply-log-only --innodb-file-io-threads=1 --target-dir=backup_0 --user=root
bin/mariabackup --prepare --apply-log-only --innodb-file-io-threads=1 --target-dir=backup_0 --incremental-dir=backup_1 --user=root


Notes on the origin of the report. Please be aware it's not easily reproducible the same way, the test is concurrent by nature, and apparently there is a race condition involved.

The problem initially happened by running the concurrent test with MariaBackupIncremental scenario:

github.com/MariaDB/randgen elenst-dev 0ebbe452a69, github.com/elenst/mariadb-toolbox master 6ded4b71568

perl ./runall-new.pl --basedir=/home/travis/server/10.4 --vardir=/home/travis/logs/vardir2 --duration=350 --threads=4 --seed=1547032787 --reporters=Backtrace,ErrorLog,Deadlock --skip-gendata --gendata-advanced --views --grammar=conf/mariadb/generic-dml.yy --redefine=conf/mariadb/bulk_insert.yy --mysqld=--log_output=FILE --mysqld=--max-statement-time=20 --mysqld=--lock-wait-timeout=10 --mysqld=--loose-innodb-lock-wait-timeout=5 --mysqld=--loose-debug_assert_on_not_freed_memory=0 --filter=/home/travis/mariadb-toolbox/travis/10.4-combo-filter.ff --mysqld=--innodb-encrypt-tables --mysqld=--innodb-encrypt-log --mysqld=--innodb-encryption-threads=4 --mysqld=--aria-encrypt-tables=1 --mysqld=--encrypt-tmp-disk-tables=1 --mysqld=--file-key-management --mysqld=--file-key-management-filename=/home/travis/mariadb-toolbox/data/keys.txt --mysqld=--plugin-load-add=file_key_management --engine=Aria --scenario=MariaBackupIncremental --redefine=conf/mariadb/alter_table.yy --redefine=conf/mariadb/modules/admin.yy --redefine=conf/mariadb/versioning.yy --redefine=conf/mariadb/sequences.yy

The test performs the following steps:

  • start the server with

    "--no-defaults" "--basedir=/home/travis/server/10.4" "--lc-messages-dir=/home/travis/server/10.4/share" "--character-sets-dir=/home/travis/server/10.4/share/charsets" "--tmpdir=/home/travis/logs/vardir2/tmp" "--core-file" "--datadir=/home/travis/logs/vardir2/data" "--max-allowed-packet=128Mb" "--port=19300" "--socket=/home/travis/logs/vardir2/mysql.sock" "--pid-file=/home/travis/logs/vardir2/mysql.pid" "--general-log" "--general-log-file=/home/travis/logs/vardir2/mysql.log" --log_output=FILE --max-statement-time=20 --lock-wait-timeout=10 --loose-innodb-lock-wait-timeout=5 --loose-debug_assert_on_not_freed_memory=0 --innodb-encrypt-tables --innodb-encrypt-log --innodb-encryption-threads=4 --aria-encrypt-tables=1 --encrypt-tmp-disk-tables=1 --file-key-management --file-key-management-filename=/home/travis/mariadb-toolbox/data/keys.txt --plugin-load-add=file_key_management
    

  • run concurrent DML/DDL flow on the server;
  • after ~1 min, take the full backup from the running server:

    bin/mariabackup --backup --target-dir=/home/travis/logs/vardir2/backup_0 --protocol=tcp --port=19300 --user=root
    

    (succeeded, the log attached as mbackup_backup_0.log;

  • after another ~1 minute, run the incremental backup as

    bin/mariabackup --backup --target-dir=/home/travis/logs/vardir2/backup_1 --incremental-basedir=/home/travis/logs/vardir2/backup_0 --protocol=tcp --port=19300 --user=root
    

    (succeeded, the log attached as mbackup_backup_1.log;

  • after another ~1 minute, run the incremental backup again as

    bin/mariabackup --backup --target-dir=/home/travis/logs/vardir2/backup_2 --incremental-basedir=/home/travis/logs/vardir2/backup_1 --protocol=tcp --port=19300 --user=root
    

    (succeeded, the log attached as mbackup_backup_2.log;

  • let the test flow continue for a little while
  • shut down the server gracefully (succeeded, the error log is attached as mysql.err_orig;
  • run mariabackup --prepare on the full backup as

    bin/mariabackup --prepare --apply-log-only --innodb-file-io-threads=1 --target-dir=/home/travis/logs/vardir2/backup_0 --user=root
    

    (succeeded, the log attached as mbackup_prepare_0.log;

  • run mariabackup --prepare on the first incremental backup as

    bin/mariabackup --prepare --apply-log-only --innodb-file-io-threads=1 --target-dir=/home/travis/logs/vardir2/backup_0 --incremental-dir=/home/travis/logs/vardir2/backup_1 --user=root
    

    (failed, the log is attached as mbackup_prepare_1.log.


I have no information yet whether the problem happens on 10.1.



 Comments   
Comment by Marko Mäkelä [ 2019-01-17 ]

I think that this should likely affect mariabackup 10.1 already. The code should be equivalent between 10.2, 10.3, and 10.4.

Comment by Thirunarayanan Balathandayuthapani [ 2019-01-22 ]

MLOG_FILE_RENAME2 doesn't replay in mariabackup during prepare phase. Because file path stored in mariabackup as "test/t1.ibd".
But redo log records stores the file path as "./test/t1.ibd". So fil_op_replay_rename() does string matching and fails to
do the renaming of the file.

Comment by Thirunarayanan Balathandayuthapani [ 2019-01-22 ]

After preparing full backup phase:
Problematic space id is t8#P#ver_pn. Space id is 903. It has only 4 pages.

While preparing incremental backup phase:
t8#P#ver_pn.ibd has only .delta file. It is empty.
Based on .meta file, mariabackup prepare creates new ibd file with new space id(1239) and name it as t8#P#ver_pn.ibd
Redo log now tries to access 96th page. So it crashes.

Basically backup phase went wrong during incremental backup i think.

Comment by Thirunarayanan Balathandayuthapani [ 2019-01-22 ]

The following test case repeats the issue:

--source include/have_debug.inc
--source include/have_innodb.inc
--source include/have_partition.inc
 
call mtr.add_suppression("InnoDB: New log files created");
 
let $basedir=$MYSQLTEST_VARDIR/tmp/backup;
let $incremental_dir=$MYSQLTEST_VARDIR/tmp/backup_inc1;
let $backuplog=$MYSQLTEST_VARDIR/tmp/backup.log;
 
CREATE TABLE t1(i INT) ENGINE INNODB;
alter table t1 partition by hash(i) partitions 4;
CREATE TABLE t2(i INT PRIMARY KEY) ENGINE INNODB;
CREATE TABLE t3(i INT) ENGINE INNODB;
 
INSERT INTO t1 values(1), (2), (3), (4), (5), (6), (7), (8), (9);
insert into t1 select * from t1;
insert into t1 select * from t1;
 
echo # Create full backup , modify table, then create incremental/differential backup;
--disable_result_log
exec $XTRABACKUP --defaults-file=$MYSQLTEST_VARDIR/my.cnf  --backup --target-dir=$basedir;
--enable_result_log
drop table t1;
create table t4(f1 int not null, f2 int not null)engine=innodb;
insert into t4 values(1, 2), (2, 2), (3, 3), (5, 5), (6, 6), (4, 4), (9, 9);
insert into t4 select * from t4;
insert into t4 select * from t4;
insert into t4 select * from t4;
insert into t4 select * from t4;
insert into t4 select * from t4;
insert into t4 select * from t4;
insert into t4 select * from t4;
insert into t4 select * from t4;
insert into t4 select * from t4;
insert into t4 select * from t4;
alter table t4 partition by hash(f1) partitions 4;
rename table t4 to t1;
 
--echo # XTRABACKUP INCREMENTAL
exec $XTRABACKUP --defaults-file=$MYSQLTEST_VARDIR/my.cnf  --backup --target-dir=$incremental_dir --incremental-basedir=$basedir;
 
--echo # XTRABACKUP PREPARE
exec $XTRABACKUP --apply-log-only --prepare --target-dir=$basedir;
 
--echo # XTRABACKUP INCREMENTAL PREPARE
exec $XTRABACKUP --prepare --target-dir=$basedir --incremental-dir=$incremental_dir;
 
let $targetdir=$basedir;
-- source include/restart_and_restore.inc
--enable_result_log
 
select count(*) from t1;
drop table t2, t1, t3;

Comment by Thirunarayanan Balathandayuthapani [ 2019-01-23 ]

--source include/have_debug.inc
 
call mtr.add_suppression("InnoDB: New log files created");
 
let $basedir=$MYSQLTEST_VARDIR/tmp/backup;
let $incremental_dir=$MYSQLTEST_VARDIR/tmp/backup_inc1;
let $backuplog=$MYSQLTEST_VARDIR/tmp/backup.log;
 
CREATE TABLE t1(i INT) ENGINE INNODB;
CREATE TABLE t2(i INT PRIMARY KEY) ENGINE INNODB;
CREATE TABLE t3(i INT) ENGINE INNODB;
 
--echo # FULL XTRABACKUP
 
--disable_result_log
exec $XTRABACKUP --defaults-file=$MYSQLTEST_VARDIR/my.cnf  --backup --target-dir=$basedir;
--enable_result_log
 
create table t4(f1 int not null, f2 int not null)engine=innodb;
insert into t4 values(1, 2), (2, 2), (3, 3), (5, 5), (6, 6), (4, 4), (9, 9);
insert into t4 select * from t4;
insert into t4 select * from t4;
insert into t4 select * from t4;
insert into t4 select * from t4;
insert into t4 select * from t4;
insert into t4 select * from t4;
insert into t4 select * from t4;
insert into t4 select * from t4;
insert into t4 select * from t4;
insert into t4 select * from t4;
rename table t4 to t7;
select count(*) from t7;
 
--echo # XTRABACKUP INCREMENTAL
exec $XTRABACKUP --defaults-file=$MYSQLTEST_VARDIR/my.cnf  --backup --target-dir=$incremental_dir --incremental-basedir=$basedir
 
--echo # XTRABACKUP PREPARE
exec $XTRABACKUP --apply-log-only --prepare --target-dir=$basedir;
 
--echo # XTRABACKUP INCREMENTAL PREPARE
--error 0, 1, 134
exec $XTRABACKUP --prepare --target-dir=$basedir --incremental-dir=$incremental_dir --dbug="+d,ib_log" 2> $backuplog;
 
let $targetdir=$basedir;
-- source include/restart_and_restore.inc
--enable_result_log
 
select count(*) from t7;
drop table t1, t2, t7, t3;

The above is the another test case to repeat the failure. There is no need for partition or so.

INCREMENTAL BACKUP:
===================
xtrabackup tries to copy delta page of t7 tablespace when all pages are in the buffer pool. So lsn is zero in the file.
It stores nothing in .delta file. But t7 tablespace has 12 pages in buffer pool.

At the end, incremental backup contains t7.delta which has nothing and t7.meta which has page_size, space id.

INCREMENTAL PREPARE PHASE:
===========================

Mariabackup tries to apply delta for all tablespace. When it encounters t7 tablespace. Target directory doesn't have t7.
So it creates new tablespace as t7 with space id stored in .meta file. It sets the file size to 4. (Default file size).

redo log for MLOG_FILE_NAME will fail to load the tablespace to memory for t4 tablespace. Because fil_ibd_discovery doesn't find any t4 tablespace.

redo log for FSP_SIZE will try to search the tablespace for the respective id. Since tablespace is not loaded yet for MLOG_FILE_NAME. It will fail to set.

In applying redo log phase, redo log for the tablespace t7 and its respective pages. (12 pages in our case). So it tries to fetch the pages at all once to
apply the redo log. Since mariabackup prepare creates only 4 pages. It leads to assert while accessing the 5th page.

Fix:
====
.meta file should contain file size info. Instead of setting the default file size, mariabackup prepare should create respective file size present in meta file.

Comment by Thirunarayanan Balathandayuthapani [ 2019-01-24 ]

Patch is in bb-10.2-MDEV-18194 https://github.com/MariaDB/server/commit/376f3a2ea6347397b6be9378042b8fd871eb8ea7

Comment by Marko Mäkelä [ 2019-01-28 ]

The fix looks OK to me. With the additional patch that effectively disables MDEV-12103, it passed the additional testing by elenst.

Alas, I believe that we must first fix the corner case of crash recovery that was broken in MDEV-12103. Without fixing that, this fix could make matters worse.

The issue that MDEV-12103 caused is: If multiple redo log checkpoints ‘overlap’ each other, then crash recovery would fail to start parsing the log records near the first available MLOG_CHECKPOINT record, and start from the last record instead. Due to this, some recently created, renamed or deleted tablespaces could be unavailable during redo log processing in crash recovery.

Generated at Thu Feb 08 08:42:13 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.