Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Duplicate
-
10.5.2
-
None
Description
extra/mariabackup/mariabackup --host=127.0.0.1 --user=root --password='' --innodb-use-native-aio=0 --port=24704 --prepare --target-dir=/dev/shm/vardir/1585648469/6/1_clone/data
|
fails with
|
...
|
2020-03-31 9:56:36 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=7111846
|
2020-03-31 9:56:36 0 [Note] InnoDB: Starting final batch to recover 363 pages from redo log.
|
2020-03-31 9:56:36 0 [ERROR] InnoDB: Not applying INSERT_REUSE_DYNAMIC due to corruption on [page id: space=6, page number=677]
|
2020-03-31 9:56:36 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
|
2020-03-31 9:56:36 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1574] with error Data structure corruption
|
|
Workflow
|
1. Create DB server and start it
|
2. One session starts to run some SQL stream (DML and DDL) against this server
|
3. Concurrent to the ongoing 2.
|
extra/mariabackup/mariabackup --host=127.0.0.1 --user=root --password='' --innodb-use-native-aio=0 --port=24700 --backup --datadir=/dev/shm/vardir/1585648469/6/1/data --target-dir=/dev/shm/vardir/1585648469/6/1_clone/data
|
gets executed.
|
4. Also concurrent to 2. but not touching the server started in 1. some
|
extra/mariabackup/mariabackup --host=127.0.0.1 --user=root --password='' --innodb-use-native-aio=0 --port=24700 --prepare --target-dir=/dev/shm/vardir/1585648469/6/1_clone/data
|
gets executed
|
|
The uploaded RQG protocol 000022.log shows exact that RQG run.
|
|
RQG
|
git clone https://github.com/mleich1/rqg --branch experimental RQG_mleich1
|
origin/experimental d417497b476e263428422b4640cf63b8e9d16afe 2020-03-30T17:17:54+02:00
|
|
MariaDB
|
origin/10.5 718f18599a9bcf1e7c2d3f18416fca4f7124d00d 2020-03-26T16:05:25+02:00
|
build with debug
|
|
cd <top directory of tree with RQG>
|
perl rqg.pl \
|
--duration=300 \
|
--queries=10000000 \
|
--threads=1 \
|
--no_mask \
|
--seed=random \
|
--gendata=conf/mariadb/table_stress.zz \
|
--gendata_sql=conf/mariadb/table_stress.sql \
|
--rpl_mode=none \
|
--engine=InnoDB \
|
--mysqld=--log_bin_trust_function_creators=1 \
|
--mysqld=--lock-wait-timeout=86400 \
|
--mysqld=--loose-idle_write_transaction_timeout=0 \
|
--mysqld=--interactive_timeout=28800 \
|
--mysqld=--innodb-lock-wait-timeout=50 \
|
--mysqld=--loose-debug_assert_on_not_freed_memory=0 \
|
--mysqld=--slave_net_timeout=60 \
|
--mysqld=--wait_timeout=28800 \
|
--mysqld=--loose_innodb_use_native_aio=1 \
|
--mysqld=--loose-table_lock_wait_timeout=50 \
|
--mysqld=--net_write_timeout=60 \
|
--mysqld=--log-bin \
|
--mysqld=--innodb_stats_persistent=off \
|
--mysqld=--log-output=none \
|
--mysqld=--loose-idle_readonly_transaction_timeout=0 \
|
--mysqld=--innodb_page_size=4K \
|
--mysqld=--connect_timeout=60 \
|
--mysqld=--loose_innodb_lock_schedule_algorithm=fcfs \
|
--mysqld=--net_read_timeout=30 \
|
--mysqld=--loose-idle_transaction_timeout=0 \
|
--reporters=Deadlock1,ErrorLog,Mariabackup,None,ServerDead \
|
--validators=None \
|
--grammar=c00000.yy \
|
--workdir=/home/mleich/RQG_mleich/storage/1585648469/6 \
|
--vardir=/dev/shm/vardir/1585648469/6 \
|
--mtr-build-thread=735 \
|
--basedir1=<path tp the 10.5 binaries> \
|
--script_debug=_nix_ \
|
--rr
|
Attachments
Issue Links
- duplicates
-
MDEV-22139 fseg_free_page_low() fails to write FREE_PAGE record, breaking recovery
-
- Closed
-
- is caused by
-
MDEV-15528 Avoid writing freed InnoDB pages
-
- Closed
-
- relates to
-
MDEV-22097 Not applying DELETE_ROW_FORMAT_REDUNDANT due to corruption
-
- Closed
-
-
MDEV-14992 BACKUP: in-server backup
-
- Open
-
I am rather convinced that this bug occurs because mariabackup --backup is for some reason not observing page flushes.
An invocation of
mariadb-backup --dbug=d,ib_log --prepare --target-dir data
indicates that for the problematic page 6:677, there will only be one record since the checkpoint:
mariadb-backup --dbug=d,ib_log --prepare --target-dir data
2020-03-31 15:43:28 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=6941911
…
recv_group_scan_log_recs: ib_log: scan 6941999: rec 20 len 59 page 6:677
…
?func: ib_log: Applying log to page 6:677
?func: ib_log: apply 6941999: 6:677
2020-03-31 15:43:28 0 [ERROR] InnoDB: Not applying INSERT_REUSE_DYNAMIC due to corruption on [page id: space=6, page number=677]
2020-03-31 15:43:28 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
In the page, we have FIL_PAGE_LSN=0x688b84=6851460. That is older than the checkpoint.
In rr replay of the server that was running concurrently with the backup, we have the following:
when 707133
Thread 20 hit Hardware watchpoint 2: -location $2.page.oldest_modification
Old value = 6849351
New value = 0
buf_flush_remove (bpage=0x7fab7414cd00)
at /home/mleich/10.5/storage/innobase/buf/buf0flu.cc:541
541 buf_flush_validate_skip();
1: log_sys.lsn = {<std::__atomic_base<unsigned long>> = {
static _S_alignment = 8, _M_i = 6865214}, <No data fields>}
(rr) bt
#0 buf_flush_remove (bpage=0x7fab7414cd00)
at /home/mleich/10.5/storage/innobase/buf/buf0flu.cc:541
#1 0x0000564f14b52d5d in buf_flush_write_complete (bpage=0x7fab7414cd00,
dblwr=false) at /home/mleich/10.5/storage/innobase/buf/buf0flu.cc:625
#2 0x0000564f14b3a5dd in buf_page_io_complete (bpage=0x7fab7414cd00,
dblwr=false, evict=false)
at /home/mleich/10.5/storage/innobase/buf/buf0buf.cc:5078
#3 0x0000564f14bd7645 in fil_aio_callback (data=0x7fab2bffec40)
at /home/mleich/10.5/storage/innobase/fil/fil0fil.cc:4168
#4 0x0000564f14924b62 in io_callback (cb=0x564f17c21f20)
at /home/mleich/10.5/storage/innobase/os/os0file.cc:3900
#5 0x0000564f14c7519f in tpool::simulated_aio::simulated_aio_callback (
param=0x564f17c21f20) at /home/mleich/10.5/tpool/aio_simulated.cc:162
We have rather clear evidence that mariabackup --backup seems to be broken. The stricter consistency checks implemented in
MDEV-12353andMDEV-21724could catch that better than before.There also seems to be a bug in mariabackup --prepare: it should have aborted at the first ERROR, but it did not.
Unfortunately, there was no rr record of the mariabackup --backup run, so I cannot analyze this failure further. I think that I have proven that either the bug is somewhere deep inside the InnoDB page flushing, in the operating system or file system, or in the inter-process communication via the file system. Sadly, no effort to implement server-side backup (MDEV-14992) was done yet.