Aria crash recovery failures (MDEV-19813)

[MDEV-18187] Aria engine: Redo phase failed with "error 192 when executing record redo_index_new_page" upon startup on a restored datadir Created: 2019-01-09  Updated: 2023-06-08  Resolved: 2023-06-08

Status: Closed
Project: MariaDB Server
Component/s: mariabackup, Storage Engine - Aria
Affects Version/s: 10.4
Fix Version/s: 10.4.23, 10.5.14, 10.6.6

Type: Technical task Priority: Major
Reporter: Elena Stepanova Assignee: Michael Widenius
Resolution: Fixed Votes: 0
Labels: None

Attachments: File mbackup_backup.log     File mbackup_prepare.log     File mbackup_restore.log     File mysql.err     File mysql.err_orig    
Issue Links:
Problem/Incident
is caused by MDEV-17913 Encrypted transactional Aria tables r... Closed

 Description   

10.4 30da40bb8c

2019-01-09 16:00:40 0 [Note] mysqld: Aria engine: starting recovery
recovered pages: 0%
2019-01-09 16:00:40 0 [ERROR] mysqld: failed to decrypt './test/t7'  rc: -1  dstlen: 0  size: 4294967275
 
Got error 192 when executing record redo_index_new_page
2019-01-09 16:00:40 0 [ERROR] mysqld: Aria engine: Redo phase failed
tables to flush: 9 8 7 6 5 4 3 2 1 0
2019-01-09 16:00:40 0 [ERROR] mysqld: Aria recovery failed. Please run aria_chk -r on all Aria tables and delete all aria_log.######## files
2019-01-09 16:00:40 0 [ERROR] Plugin 'Aria' init function returned error.
2019-01-09 16:00:40 0 [ERROR] Plugin 'Aria' registration as a STORAGE ENGINE failed.

The backup dir (stored after prepare but before any server startup was attempted) is here:
ftp://ftp.askmonty.org/public/mdev18187-backup.tar.gz

To reproduce the failure, restore it to the datadir and start the server with

--innodb-encrypt-tables --innodb-encrypt-log --innodb-encryption-threads=4 --aria-encrypt-tables=1 --encrypt-tmp-
disk-tables=1 --plugin-load-add=file_key_management --file-key-management --file-key-management-filename=`pwd`/mysql-test/std_data/keys.txt

(modify file-key-management-filename value as needed).


Notes on the origin of the report. Please be aware it's not necessarily reproducible the same way.
https://travis-ci.org/elenst/travis-tests/jobs/477279304

The problem initially happened by running the concurrent test with MariaBackupFull scenario:

elenst-jira-refs 5342873c9, mariadb-toolbox 6ded4b71568

perl ./runall-new.pl --basedir=/home/travis/server --vardir=/home/travis/logs/vardir --duration=350 --threads=4 --seed=1547032472 --reporters=Backtrace,ErrorLog,Deadlock --skip-gendata --gendata-advanced --views --grammar=conf/mariadb/generic-dml.yy --redefine=conf/mariadb/bulk_insert.yy --mysqld=--log_output=FILE --mysqld=--max-statement-time=20 --mysqld=--lock-wait-timeout=10 --mysqld=--loose-innodb-lock-wait-timeout=5 --mysqld=--loose-debug_assert_on_not_freed_memory=0 --filter=/home/travis/mariadb-toolbox/travis/10.4-combo-filter.ff --mysqld=--innodb-encrypt-tables --mysqld=--innodb-encrypt-log --mysqld=--innodb-encryption-threads=4 --mysqld=--aria-encrypt-tables=1 --mysqld=--encrypt-tmp-disk-tables=1 --mysqld=--file-key-management --mysqld=--file-key-management-filename=/home/travis/mariadb-toolbox/data/keys.txt --mysqld=--plugin-load-add=file_key_management --engine=Aria --scenario=MariaBackupFull

The test performs the following steps:

  • start the server with

    --no-defaults --basedir=/home/travis/server --lc-messages-dir=/home/travis/server/share --character-sets-dir=/home/travis/server/share/charsets --tmpdir=/home/travis/logs/vardir/tmp --core-file --datadir=/home/travis/logs/vardir/data --max-allowed-packet=128Mb --port=19300 --socket=/home/travis/logs/vardir/mysql.sock --pid-file=/home/travis/logs/vardir/mysql.pid --general-log --general-log-file=/home/travis/logs/vardir/mysql.log --log_output=FILE --max-statement-time=20 --lock-wait-timeout=10 --loose-innodb-lock-wait-timeout=5 --loose-debug_assert_on_not_freed_memory=0 --innodb-encrypt-tables --innodb-encrypt-log --innodb-encryption-threads=4 --aria-encrypt-tables=1 --encrypt-tmp-disk-tables=1 --file-key-management --file-key-management-filename=/home/travis/mariadb-toolbox/data/keys.txt --plugin-load-add=file_key_management
    

  • run concurrent DML/DDL flow on the server;
  • after ~100 seconds, take the full backup from the running server (succeeded, the log attached as mbackup_backup.log;
  • after another ~100 seconds, shut down the server gracefully (succeeded, the error log is attached as mysql.err_orig;
  • run mariabackup --prepare (succeded, the log attached as mbackup_prepare.log;
  • cleanup server datadir and restore the backup to it (succeeded, the log attached as mbackup_restore.log;
  • try to start the server with the same options as before.

I have no information yet whether the problem happens on earlier versions.



 Comments   
Comment by Alexander [ 2021-08-09 ]

I made a fix, please review:
https://github.com/MariaDB/server/pull/1888

Comment by Michael Widenius [ 2021-12-15 ]

This issue was resolved a long time ago in
81258f14323e1d1ad0203bae93bc55a30d47c1b3

The issue was that the encryption code wrote wrong data on page headers, which made it
impossible for recovery to decrypt the data.
Even after the fix the data in this bug report will not work as the data is already corrupted.

Comment by Michael Widenius [ 2021-12-15 ]

New test case shows another issue

Comment by Michael Widenius [ 2021-12-15 ]

The fix in https://github.com/MariaDB/server/pull/1888 is slight wrong.
The issue is that when recovery is about to create a new data or index page, it check if the page already exits.
If the page does not exists (file is too short) or contains wrong checksum, then the recovery code will recreate the old page.
The bug was that the code that checked if the page existed didn't take into account encrypted pages.
Adding a check if page could not be encrypted solved the issue.

Comment by Michael Widenius [ 2021-12-15 ]

Fix pushed to bb-10.4-monty for testing

Comment by Elena Stepanova [ 2023-06-08 ]

There is a patch for this issue in 10.4+, so closing it.

Generated at Thu Feb 08 08:42:09 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.