[MDEV-19813] Aria crash recovery failures
Created: 2019-06-20  Updated: 2023-11-28

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - Aria
Affects Version/s: 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10
Fix Version/s: 10.4, 10.5, 10.6

Type: Bug
Priority: Major
Reporter: Elena Stepanova
Assignee: Elena Stepanova
Resolution: Unresolved
Votes: 1
Labels: None

Issue Links:
PartOf
includes MDEV-20578 Got error 126 when executing undo und... Closed
Sub-Tasks:
Key         Summary                                   Type            Status   Assignee
MDEV-18310  Aria engine: Undo phase failed with "...  Technical task  Closed   Vladislav Lesin
MDEV-18187  Aria engine: Redo phase failed with "...  Technical task  Closed   Michael Widenius
MDEV-17912  Aria with encryption fails upon crash...  Technical task  Stalled  Michael Widenius
MDEV-19576  Aria crash recovery fails with error ...  Technical task  Open     Michael Widenius
MDEV-19718  Assertion `rownr == 0 && new_page' fa...  Technical task  Closed   Elena Stepanova
MDEV-18461  Aria crash recovery failures on the s...  Technical task  Stalled  Michael Widenius
MDEV-18203  Aria engine: Undo phase failed with "...  Technical task  Closed   Michael Widenius
MDEV-19980  Aria crash recovery fails with "Got e...  Technical task  Open     Michael Widenius
MDEV-20132  Assertion `info->new_row.checksum == ...  Technical task  Open     Michael Widenius

Description

We have a number of bug reports related to Aria recovery problems, each showing a different manifestation. After Monty's initial analysis of some of them, it appears they have a lot in common. First of all, even when the data directory on which the recovery issue can be reproduced is available, it is not sufficient for fixing the issue; a complete test case causing the initial corruption is needed. These test cases are concurrent and non-deterministic by nature, and quite often just re-running the same test hits a different manifestation of the recovery problem. Thus, I think it makes sense to group all these issues together: one fix is likely to fix several bugs, and while working on one bug, developers and testers are likely to have to deal with the others.

Actual bug reports are to be made subtasks of this one. They will be handled and closed as normal bug reports. The umbrella report will stay open until there are no open subtasks left.

Examples of observed recovery issues from the subtasks:

MDEV-18310

Got error 121 when executing undo undo_key_delete

MDEV-18203

Got error 126 when executing undo undo_key_insert

MDEV-20578

Got error 126 when executing undo undo_key_delete

MDEV-18187

2019-01-09 16:00:40 0 [ERROR] mysqld: failed to decrypt './test/t7'  rc: -1  dstlen: 0  size: 4294967275
Got error 192 when executing record redo_index_new_page

MDEV-17912

2018-12-05 18:38:33 0 [ERROR] mysqld: failed to decrypt './test/oltp46'  rc: -1  dstlen: 0  size: 8172
Got error 192 when executing record redo_new_row_head

MDEV-19576

Got error 175 when executing record redo_index

MDEV-19576

Got error 175 when executing undo undo_row_insert

MDEV-19718

mysqld: /data/src/10.3/storage/maria/ma_blockrec.c:6358: _ma_apply_redo_insert_row_head_or_tail: Assertion `rownr == 0 && new_page' failed.

MDEV-18461

mysqld: /data/src/10.4/storage/maria/ma_loghandler.c:3862: translog_init_with_table: Assertion `sure_page <= last_page' failed.

MDEV-18461

mysqld: /home/travis/src/storage/maria/ma_blockrec.c:2879: write_block_record: Assertion `undo_lsn == ((LSN)1) || head_length == row_pos->length' failed.

MDEV-18461

Got error 176 when executing record redo_insert_row_head

MDEV-20132

Assertion `info->new_row.checksum == (*share->calc_checksum)(info, current_record)' failed
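
When triaging a new report against this umbrella, it may help to group error-log lines by recovery phase, operation and error code, since the same test run can hit several of the messages above. The following is a minimal sketch in Python, assuming only the two message shapes quoted in the examples ("Got error N when executing undo X" and "Got error N when executing record X"). The function name and the regular expression are illustrative and are not part of MariaDB or its test tooling.

```python
import re
from collections import defaultdict

# Assumption: messages follow the two shapes quoted in the examples above.
PATTERN = re.compile(r"Got error (\d+) when executing (undo|record) (\w+)")

def group_recovery_errors(lines):
    """Count matching lines per (kind, operation, error code)."""
    groups = defaultdict(int)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            code, kind, operation = match.groups()
            groups[(kind, operation, int(code))] += 1
    return dict(groups)

# Sample input copied from the messages quoted in this report.
sample = [
    "Got error 121 when executing undo undo_key_delete",
    "Got error 126 when executing undo undo_key_insert",
    "Got error 192 when executing record redo_index_new_page",
    "Got error 126 when executing undo undo_key_delete",
]

for key, count in sorted(group_recovery_errors(sample).items()):
    print(key, count)
```

Lines that do not match, such as the assertion failures, are ignored by this sketch and would need a separate pattern.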



Comments
Comment by Elena Stepanova [ 2019-06-20 ]

MDEV-18203 now has an MTR test case (non-deterministic).

Comment by Marko Mäkelä [ 2022-09-06 ]

I believe that every release of MariaDB Server is affected by this. Basically, any test that kills the server may fail due to a low-probability failure of Aria recovery. Here is a recent example:

10.5 c0470caf5a80e69ad7d855a871c62cf72dc03b05

main.grant_kill                          w14 [ fail ]
        Test ended at 2022-09-06 08:42:12
CURRENT_TEST: main.grant_kill
Failed to start mysqld.1
mysqltest failed but provided no output
 - saving '/home/buildbot/amd64-fedora-35/build/mysql-test/var/14/log/main.grant_kill/' to '/home/buildbot/amd64-fedora-35/build/mysql-test/var/log/main.grant_kill/'
Retrying test main.grant_kill, attempt(2/3)...
worker[14] > Restart  - not started
***Warnings generated in error logs during shutdown after running tests: main.grant_kill
2022-09-06  8:42:12 0 [ERROR] mariadbd: Aria recovery failed. Please run aria_chk -r on all Aria tables (*.MAI) and delete all aria_log.######## files
2022-09-06  8:42:12 0 [ERROR] Plugin 'Aria' init function returned error.
2022-09-06  8:42:12 0 [ERROR] Plugin 'Aria' registration as a STORAGE ENGINE failed.
2022-09-06  8:42:12 0 [ERROR] Could not open mysql.plugin table: "Unknown storage engine 'Aria'". Some plugins may be not loaded
2022-09-06  8:42:12 0 [ERROR] Failed to initialize plugins.
2022-09-06  8:42:12 0 [ERROR] Aborting
main.func_like                           w9 [ pass ]     16
main.default_storage_engine              w5 [ pass ]    472
main.grant_kill                          w14 [ retry-pass ]      9

It may make sense to produce rr replay traces of some of the failures (covering both the intentionally killed server and the failed recovery) so that they can be reliably debugged. For InnoDB recovery failures, having only a copy of the data directory that fails to start up is only half of the story. More often than not, the actual problem resides on the ‘write’ side, and recovery only sees some corrupted input files.
