[MDEV-19813] Aria crash recovery failures Created: 2019-06-20 Updated: 2023-11-28 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - Aria |
| Affects Version/s: | 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10 |
| Fix Version/s: | 10.4, 10.5, 10.6 |
| Type: | Bug | Priority: | Major |
| Reporter: | Elena Stepanova | Assignee: | Elena Stepanova |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | None |
| Issue Links: |
|
| Sub-Tasks: |
|
| Description |
|
We have a number of bug reports related to Aria recovery problems, which manifest in different ways. After a first analysis performed by Monty on some of them, it appears they have a lot in common. Above all, even when the data directory on which the recovery issue can be reproduced is available, it is not sufficient for fixing the issue; a complete test case that causes the initial corruption is needed. These test cases are concurrent and non-deterministic by nature, and quite often just re-running the same test hits different manifestations of the recovery problem.

Thus, I think it makes sense to group all these issues together: one fix is likely to resolve several bugs, and developers/testers working on one bug are likely to have to deal with the others.

Actual bug reports are to be made subtasks of this one. They will be handled and closed as normal bug reports. The umbrella report will stay open until there are no open subtasks left.

Examples of observed recovery issues from the subtasks:
|
| Comments |
| Comment by Elena Stepanova [ 2019-06-20 ] |
|
| Comment by Marko Mäkelä [ 2022-09-06 ] |
|
I believe that every release of MariaDB Server is affected by this. Basically, any test that kills the server may fail due to a low-probability failure of Aria recovery. Here is a recent example:
It may make sense to produce rr replay traces of some of the failures (covering both the intentionally killed server and the failed recovery) so that they can be reliably debugged. As with InnoDB recovery failures, having only a copy of the data directory that fails to start up is only half of the story. Usually, the actual problem resides on the ‘write’ side, and recovery only sees some corrupted input files. |