[MDEV-29374] Frequent "Data structure corruption" in InnoDB after OOM Created: 2022-08-24 Updated: 2023-03-23 Resolved: 2022-09-01 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.6.9, 10.7.5, 10.8.4, 10.9.2, 10.10.1 |
| Fix Version/s: | 10.6.10, 10.7.6, 10.8.5, 10.9.3, 10.10.2 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Brad | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7 |
||
| Issue Links: |
|
||||||||||||||||||||||||||||||||
| Description |
|
My company has around 200 CentOS 7 servers running MariaDB 10.6. Last week, after the 10.6.9 update was applied, we started seeing lots of InnoDB failures after the OOM killer had killed MariaDB. Our systems are fairly light on memory and do hit OOMs sometimes, but that shouldn't cause a failure in InnoDB recovery. After the update to 10.6.9, we started seeing multiple failures per day. I rolled back to 10.6.8 yesterday and have not seen any more issues, so I think something in 10.6.9 is causing the problem. Each time the issue happens, the error log looks like this.
There isn't anything unusual about our config. Here is one of them.
Fortunately, it's very easy to recover from this, but it does take manual intervention, i.e. innodb_force_recovery=1. Any ideas what could be causing this new issue and what we can do to correct it? |
| Comments |
| Comment by Marko Mäkelä [ 2022-08-25 ] | ||||||||||||||||||||||||||||
|
Error handling was refactored in | ||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-08-25 ] | ||||||||||||||||||||||||||||
|
wk_bradp, can you provide a copy of a data directory on which MariaDB Server 10.6.9 fails to start up, and 10.6.8 does start up? | ||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-08-25 ] | ||||||||||||||||||||||||||||
|
| ||||||||||||||||||||||||||||
| Comment by Brad [ 2022-08-25 ] | ||||||||||||||||||||||||||||
|
Thanks for looking into this, Marko! I don't think I can provide a copy of the data directory for any of these servers, because it contains customer data. I haven't tried to reproduce this issue in a testing environment yet, but I can say that it started with 10.6.9 and hasn't happened again since the rollback to 10.6.8. I can run some tests to try to reproduce the crash if needed, but it does sound like that crash recovery bug you mentioned might be related. Also, just to be clear on the order of things: I didn't downgrade to fix the data corruption. I used innodb_force_recovery=1 to ignore the problem, and only after that did I downgrade to 10.6.8. In other words, I didn't downgrade to get the server to start up; I downgraded to prevent the corruption from happening again. Let me know if that's not clear. Also, let me know what you need from me and I'll be happy to provide it. | ||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-08-26 ] | ||||||||||||||||||||||||||||
|
wk_bradp, thank you, and sorry for introducing Ironically, the corruption was caused by I think that you’d better stay at 10.6.8 until a 10.6.10 release is available. | ||||||||||||||||||||||||||||
| Comment by Brad [ 2022-08-26 ] | ||||||||||||||||||||||||||||
|
Thanks so much for looking into this, Marko! I'll leave our systems on 10.6.8 until the next patch is released. | ||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2022-08-27 ] | ||||||||||||||||||||||||||||
|
I'll close it as a duplicate then, but please do comment if the next release doesn't fix the issue. | ||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-08-30 ] | ||||||||||||||||||||||||||||
|
I am reopening this, because mleich provided an rr replay trace of something that reproduces exactly this message, even when the fix of What I found out so far:
| ||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-08-31 ] | ||||||||||||||||||||||||||||
|
It turns out that we fail to write a FREE_PAGE record (it can only be written by invoking mtr_t::free()) for the following operation:
Due to the missing record for freeing the page, the recovery will attempt to load the page and fail, as follows.
I must make the failure diagnostics more verbose; it was probably "cleaned up" too much in The root cause of this failure does not look like a recent regression. I think that it will affect 10.5 as well. 10.5 has a compatible log format, and it would crash on this corrupted data directory:
| ||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-08-31 ] | ||||||||||||||||||||||||||||
|
The root cause turns out to be that a write of a FREE_PAGE record is being skipped due to an early return here:
The culprit is an accidentally added extra return err; statement. (Only in case of an error, we should not write the FREE_PAGE record.)
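To make the shape of this bug concrete, here is a minimal sketch of the pattern described above. It is not the InnoDB code: `MiniTxnSketch`, `detach_page_sketch` and `free_page_sketch` are hypothetical names invented for illustration, and the stray `return err;` is modelled by a flag so that both behaviours can be compared side by side.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not InnoDB names.
enum class Err { Success, Corruption };

struct MiniTxnSketch {
  std::vector<std::string> log;  // redo records written by this mini-transaction

  // Plays the role of the call that writes a FREE_PAGE record.
  void free_page(unsigned page_no) {
    log.push_back("FREE_PAGE " + std::to_string(page_no));
  }
};

// Stand-in for the work done before the page can be marked as freed.
Err detach_page_sketch(bool fail) {
  return fail ? Err::Corruption : Err::Success;
}

// with_stray_return == true models the accidentally added "return err;".
Err free_page_sketch(MiniTxnSketch &mtr, unsigned page_no, bool fail,
                     bool with_stray_return) {
  Err err = detach_page_sketch(fail);
  if (err != Err::Success)
    return err;            // intended: skip the FREE_PAGE record only on error
  if (with_stray_return)
    return err;            // bug: returns Success without logging FREE_PAGE
  mtr.free_page(page_no);  // the record that recovery depends on
  return Err::Success;
}
```

With `with_stray_return` set, the function still reports success but leaves the log empty, which is why nothing looks wrong until a crash forces recovery to read the log.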
| ||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-08-31 ] | ||||||||||||||||||||||||||||
|
As part of fixing this, I am going to add some error messages to identify the corrupted page:
| ||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2022-09-08 ] | ||||||||||||||||||||||||||||
|
Summary: a crash shortly after a page merge (can be triggered by an UPDATE or DELETE or, say, rollback of an INSERT) can cause data corruption. |
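The effect on crash recovery can be illustrated with a toy model. This is only a sketch under stated assumptions, not how InnoDB recovery is implemented: `RedoRecSketch` and `recover_sketch` are hypothetical names, and the model assumes that the on-disk image of a freed page cannot be relied on, so recovery succeeds only if the log tells it to skip that page.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical toy model; the names and the log format are assumptions.
enum class RecTypeSketch { Write, FreePage };

struct RedoRecSketch {
  RecTypeSketch type;
  unsigned page_no;
};

// Returns true if recovery completes. valid_on_disk holds the pages whose
// on-disk image can be loaded; a freed page is assumed not to be in it.
bool recover_sketch(const std::vector<RedoRecSketch> &log,
                    const std::set<unsigned> &valid_on_disk) {
  // For each page, remember whether its last record was FREE_PAGE.
  std::map<unsigned, bool> freed;
  for (const RedoRecSketch &rec : log)
    freed[rec.page_no] = (rec.type == RecTypeSketch::FreePage);

  for (const auto &entry : freed) {
    if (entry.second)
      continue;  // page was freed: nothing to load, nothing to apply
    if (valid_on_disk.count(entry.first) == 0)
      return false;  // recovery tries to load the page and fails
  }
  return true;
}
```

In this model, a log of `{Write 5, FreePage 5}` recovers cleanly even when page 5 is unreadable, while the same log without the final record does not.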