[MDEV-11799] InnoDB can abort if the doublewrite buffer contains a bad and a good copy Created: 2017-01-15 Updated: 2020-08-07 Resolved: 2020-07-31 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 5.5, 10.0, 10.1, 10.2, 10.3, 10.4, 10.5 |
| Fix Version/s: | 10.2.33, 10.3.24, 10.4.14, 10.5.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | crash, recovery |
| Issue Links: |
|
| Sprint: | 10.1.21 |
| Description |
|
This came up when testing
The last quoted line is displayed by the function buf_dblwr_process():
The logic is flawed, because nothing prevents the doublewrite buffer from containing multiple copies of the same page. Suppose that the server is killed during a write to the doublewrite page, then restarted, and killed again during a write to the data page, so that the doublewrite buffer still contains the corrupted copy from the first kill. Recovery would then unnecessarily abort the server process, making the database inaccessible. |
| Comments |
| Comment by Marko Mäkelä [ 2017-01-15 ] |
|
Pushed to 10.1.21. I think that we should do it in 10.0 as well, so I am not closing this yet. |
| Comment by Marko Mäkelä [ 2017-01-16 ] |
|
I think that a slight improvement to the logic is needed. Currently, buf_dblwr_process() chooses the first valid-looking copy of the page that it encounters, even though there could be newer copies (with a later LSN). Restoring a copy of the page that is too old may cause information loss and corruption of the database. buf_dblwr_process() should restore a valid copy of the page with the latest LSN, similar to how fil_user_tablespace_restore_page() does it. |
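
For illustration only, the selection rule proposed in this comment could be sketched as follows. This is not the InnoDB code: PageCopy, its fields, and find_latest_valid_copy() are hypothetical names, and the checksum validation is assumed to have been done elsewhere and recorded in a flag. The sketch skips corrupted copies instead of treating them as fatal, and among the valid copies of a page it picks the one with the highest LSN.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for one page image found in the doublewrite buffer.
struct PageCopy {
    std::uint32_t space_id;  // tablespace identifier
    std::uint32_t page_no;   // page number within the tablespace
    std::uint64_t lsn;       // log sequence number stamped on the page image
    bool checksum_ok;        // assumed result of a checksum validation (not shown)
};

// Return the index of the valid copy of (space_id, page_no) with the highest
// LSN, or std::nullopt if the buffer holds no valid copy of that page.
// Corrupted copies are ignored rather than aborting the scan.
std::optional<std::size_t> find_latest_valid_copy(
    const std::vector<PageCopy>& copies,
    std::uint32_t space_id,
    std::uint32_t page_no)
{
    std::optional<std::size_t> best;
    for (std::size_t i = 0; i < copies.size(); ++i) {
        const PageCopy& c = copies[i];
        if (c.space_id != space_id || c.page_no != page_no) {
            continue;  // a different page
        }
        if (!c.checksum_ok) {
            continue;  // bad copy: skip it, a good copy may still exist
        }
        if (!best || c.lsn > copies[*best].lsn) {
            best = i;  // newest valid copy seen so far
        }
    }
    return best;
}
```

With this rule, a buffer that holds one bad and one good copy of the same page yields the good copy, and an older valid copy never wins over a newer valid one.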
| Comment by Marko Mäkelä [ 2017-01-16 ] |
|
While investigating this further, I noticed a bug in recv_dblwr_t::find_page() which is used by fil_user_tablespace_restore_page() for restoring the first page of each tablespace, if that page is corrupted. The fix of this issue in 10.1.21 was incomplete, and more work will be needed. |
| Comment by Thirunarayanan Balathandayuthapani [ 2020-07-29 ] |
|
Patch is in bb-10.2- |
| Comment by Marko Mäkelä [ 2020-07-29 ] |
|
Thank you, the high-level idea of the fix looks good. I spotted some further problems and potential for improvement. This fix must be stress-tested with RQG (killing and restarting the server during a DML-only workload, to avoid hitting DDL-related problems that would hopefully some day be addressed by I do not think that this can be meaningfully tested with MTR. Only for the |
| Comment by Marko Mäkelä [ 2020-07-31 ] |
|
I merged this up to 10.4 and tested 10.4 with the |