[MDEV-30753] Possible corruption due to trx_purge_free_segment() Created: 2023-02-28 Updated: 2023-12-01 Resolved: 2023-03-01 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0 |
| Fix Version/s: | 10.11.3, 11.0.2, 10.5.20, 10.6.13, 10.7.8, 10.8.8, 10.9.6, 10.10.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | corruption |
| Issue Links: |
|
| Description |
|
There is a potential problem if the server is killed while undo log pages are being freed:
The following scenario would seem to be possible:
The function trx_purge_free_segment() is also missing calls to log_free_check(). This means that the redo log can be overrun, and the database might become impossible to recover if the server is killed while the function is executing. There is a hint in the source code as to how this could be fixed:
If we simply call trx_purge_remove_log_hdr() in the first mini-transaction, everything should be safe. Yes, the pages might not be easy to free afterwards, but that is not a problem for those who use multiple innodb_undo_tablespaces and innodb_undo_log_truncate=ON. We could also try to free everything in a single mini-transaction, provided that there is sufficient capacity in the redo log and the buffer pool. |
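The ordering argument above can be illustrated with a small toy model. This is only a sketch and not InnoDB code: `DurableState`, `run_until_crash()`, `dangling()`, `leaked()` and the page count are hypothetical names invented for illustration. It assumes that each mini-transaction is an atomic step that survives a crash once committed, and that the hazard is a history list entry that still refers to an undo log whose pages were already partly freed.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Toy model only; none of these names exist in InnoDB.
constexpr int kPages = 4;  // hypothetical number of pages in the undo segment

// What would be found on disk after crash recovery.
struct DurableState {
    bool in_history_list = true;  // history list still refers to the undo log
    int pages_freed = 0;          // pages of the segment already freed
};

enum class Order { UnlinkFirst, UnlinkLast };

// Each step stands for one committed mini-transaction. The "server" is
// killed after `commits_before_crash` of them have been made durable.
DurableState run_until_crash(Order order, int commits_before_crash) {
    std::vector<std::function<void(DurableState&)>> steps;
    auto unlink = [](DurableState& s) { s.in_history_list = false; };
    auto free_page = [](DurableState& s) { ++s.pages_freed; };

    if (order == Order::UnlinkFirst) steps.push_back(unlink);
    for (int i = 0; i < kPages; ++i) steps.push_back(free_page);
    if (order == Order::UnlinkLast) steps.push_back(unlink);

    DurableState s;
    const int total = static_cast<int>(steps.size());
    for (int i = 0; i < commits_before_crash && i < total; ++i) steps[i](s);
    return s;
}

// Unsafe: the history list refers to a log whose pages were partly freed.
bool dangling(const DurableState& s) {
    return s.in_history_list && s.pages_freed > 0;
}

// Safe but wasteful: pages are unreachable from the history list and not freed.
bool leaked(const DurableState& s) {
    return !s.in_history_list && s.pages_freed < kPages;
}

// Try every possible crash point for the given ordering.
bool any_crash_point_dangles(Order order) {
    for (int k = 0; k <= kPages + 1; ++k)
        if (dangling(run_until_crash(order, k))) return true;
    return false;
}
```

In this model, `any_crash_point_dangles(Order::UnlinkFirst)` is false and `any_crash_point_dangles(Order::UnlinkLast)` is true. A crash after unlinking first can only produce the `leaked()` state, which corresponds to the pages that "might not be easy to free afterwards" and that undo tablespace truncation would reclaim.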
| Comments |
| Comment by Marko Mäkelä [ 2023-04-21 ] |
|
I pushed a follow-up fix to correct an error in buffer pool page restoration after mini-transaction restart. If the buffer pool is tiny and the server is heavily loaded, the buffer page could be replaced between the mtr.commit() and acquiring an exclusive lock on the block. We never reproduced such a failure, but we reproduced a similar error in a development version of |
| Comment by Marko Mäkelä [ 2023-05-10 ] |
|
I looked up the follow-up fix because I was thinking that we might need similar adjustments in order to prevent corruptions like I thought that a buffer-fix does not prevent a buffer block from being evicted or relocated. But, it actually does, because buf_page_t::can_relocate() returns false for buffer-fixed, io-fixed or latched pages. Hence, it looks like the follow-up fix actually added some dead code. I re-analyzed the rr replay trace for a development version of |