Details
- Bug
- Status: Closed
- Blocker
- Resolution: Fixed
- 10.6.12, 10.6.13, 10.6.14, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL)
Description
This was found while trying to reproduce an older issue, MDEV-30531.
Some InnoDB B-tree cursor refactoring in MDEV-30400 turns out to be unsafe, resulting in InnoDB tables being flagged as corrupted. This also occurs on PRIMARY KEY indexes (clustered indexes), not only on secondary index pages.
The root cause seems to be that some operations are accessing the buffer page frame contents while only holding a buffer-fix on the page, not a page latch. It could be the case that the page is being read into the buffer pool, or it is being decrypted or decompressed. In some core dumps of such failures (with additional instrumentation to essentially revert MDEV-13542), the corruption condition would no longer hold.
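The difference between a buffer-fix and a page latch can be illustrated with a small, single-threaded toy model. This is only a sketch and not InnoDB code: `ToyLatch`, `ToyBlock`, `begin_read()`, `complete_read()`, `read_level_fix_only()`, `read_level_latched()` and `kUnreadLevel` are all hypothetical names, and the model assumes that a page being read in is exclusively latched until the read completes. A buffer-fix only pins the block, so it succeeds even while the frame is still undefined; taking the latch is what makes the access wait for the read.

```cpp
#include <atomic>
#include <cassert>

// Toy latch: -1 means exclusively held, n >= 0 means n shared holders.
struct ToyLatch {
  std::atomic<int> state{0};
  bool try_lock_shared() {
    int s = state.load();
    while (s >= 0) {
      if (state.compare_exchange_weak(s, s + 1)) return true;
    }
    return false;  // an exclusive holder (for example, a pending read) exists
  }
  void unlock_shared() { state.fetch_sub(1); }
  bool try_lock_exclusive() {
    int expected = 0;
    return state.compare_exchange_strong(expected, -1);
  }
  void unlock_exclusive() { state.store(0); }
};

constexpr unsigned kUnreadLevel = 0xDEADu;  // stands for "frame not yet valid"

// Hypothetical stand-in for a buffer pool block (not InnoDB's buf_block_t).
struct ToyBlock {
  std::atomic<unsigned> fix_count{0};  // buffer-fix: only prevents eviction
  ToyLatch latch;                      // page latch: protects frame contents
  unsigned level = kUnreadLevel;       // pretend page field inside the frame
};

// Assumption of this model: the read holds the latch exclusively.
void begin_read(ToyBlock& b) {
  bool ok = b.latch.try_lock_exclusive();
  assert(ok);
  (void)ok;
}

void complete_read(ToyBlock& b, unsigned level) {
  b.level = level;  // frame becomes valid (I/O done, checksum validated)
  b.latch.unlock_exclusive();
}

// Unsafe pattern: pin the block, then look at the frame without a latch.
unsigned read_level_fix_only(ToyBlock& b) {
  b.fix_count.fetch_add(1);
  unsigned v = b.level;  // may observe a frame that is not valid yet
  b.fix_count.fetch_sub(1);
  return v;
}

// Safer pattern: pin the block and take a shared latch before reading.
// Returns false if the latch is unavailable (the read is still pending).
bool read_level_latched(ToyBlock& b, unsigned& out) {
  b.fix_count.fetch_add(1);
  bool ok = b.latch.try_lock_shared();
  if (ok) {
    out = b.level;
    b.latch.unlock_shared();
  }
  b.fix_count.fetch_sub(1);
  return ok;
}
```

In this model, `read_level_fix_only()` returns `kUnreadLevel` between `begin_read()` and `complete_read()`, while `read_level_latched()` refuses to read until the read has completed.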
Issue Links
- blocks
  - MDEV-30531 Corrupt index(es) on busy table when using FOREIGN KEY with CASCADE or SET NULL (Closed)
- is caused by
  - MDEV-30400 Assertion `height == btr_page_get_level(page_cur_get_page(page_cursor))' failed in btr_cur_search_to_nth_level on INSERT (Closed)
- relates to
  - MDEV-13542 Crashing on a corrupted page is unhelpful (Closed)
  - MDEV-27058 Buffer page descriptors are too large (Closed)
  - MDEV-32116 Server suddenly crashed (Closed)
  - MDEV-33764 InnoDB: Failing assertion: err == DB_SUCCESS in btr0cur.cc line 4272 (Open)
  - MDEV-33205 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. (Closed)
There could be an even older culprit for this than MDEV-30400. In MDEV-27058, I removed the function buf_wait_for_read(), which would keep acquiring and releasing a page latch as long as the page is read-fixed. This loop would probably prevent many of these issues when a page is only being buffer-fixed.
I do not think that it is ever a good idea to use buffer-fixing for the first-time lookup of a data page in a mini-transaction. If the page was not in the buffer pool and had to be loaded into it, the buffer-fixing could gain access to the page before the read request was completed and the page checksum was validated. Before MDEV-13542, this was not that much of an issue; we would crash on corruption anyway.
What my fix aims to do is to acquire proper page latches upfront. To avoid deadlocks when acquiring page latches in the wrong order (not from left to right), we can safely release a page latch for a short time while waiting for the left sibling page latch. A buffer-fix will prevent the current block from being evicted from the buffer pool.
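The latch-ordering idea in the last paragraph could look roughly like the following sketch. It is not the actual fix: `ToyPage`, `LeftLatch`, `latch_left_sibling()` and `demo_left_latch()` are hypothetical names, and `std::mutex` stands in for a page latch. The caller holds the latch on the current page and wants the left sibling as well. If the left latch cannot be taken without waiting, the current page is pinned with a buffer-fix, its latch is released, and both latches are then acquired from left to right. As an assumption of this sketch, the caller is expected to revalidate the current page afterwards, because its contents may have changed while it was unlatched.

```cpp
#include <atomic>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a buffer pool block holding an index page.
struct ToyPage {
  std::mutex latch;                    // page latch
  std::atomic<unsigned> fix_count{0};  // buffer-fix: prevents eviction only
};

enum class LeftLatch {
  kDirect,     // left latch acquired without waiting
  kReacquired  // current latch was released and re-taken; revalidate
};

// Precondition: the caller holds cur.latch.
// Postcondition: the caller holds both left.latch and cur.latch.
LeftLatch latch_left_sibling(ToyPage& cur, ToyPage& left) {
  if (left.latch.try_lock()) return LeftLatch::kDirect;

  // Waiting for left while holding cur could deadlock with a thread that
  // holds left and waits for cur, so step back and use left-to-right order.
  cur.fix_count.fetch_add(1);  // keep the block from being evicted
  cur.latch.unlock();
  left.latch.lock();
  cur.latch.lock();
  cur.fix_count.fetch_sub(1);
  return LeftLatch::kReacquired;
}

// Small driver: optionally lets another thread hold the left latch until
// it sees that the current page has been pinned by the slow path.
LeftLatch demo_left_latch(bool contended) {
  ToyPage left, cur;
  std::atomic<bool> held{false};
  std::thread other;
  if (contended) {
    other = std::thread([&] {
      std::lock_guard<std::mutex> guard(left.latch);
      held.store(true);
      while (cur.fix_count.load() == 0) std::this_thread::yield();
    });
    while (!held.load()) std::this_thread::yield();
  }
  cur.latch.lock();
  LeftLatch result = latch_left_sibling(cur, left);
  cur.latch.unlock();
  left.latch.unlock();
  if (other.joinable()) other.join();
  return result;
}
```

Calling `demo_left_latch(true)` exercises the slow path, in which the current page is protected only by the buffer-fix for a short time.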