Details
-
Bug
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Fixed
-
10.6.16, 10.10.7, 10.11.6, 11.0.4, 10.6.17, 10.11.7, 11.1.3, 11.2.2
Description
While the testing of MDEV-33588, mleich encountered an rr replay trace where the purge_coordinator_task is trying to parse arbitrary data in the middle of an undo log record. I narrowed it down to the following code in purge_sys_t::choose_next_log():
const trx_undo_rec_t *undo_rec= |
trx_undo_page_get_first_rec(b, hdr_page_no, hdr_offset);
|
Here, b is undo log page 13, and hdr_page_no and hdr_offset are related to that. I think that we would benefit from some testing with the following patch to catch the root cause of the corruption sooner:
@@ -134,8 +134,9 @@ trx_undo_page_get_first_rec(const buf_block_t *block, uint32_t page_no,
|
uint16_t offset)
|
{
|
uint16_t start= trx_undo_page_get_start(block, page_no, offset);
|
- return start == trx_undo_page_get_end(block, page_no, offset)
|
- ? nullptr : block->page.frame + start;
|
+ uint16_t end= trx_undo_page_get_end(block, page_no, offset);
|
+ ut_ad(start <= end);
|
+ return start >= end ? nullptr : block->page.frame + start;
|
}
|
|
/** Get the last undo log record on a page. |
In the trace that I analyzed, we would be reading an invalid start value that points outside the page frame.
Attachments
Issue Links
- is caused by
-
MDEV-32050 UNDO logs still growing for write-intensive workloads
-
- Closed
-
- relates to
-
MDEV-33208 InnoDB: Trying to access update undo rec field 50 in index `PRIMARY` of table NNN but index has only 14 fields
-
- Closed
-
-
MDEV-33588 buf::Block_hint is a performance hog
-
- Closed
-
ssh pluto
rr replay /data/results/1711541897/TBR-2112/1/rr/latest-trace/
At when 343065 the undo log header in page 13 was initialized in trx_undo_header_create() inside trx_undo_reuse_cached(). Slightly after that, at the same event counter, there is a call to buf_page_create() to create the undo log page 14. At when 348171 this undo log page 14 will be evicted from the buffer pool. At when 348202 it is being loaded back intact, with FIL_PAGE_LSN being 0x029c5eb8. The assertion fails, because trx_undo_page_get_start() is reading the value 0x2a98 from undo page 13. The bytes around that had been written by the following:
undo_page->page.frame, buf, 4);
The contents of buf is 0, 1, 0x2a, 0x98, so TRX_UNDO_LOG_START is really intended to contain the value 0x2a98, which is later being read. I think that I must check what happens to this undo log page 13 between the time it was written and the parsing fails. The analysis is complicated by the fact that also this page is being evicted from the buffer pool and reloaded again.