[MDEV-32681] Test case innodb.undo_truncate_recover frequent failure Created: 2023-11-04 Updated: 2023-12-05 Resolved: 2023-11-17 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.5 |
| Fix Version/s: | 10.5.24 |
| Type: | Bug | Priority: | Major |
| Reporter: | Kristian Nielsen | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Attachments: | data.tar.xz |
| Issue Links: |
| Description |
|
When it happens, the following appears in mysqld.1.err:

I think the signal 6 might be due to a timeout waiting for the server to shut down (the server is manually restarted in the test case at this point). The server is at this point stuck during shutdown in trx_purge_truncate_history(), trying to create the rsegs for the truncated tablespace, but unable to get a buffer pool page, as per the printout in the error log:

I was able to reproduce this with --rr and currently have the run saved. I see that the code locks all dirty buffer pool pages in the undo tablespace, then moves on to re-creating all rseg header pages. I wonder if it managed to lock all pages in the buffer pool, thus locking itself out from creating new pages. Maybe there should be a check whether the code is trying to lock more than, say, 50% of the buffer pool pages, and if so, explicitly flush some pages, or abort and retry later. Better yet, the pages could be dropped from the buffer pool without flushing, but I am not sure whether that can be done in a crash/recovery-safe way.

In 10.5, the test failure is easy to trigger with 16k pages; I can see that it hangs when it has locked + created around 500 pages (with an 8M buffer pool). It cannot be reproduced with a 16M buffer pool.

In 10.6, the failure no longer occurs. I am not sure whether this is accidental or the code was fixed. In 10.6 the code still locks all modified pages in the to-be-truncated undo tablespace, but the number of pages is smaller, so the code does not reach the ~500 pages locked + created that triggered the hang in 10.5.
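The 50% check suggested above could look roughly like the following sketch. This is not InnoDB code; the function name and the threshold are assumptions for illustration. With the numbers from this report (an 8M buffer pool of 16k pages, i.e. 512 pages, and ~500 pages latched), the check would trip well before the hang.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical heuristic for trx_purge_truncate_history() (sketch only):
// before x-latching yet another dirty page of the undo tablespace being
// truncated, check whether the truncation would pin more than half of
// the buffer pool, which would leave no free blocks for re-creating the
// rseg header pages.
bool should_flush_before_latching(size_t pages_latched,
                                  size_t buf_pool_pages)
{
  // Leave at least half of the buffer pool available for allocation.
  return pages_latched * 2 > buf_pool_pages;
}
```

On tripping the check, the caller could flush some of the latched pages (or abort and retry the truncation later), as proposed above.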
| Comments |
| Comment by Marko Mäkelä [ 2023-11-04 ] |
|
I think that this would be fixed by backporting the fixes of
| Comment by Kristian Nielsen [ 2023-11-04 ] |
|
Maybe we can just add --innodb-buffer-pool-size=16M to undo_truncate_recover.opt to make the failure in buildbot go away? Even if there is a theoretical bug here, it might not occur in practice that all pages in the buffer pool (except for <64) are dirty pages from a single undo tablespace... which seems to be what is triggering the hang in this test failure. marko, what do you think?
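If that workaround is taken, the `.opt` file would contain a single server option line, following the usual mysql-test `.opt` convention:

```
--innodb-buffer-pool-size=16M
```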
| Comment by Marko Mäkelä [ 2023-11-09 ] |
|
I can reproduce this easily. Between the time the following breakpoint was hit:
and the SIGABRT was sent because the shutdown took more than 60 seconds, buf_pool.free.count remained at 0. However, the SIGABRT arrived not long after the message had been sent. I maintain that this thread is not genuinely hung; freeing the blocks is simply taking a long time. On the server restart after the SIGABRT, we got something more to worry about:
I tried starting up MariaDB Server 10.6 with a copy of that data directory data.tar.xz
I will have to analyze and fix this as well. Backporting the fixes from 10.6 should make buf_flush_page_cleaner() guarantee some spare capacity in buf_pool.free.count.
| Comment by Marko Mäkelä [ 2023-11-09 ] |
|
The crash occurs when the 56261st row of the recovered transaction 31 is being rolled back. Initially there were 56386 rows that had to be rolled back. The immediate reason is that trx_undo_truncate_end() had already freed the page 547 in undo tablespace 1 during a previous iteration of the loop:
The function trx_undo_free_page() returned the same page number to trx_undo_free_last_page(), so we would seem to attempt a double-free in a possibly infinite loop. The list TRX_UNDO_SEG_HDR + TRX_UNDO_PAGE_LIST on the header page 68 would appear to be corrupted. Before the server had been killed by the shutdown timeout, this part of the page was last touched by a read into the buffer pool. Similarly, during the failed crash recovery, there was no change to the header page at all before the first call to trx_undo_free_page() for the 56261st row of transaction 31.

I think that this corruption would be easier to debug by changing the test to use a large buffer pool and by adding a forced kill of the server. That way, the page would remain at the same virtual memory address for its entire lifetime in the to-be-killed server process. In any case, the copy of the data directory data.tar.xz
| Comment by Marko Mäkelä [ 2023-11-09 ] |
|
Transaction 31 was actually committed before the server was killed due to the shutdown hang. It had written 60,000 undo log records, corresponding to the following SQL statements:
The LSN of the transaction commit was 20365073. The log was durably written up to that point, and ultimately to 20379464, which is also the recv_sys.recovered_lsn of the recovery. So, this transaction should have been recovered as committed, and no rollback should have been attempted in the first place.
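The recovery invariant this relies on can be stated as a one-line predicate. This is a hedged sketch, not InnoDB's actual implementation: a transaction whose commit record lies at or below the recovered LSN must be treated as committed and must not be rolled back.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the recovery decision (names assumed for illustration):
// roll back only transactions with no durable commit record within the
// recovered portion of the log. commit_lsn == 0 means no commit record
// was found for the transaction.
bool must_roll_back(uint64_t commit_lsn, uint64_t recovered_lsn)
{
  return commit_lsn == 0 || commit_lsn > recovered_lsn;
}
```

With the LSNs above (commit at 20365073, recv_sys.recovered_lsn at 20379464), the predicate says transaction 31 must not be rolled back, which is exactly what the observed rollback violates.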
| Comment by Marko Mäkelä [ 2023-11-09 ] |
|
The undo log page 68 state was changed from TRX_UNDO_ACTIVE to TRX_UNDO_TO_PURGE during the commit of transaction 31:
The FIL_PAGE_LSN of this page at the time the SIGABRT is received is 20379428, and the oldest_modification is 0, that is, any changes had supposedly been written back. But on recovery, we read an older version of the page, with the FIL_PAGE_LSN being only 19400054. This looks like a "write side" error. The page had been wrongly marked as clean here:
At this point, log_sys.flushed_to_disk_lsn as well as log_sys.lsn had already reached 20379464, the value at the time of the SIGABRT. However, we had not yet durably written any log record for truncating the undo tablespace. The error occurs when the log checkpoint is advanced from 19417270 to 20379452 by the buf_flush_page_cleaner() thread a little later. At that point, our undo tablespace truncation thread was busy allocating buffer pool pages:
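The invariant being violated here can be sketched as a checkpoint-safety predicate. This is an illustration under assumed names, not InnoDB's API: the checkpoint may only advance past a page's modifications once the page has really been written back, so clearing oldest_modification ("marking the page clean") without a write-back lets recovery read a stale version, as observed (FIL_PAGE_LSN 19400054 on disk versus 20379428 in memory).

```cpp
#include <cassert>
#include <cstdint>

// Simplified view of one buffer pool page for checking the WAL rule.
struct PageState {
  uint64_t oldest_modification;  // 0 means the page is considered clean
  uint64_t page_lsn;             // FIL_PAGE_LSN of the in-memory page
  uint64_t on_disk_lsn;          // FIL_PAGE_LSN of the on-disk version
};

// May the log checkpoint advance to target_lsn given this page's state?
bool checkpoint_safe(const PageState& p, uint64_t target_lsn)
{
  if (p.oldest_modification != 0 && p.oldest_modification <= target_lsn)
    return false;  // page is still dirty below the proposed checkpoint
  // A page claiming to be clean must really be durable up to its LSN.
  return p.oldest_modification != 0 || p.on_disk_lsn >= p.page_lsn;
}
```

In the bug, the page reports oldest_modification == 0 while the on-disk copy is older than the in-memory one, so advancing the checkpoint to 20379452 is unsafe even though the page looks clean.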
In summary, the extremely slow buffer pool eviction in 10.5 let us find a recovery bug in innodb_undo_log_truncate=ON. It turns out that the following patch will fix this:
That call is not only harmful but also redundant, because mtr_t::commit_shrink() would execute the following in Shrink::operator()(mtr_memo_slot_t*):
However, the above patch is not enough. We would need to adjust the following condition in fsp_page_create():
Before the call to clear_oldest_modification() was removed, this condition would hold and we would end up allocating another block for the same page identifier. This is actually related to running out of the buffer pool! Without the assignment block=nullptr we would hit an assertion failure page.state() != BUF_BLOCK_FILE_PAGE in buf_block_t::initialise() later in fsp_page_create():
I think that we must revise fsp_page_create() as well as Shrink::operator()(mtr_memo_slot_t*) so that the x-latched blocks of the mini-transaction will only be marked clean if they were not modified by the mini-transaction. In that way, we should be able to reuse the already x-latched blocks in the buffer pool.
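The proposed rule boils down to a simple condition on each memo slot. This is a hypothetical sketch with assumed names, not the actual Shrink::operator()(mtr_memo_slot_t*) code: mark an x-latched block clean only when the shrinking mini-transaction did not itself modify it, so that blocks it did modify keep their dirty state and can be reused under the existing x-latch.

```cpp
#include <cassert>

// Simplified stand-in for a mini-transaction memo slot (names assumed).
struct Slot {
  bool x_latched;             // block is x-latched by this mtr
  bool modified_by_this_mtr;  // this mtr wrote to the block
};

// The revised rule proposed above: only unmodified x-latched blocks of
// the shrinking mini-transaction may be marked clean.
bool may_mark_clean(const Slot& slot)
{
  return slot.x_latched && !slot.modified_by_this_mtr;
}
```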
| Comment by Marko Mäkelä [ 2023-11-10 ] |
|
I filed
| Comment by Marko Mäkelä [ 2023-11-10 ] |
|
I ported
| Comment by Matthias Leich [ 2023-11-16 ] |
|
origin/10.5-