[MDEV-27414] Server may hang when innodb_undo_log_truncate=ON Created: 2022-01-03  Updated: 2023-05-25  Resolved: 2022-01-03

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: N/A
Fix Version/s: 10.6.6, 10.7.2, 10.8.1

Type: Bug Priority: Blocker
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: hang, regression

Issue Links:
Problem/Incident
is caused by MDEV-27058 Buffer page descriptors are too large Closed
Relates
relates to MDEV-30180 Server hang with innodb_undo_log_trun... Closed
relates to MDEV-31343 Another server hang with innodb_undo_... Closed

 Description   

MDEV-27058 introduced a deadlock between trx_purge_truncate_history() and buf_pool_t::release_freed_page(). The former holds buf_pool.flush_list_mutex while waiting for an exclusive page latch (X-latch):

    mysql_mutex_lock(&buf_pool.flush_list_mutex);
 
    for (buf_page_t *bpage= UT_LIST_GET_LAST(buf_pool.flush_list); bpage; )
    {
        bpage->lock.x_lock();

At the same time, buf_pool_t::release_freed_page() may hold a U-latch on this page while waiting for buf_pool.flush_list_mutex.

Before MDEV-27058, there was no problem here, because we would first buffer-fix the block, then release buf_pool.flush_list_mutex and only then acquire the block X-latch.

To fix this, we may invoke bpage->lock.x_lock_try() instead. If the attempt fails, we may release and reacquire buf_pool.flush_list_mutex, giving the latch holder a chance to proceed, and then restart the scan.
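
The try-lock-and-restart idea can be illustrated with a minimal, self-contained sketch. This is not the actual InnoDB patch: Page, Pool and scan_with_try_lock() are hypothetical names, and std::mutex stands in for both buf_pool.flush_list_mutex and the page latch (exclusive mode only; the real latch also has shared and update modes).

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for the InnoDB structures; not the real types.
struct Page {
  std::mutex latch;        // stands in for bpage->lock (exclusive mode only)
  bool processed = false;  // set once the scan has handled this page
};

struct Pool {
  std::mutex flush_list_mutex;  // stands in for buf_pool.flush_list_mutex
  std::list<Page*> flush_list;
};

// Visit every page from the tail of the list without ever blocking on a
// page latch while flush_list_mutex is held.
// Returns the number of times the scan was restarted.
std::size_t scan_with_try_lock(Pool &pool) {
  std::size_t restarts = 0;
  std::unique_lock<std::mutex> guard(pool.flush_list_mutex);
  for (;;) {
    bool restart = false;
    for (auto it = pool.flush_list.rbegin(); it != pool.flush_list.rend();
         ++it) {
      Page *page = *it;
      if (page->processed)
        continue;  // already handled before an earlier restart
      if (!page->latch.try_lock()) {
        // Another thread holds the latch and may itself be waiting for
        // flush_list_mutex. Back off so that it can make progress.
        guard.unlock();
        std::this_thread::yield();
        guard.lock();
        ++restarts;
        restart = true;
        break;  // the list may have changed; rescan from the tail
      }
      page->processed = true;
      page->latch.unlock();
    }
    if (!restart) {
      assert(guard.owns_lock());  // the mutex is held until we return
      return restarts;
    }
  }
}
```

Because the scanning thread never sleeps on a page latch while it holds the list mutex, a thread in the position of buf_pool_t::release_freed_page() (latch held, waiting for the mutex) can always proceed. The cost is that a contended scan may restart several times.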

Because MDEV-27058 had not yet been included in any release, no released version is affected by this regression.



 Comments   
Comment by Marko Mäkelä [ 2022-01-03 ]

The hang was repeatable by running

./mtr --parallel=auto --repeat=10 innodb.undo_truncate_recover,4k{,,,,,,,,,,,,,,}

on NVMe storage (not RAM disk). The fix was validated with --repeat=30.

Comment by Marko Mäkelä [ 2022-12-09 ]

MDEV-30180 was filed because this fix turned out to be incomplete.
