In an effort to speed up crash recovery, InnoDB introduced buf_pool_t::flush_rbt, which would keep the flush_list sorted by oldest_modification during recovery.
MDEV-23399 will remove the buf_pool.flush_rbt to simplify the page flushing code.
Such ordering should be beneficial at all times, but maintaining 2 data collections (doubly linked list and binary search tree) is clearly suboptimal. Replacing the doubly-linked buf_pool.flush_list with a priority queue ORDER BY page.oldest_modification ASC, page.id ASC should include the following benefits:
- guarantee linear progress of the page cleaner (advancing the MIN(oldest_modification))
- remove the need of FlushHp (unlike a linked list, a priority queue in something like an array allows the mutex to be released and reacquired in the middle of an iteration)
- remove all page lookups from buf_flush_try_neighbors() because the buf_pool.flush_list is already ORDER BY page.oldest_modification ASC, page.id ASC (this will not affect SSD thanks to
- remove the need to flush the buffer pool after recovery (recv_sys.debug_free()), and thus remove one blocker of
One implementation challenge is that apart from the buf_block_t that are in buf_pool.chunks for ROW_FORMAT=COMPRESSED pages we have some buf_page_t that are allocated with malloc(). The majority of buf_page_t are actually buf_block_t::page. Because of this, the priority queue will probably have to store pointers to buf_page_t.