[MDEV-23754] Replace buf_pool.flush_list with a priority queue Created: 2020-09-18  Updated: 2020-10-28  Resolved: 2020-10-28

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: N/A

Type: Task Priority: Major
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Won't Do Votes: 0
Labels: performance

Issue Links:
PartOf
is part of MDEV-16526 Overhaul the InnoDB page flushing Closed
Relates
relates to MDEV-12227 Defer writes to the InnoDB temporary ... Closed

 Description   

In an effort to speed up crash recovery, InnoDB introduced buf_pool_t::flush_rbt, which would keep the flush_list sorted by oldest_modification during recovery. MDEV-23399 will remove the buf_pool.flush_rbt to simplify the page flushing code.

Such ordering should be beneficial at all times, but maintaining two data structures (a doubly linked list and a binary search tree) is clearly suboptimal. Replacing the doubly linked buf_pool.flush_list with a priority queue ORDER BY page.oldest_modification ASC, page.id ASC should bring the following benefits:

  1. guarantee linear progress of the page cleaner (advancing the MIN(oldest_modification))
  2. remove the need for FlushHp (unlike a linked list, a priority queue stored in something like an array allows the mutex to be released and reacquired in the middle of an iteration)
  3. remove all page lookups from buf_flush_try_neighbors() because the buf_pool.flush_list is already ORDER BY page.oldest_modification ASC, page.id ASC (this will not affect SSD thanks to MDEV-17380)
  4. remove the need to flush the buffer pool after recovery (recv_sys.debug_free()), and thus remove one blocker of MDEV-14481

One implementation challenge is that, apart from the buf_block_t that reside in buf_pool.chunks, for ROW_FORMAT=COMPRESSED pages we have some buf_page_t that are allocated with malloc(). The majority of buf_page_t actually are buf_block_t::page. Because of this, the priority queue will probably have to store pointers to buf_page_t.
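The proposed ordering can be sketched with std::priority_queue over buf_page_t pointers. This is a hypothetical illustration, not InnoDB code: the Page struct is a simplified stand-in for buf_page_t, and the comparator realizes ORDER BY oldest_modification ASC, page.id ASC so that the top element is always MIN(oldest_modification), which is what gives the page cleaner its linear-progress guarantee.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Simplified stand-in for buf_page_t (hypothetical; the real struct is
// defined in buf0buf.h and would be reached via pointers, as noted above).
struct Page {
  uint64_t oldest_modification; // LSN at which the page was first dirtied
  uint32_t page_id;
};

// ORDER BY oldest_modification ASC, page_id ASC: std::priority_queue is a
// max-heap, so the comparator must return true when a sorts *after* b,
// putting the smallest (oldest_modification, page_id) pair on top.
struct FlushOrder {
  bool operator()(const Page* a, const Page* b) const {
    if (a->oldest_modification != b->oldest_modification)
      return a->oldest_modification > b->oldest_modification;
    return a->page_id > b->page_id;
  }
};

using FlushQueue = std::priority_queue<Page*, std::vector<Page*>, FlushOrder>;

// The page cleaner would repeatedly inspect the top element, which is
// always the page with MIN(oldest_modification).
const Page* oldest(const FlushQueue& q) { return q.top(); }
```

Pages with equal oldest_modification (possible when several pages are dirtied by one mini-transaction) come out in page-number order, which is what benefit 3 above relies on.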



 Comments   
Comment by Marko Mäkelä [ 2020-09-24 ]

I suspect that the following O(n) iteration in a critical section of buf_pool.flush_list_mutex in page_cleaner_flush_pages_recommendation() could be replaced with an O(log n) binary search when we switch to a priority queue:

	for (buf_page_t* b = UT_LIST_GET_LAST(buf_pool.flush_list);
	     b != NULL;
	     b = UT_LIST_GET_PREV(list, b)) {
		if (b->oldest_modification() > target_lsn) {
			break;
		}
		++pages_for_lsn;
	}
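The suggested O(log n) replacement could look as follows. Note the caveat: a plain binary heap is only partially ordered, so a binary search requires the container to be kept fully sorted by oldest_modification; this sketch assumes that, using a plain vector of LSNs as a hypothetical stand-in for the flush list.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: if the flush list were a random-access container
// sorted by oldest_modification ASC, the O(n) counting loop above would
// become an O(log n) std::upper_bound. `lsns` stands in for the
// oldest_modification values of the dirty pages.
size_t pages_for_lsn(const std::vector<uint64_t>& lsns, uint64_t target_lsn) {
  // Count of pages with oldest_modification <= target_lsn, i.e. the same
  // quantity the loop above accumulates in pages_for_lsn.
  return static_cast<size_t>(
      std::upper_bound(lsns.begin(), lsns.end(), target_lsn) - lsns.begin());
}
```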

Comment by Vladislav Vaintroub [ 2020-09-24 ]

The STL std::priority_queue does not provide an interface for iterating over it; only the top element is accessible. However, it is probably possible to derive from std::priority_queue and rely on the fact that the underlying container is organized as a heap; then iteration should be possible.

Comment by Marko Mäkelä [ 2020-09-24 ]

We could ‘manually’ create something similar to std::priority_queue, possibly by making use of std::make_heap, which works with random access iterators.
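A minimal sketch of such a ‘manual’ priority queue, built on std::make_heap/std::push_heap/std::pop_heap over a plain std::vector (names and layout here are illustrative, not a proposed InnoDB interface). Unlike std::priority_queue, the underlying container remains accessible, so code like the page_cleaner_flush_pages_recommendation() loop could still iterate over all elements between heap operations:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hand-rolled min-heap on oldest_modification values, keeping the
// backing vector public so that callers can iterate over it.
struct LsnHeap {
  std::vector<uint64_t> v;

  // std::push_heap/pop_heap build a max-heap by default; inverting the
  // comparison yields a min-heap, so v.front() is MIN(oldest_modification).
  static bool cmp(uint64_t a, uint64_t b) { return a > b; }

  void push(uint64_t lsn) {
    v.push_back(lsn);
    std::push_heap(v.begin(), v.end(), cmp);
  }
  uint64_t pop() {
    std::pop_heap(v.begin(), v.end(), cmp); // moves the minimum to v.back()
    uint64_t lsn = v.back();
    v.pop_back();
    return lsn;
  }
  uint64_t min() const { return v.front(); }
};
```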

In a perf record trace of the server with some work-in-progress MDEV-23399 changes, the page cleaner seems to spend most of its CPU cycles in that page_cleaner_flush_pages_recommendation() loop, and some in buf_flush_check_neighbors(). For neighbor flushing (which is always disabled on SSD), we cannot entirely avoid buf_pool.page_hash lookups, but we might reduce their number by checking whether the neighbor pages happen to be adjacent elements in the priority queue (with the same oldest_modification, which could be rare).

Comment by Marko Mäkelä [ 2020-09-25 ]

We might solve MDEV-12227 as part of this task by ensuring that pages of the temporary tablespace will be added at the end of the priority queue (pretending that oldest_modification=LSN_MAX).

Still, I think we had better write the pages of temporary tables solely via the LRU eviction mechanism, that is, keep them entirely out of buf_pool.flush_list or its replacement. So, it might be better to implement MDEV-12227 before this task.

Comment by Marko Mäkelä [ 2020-10-01 ]

It turns out that buf_pool.flush_list is in fact already ORDER BY oldest_modification DESC (the most recently dirtied blocks are added first). Blocks are added by mtr_t::commit(), and the LSN is always incremented between mtr_t::commit() calls, unless several consecutive mini-transactions only modified pages of the temporary tablespace (which, per MDEV-12227, perhaps should not reside in buf_pool.flush_list at all). So, ORDER BY oldest_modification DESC is guaranteed by design.
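The invariant can be illustrated with a toy model (purely illustrative names, not InnoDB code): each commit assigns an LSN no smaller than the previous one and adds its dirty pages at the *front* of the flush list, so the list ends up ordered by oldest_modification DESC by construction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>

// Toy model: the flush list holds oldest_modification LSNs, newest first.
std::deque<uint64_t> flush_list;

// Stand-in for mtr_t::commit(): the LSN passed in is monotonically
// non-decreasing, and newly dirtied pages go to the head of the list.
void mtr_commit(uint64_t lsn) { flush_list.push_front(lsn); }

// The ORDER BY oldest_modification DESC invariant described above.
bool ordered_desc() {
  return std::is_sorted(flush_list.begin(), flush_list.end(),
                        std::greater<uint64_t>());
}
```

This is also why buf_pool_t::get_oldest_modification() can simply look at the tail of the list, as noted in point 4 below the following comment.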

What we can do is the following:

  1. MDEV-23399 will remove the buf_pool.flush_rbt and relax the debug assertion in buf_flush_validate_low() so that random ordering will be allowed when applying redo log records.
  2. At the end of redo log processing, we will flush the buffer pool and re-enable the check. (This is unchanged from earlier, where we had to flush the buffer pool in order to empty the recovery-time buf_pool.flush_rbt.)
  3. In MDEV-14481 we will remove that extra flush by adding a step where the last recovery batch will sort the buf_pool.flush_list.
  4. After MDEV-12227, pages of temporary tablespaces should not be added to buf_pool.flush_list, and both buf_flush_wait_flushed() and buf_pool_t::get_oldest_modification() can simply access the last element.
  5. As part of this task, mtr_t::commit() might add the dirty pages to the buf_pool.flush_list ordered by page number.

Currently it looks like there is not much to be done as part of this task. The doubly linked list should remain an optimal data structure for this purpose.

Comment by Marko Mäkelä [ 2020-10-28 ]

Sorry, I should have done more research before filing this.

Generated at Thu Feb 08 09:24:47 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.