Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-23754

Replace buf_pool.flush_list with a priority queue

Details

    Description

      In an effort to speed up crash recovery, InnoDB introduced buf_pool_t::flush_rbt, which would keep the flush_list sorted by oldest_modification during recovery. MDEV-23399 will remove the buf_pool.flush_rbt to simplify the page flushing code.

      Such ordering should be beneficial at all times, but maintaining 2 data collections (doubly linked list and binary search tree) is clearly suboptimal. Replacing the doubly-linked buf_pool.flush_list with a priority queue ORDER BY page.oldest_modification ASC, page.id ASC should include the following benefits:

      1. guarantee linear progress of the page cleaner (advancing the MIN(oldest_modification))
      2. remove the need of FlushHp (unlike a linked list, a priority queue in something like an array allows the mutex to be released and reacquired in the middle of an iteration)
      3. remove all page lookups from buf_flush_try_neighbors() because the buf_pool.flush_list is already ORDER BY page.oldest_modification ASC, page.id ASC (this will not affect SSD thanks to MDEV-17380)
      4. remove the need to flush the buffer pool after recovery (recv_sys.debug_free()), and thus remove one blocker of MDEV-14481

      One implementation challenge is that apart from the buf_block_t that are in buf_pool.chunks for ROW_FORMAT=COMPRESSED pages we have some buf_page_t that are allocated with malloc(). The majority of buf_page_t are actually buf_block_t::page. Because of this, the priority queue will probably have to store pointers to buf_page_t.

      Attachments

        Issue Links

          Activity

            marko Marko Mäkelä created issue -
            marko Marko Mäkelä made changes -
            Field Original Value New Value
            marko Marko Mäkelä made changes -

            I suspect that the following O(n) iteration in a critical section of buf_pool.flush_list_mutex in page_cleaner_flush_pages_recommendation() could be replaced with an O(log n) binary search when we switch to a priority queue:

            	for (buf_page_t* b = UT_LIST_GET_LAST(buf_pool.flush_list);
            	     b != NULL;
            	     b = UT_LIST_GET_PREV(list, b)) {
            		if (b->oldest_modification() > target_lsn) {
            			break;
            		}
            		++pages_for_lsn;
            	}
            

            marko Marko Mäkelä added a comment - I suspect that the following O( n ) iteration in a critical section of buf_pool.flush_list_mutex in page_cleaner_flush_pages_recommendation() could be replaced with an O(log n ) binary search when we switch to a priority queue: for (buf_page_t* b = UT_LIST_GET_LAST(buf_pool.flush_list); b != NULL; b = UT_LIST_GET_PREV(list, b)) { if (b->oldest_modification() > target_lsn) { break ; } ++pages_for_lsn; }
            wlad Vladislav Vaintroub added a comment - - edited

            STL priority_queue does not have an interface to iterate over it. you can access the top element though. Although, it is probably possible to derive from STL priority_queue and rely on the fact that underlying container is organized as heap structure. then I guess it would be possible to iterate

            wlad Vladislav Vaintroub added a comment - - edited STL priority_queue does not have an interface to iterate over it. you can access the top element though. Although, it is probably possible to derive from STL priority_queue and rely on the fact that underlying container is organized as heap structure. then I guess it would be possible to iterate

            We could ‘manually’ create something similar to std::priority_queue, possibly by making use of std::make_heap, which works with random access iterators.

            In a perf record trace of the server with some work-in-progress MDEV-23399 changes, the page cleaner seems to be spending most of its CPU cycles in that page_cleaner_flush_pages_recommendation() loop and also some in buf_flush_check_neighbors(). For neighbor flushing (which is always disabled on SSD), we cannot entirely avoid buf_pool.page_hash lookups, but we might reduce their amount by checking if the neighbor pages happen to be adjacent elements in the priority queue (with the same oldest_modification, which could be rare).

            marko Marko Mäkelä added a comment - We could ‘manually’ create something similar to std::priority_queue , possibly by making use of std::make_heap , which works with random access iterators. In a perf record trace of the server with some work-in-progress MDEV-23399 changes, the page cleaner seems to be spending most of its CPU cycles in that page_cleaner_flush_pages_recommendation() loop and also some in buf_flush_check_neighbors() . For neighbor flushing (which is always disabled on SSD), we cannot entirely avoid buf_pool.page_hash lookups, but we might reduce their amount by checking if the neighbor pages happen to be adjacent elements in the priority queue (with the same oldest_modification , which could be rare).
            marko Marko Mäkelä made changes -

            We might solve MDEV-12227 as part of this task by ensuring that pages of the temporary tablespace will be added at the end of the priority queue (pretending that oldest_modification=LSN_MAX).

            But, I would still think that we’d better write the pages of temporary tables solely via the LRU eviction mechanism, that is, keep them entirely out of the buf_pool.flush_list or its replacement. So, it might be better to implement MDEV-12227 before this task.

            marko Marko Mäkelä added a comment - We might solve MDEV-12227 as part of this task by ensuring that pages of the temporary tablespace will be added at the end of the priority queue (pretending that oldest_modification=LSN_MAX ). But, I would still think that we’d better write the pages of temporary tables solely via the LRU eviction mechanism, that is, keep them entirely out of the buf_pool.flush_list or its replacement. So, it might be better to implement MDEV-12227 before this task.

            It turns out that the buf_pool.flush_list actually is already ORDER BY oldest_modification DESC (most recently dirtied blocks are added first). Blocks are added by mtr_t::commit(), and the LSN would always be incremented between mtr_t::commit(), unless we have several mini-transactions that only modified pages in the temporary tablespace (which per MDEV-12227 maybe should not reside in the buf_pool.flush_list at all). So, the ORDER BY oldest_modification DESC is guaranteed by design.

            What we can do is the following:

            1. MDEV-23399 will remove the buf_pool.flush_rbt and relax the debug assertion in buf_flush_validate_low() so that random ordering will be allowed when applying redo log records.
            2. At the end of redo log processing, we will flush the buffer pool and re-enable the check. (This is unchanged from earlier, where we had to flush the buffer pool in order to empty the recovery-time buf_pool.flush_rbt.)
            3. In MDEV-14481 we will remove that extra flush by adding a step where the last recovery batch will sort the buf_pool.flush_list.
            4. After MDEV-12227, pages of temporary tablespaces should not be added to buf_pool.flush_list, and both buf_flush_wait_flushed() and buf_pool_t::get_oldest_modification() can simply access the last element.
            5. As part of this task, mtr_t::commit() might add the dirty pages to the buf_pool.flush_list ordered by page number.

            Currently it looks like there is not much that can be done as part of this task. The doubly-linked list should remain an optimal data structure for this purpose.

            marko Marko Mäkelä added a comment - It turns out that the buf_pool.flush_list actually is already ORDER BY oldest_modification DESC (most recently dirtied blocks are added first). Blocks are added by mtr_t::commit() , and the LSN would always be incremented between mtr_t::commit() , unless we have several mini-transactions that only modified pages in the temporary tablespace (which per MDEV-12227 maybe should not reside in the buf_pool.flush_list at all). So, the ORDER BY oldest_modification DESC is guaranteed by design. What we can do is the following: MDEV-23399 will remove the buf_pool.flush_rbt and relax the debug assertion in buf_flush_validate_low() so that random ordering will be allowed when applying redo log records. At the end of redo log processing, we will flush the buffer pool and re-enable the check. (This is unchanged from earlier, where we had to flush the buffer pool in order to empty the recovery-time buf_pool.flush_rbt .) In MDEV-14481 we will remove that extra flush by adding a step where the last recovery batch will sort the buf_pool.flush_list . After MDEV-12227 , pages of temporary tablespaces should not be added to buf_pool.flush_list , and both buf_flush_wait_flushed() and buf_pool_t::get_oldest_modification() can simply access the last element. As part of this task, mtr_t::commit() might add the dirty pages to the buf_pool.flush_list ordered by page number. Currently it looks like there is not much that can be done as part of this task. The doubly-linked list should remain an optimal data structure for this purpose.
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]

            Sorry, I should have done more research before filing this.

            marko Marko Mäkelä added a comment - Sorry, I should have done more research before filing this.
            marko Marko Mäkelä made changes -
            Fix Version/s N/A [ 14700 ]
            Fix Version/s 10.6 [ 24028 ]
            Assignee Vladislav Vaintroub [ wlad ] Marko Mäkelä [ marko ]
            Resolution Won't Do [ 10201 ]
            Status Confirmed [ 10101 ] Closed [ 6 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 113651 ] MariaDB v4 [ 134341 ]

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.