Details

    Description

      MDEV-25113 removed the acquisition of buf_pool.flush_list_mutex from buf_page_write_complete(). But, acquiring buf_pool.mutex should not really be necessary except in eviction flushing (request.is_LRU()) or when the block->lock is not available (no uncompressed copy of a ROW_FORMAT=COMPRESSED exists in the buffer pool).

      In the common case of checkpoint flushing, we can actually rely on the block->lock to protect us. The only other reason to hold buf_pool.mutex was updating the count of outstanding page writes (buf_pool.n_flush_list()). That counter is actually redundant, because we can refer to write_slots->pending_io_count().

      Finally, we would remove pthread_cond_broadcast(&buf_pool.done_flush_list) from buf_page_write_complete(). Only the buf_flush_page_cleaner would call that. To wait for page writes to complete, os_aio_wait_until_no_pending_writes() may be invoked.

      Interface change:
      New parameter Innodb_buffer_pool_pages_split has been added

      Attachments

        Issue Links

          Activity

            We are still observing a performance regression after rebasing this on MDEV-27058. Hence, I think that we must preserve an explicit counter of the number of outstanding page writes. We could protect that counter with buf_dblwr.mutex, which we could acquire also for page writes that skipped the doublewrite buffer. Instead of removing the condition variable buf_pool.done_flush_list, we can replace it with something that pairs with buf_dblwr.mutex.

            marko Marko Mäkelä added a comment - We are still observing a performance regression after rebasing this on MDEV-27058 . Hence, I think that we must preserve an explicit counter of the number of outstanding page writes. We could protect that counter with buf_dblwr.mutex , which we could acquire also for page writes that skipped the doublewrite buffer. Instead of removing the condition variable buf_pool.done_flush_list , we can replace it with something that pairs with buf_dblwr.mutex .

            The first part of this was pushed as MDEV-27416, to fix a rare hang during checkpoint. That change did not cause any performance regression in our internal tests, but it did cause MDEV-27499.

            The remaining 2 commits were last rebased on MDEV-14425, and an observed regression remained.

            I am keeping this ticket "in progress" as a reminder for me that I consider this an important issue that needs to be resolved. After all, performance tests of MDEV-14425 indicates that by far, the most contended mutexes under write load are log_sys.mutex and buf_pool.mutex, and this change aims to reduce contention on the latter.

            marko Marko Mäkelä added a comment - The first part of this was pushed as MDEV-27416 , to fix a rare hang during checkpoint. That change did not cause any performance regression in our internal tests, but it did cause MDEV-27499 . The remaining 2 commits were last rebased on MDEV-14425 , and an observed regression remained. I am keeping this ticket "in progress" as a reminder for me that I consider this an important issue that needs to be resolved. After all, performance tests of MDEV-14425 indicates that by far, the most contended mutexes under write load are log_sys.mutex and buf_pool.mutex , and this change aims to reduce contention on the latter.

            A further idea: Why should the LRU eviction flushing actually evict each page on write completion? It would seem better to keep the write completion function as simple as possible, and to evict pages in the actual user threads that need to allocate a page. Page allocation will already have to hold buf_pool.mutex anyway, even when no page is being evicted.

            Furthermore, the buf_flush_page_cleaner() thread could initiate two types of page writes, like it used to do before MDEV-23855: Not only write blocks ordered by buf_pool.flush_list, but also some dirty blocks ordered by the buf_pool.LRU list. That would ensure that some of the least frequently used pages are clean, and simply allow a less frequently used block to be replaced by whatever needs to allocate the page. Pages that do not need to be written back to data files can be reused instantly.

            marko Marko Mäkelä added a comment - A further idea: Why should the LRU eviction flushing actually evict each page on write completion? It would seem better to keep the write completion function as simple as possible, and to evict pages in the actual user threads that need to allocate a page. Page allocation will already have to hold buf_pool.mutex anyway, even when no page is being evicted. Furthermore, the buf_flush_page_cleaner() thread could initiate two types of page writes, like it used to do before MDEV-23855 : Not only write blocks ordered by buf_pool.flush_list , but also some dirty blocks ordered by the buf_pool.LRU list. That would ensure that some of the least frequently used pages are clean, and simply allow a less frequently used block to be replaced by whatever needs to allocate the page. Pages that do not need to be written back to data files can be reused instantly.

            In MDEV-29383 I noted the following: Performance could be improved if we did not set mtr_t::m_made_dirty already when registering MTR_MEMO_PAGE_X_FIX or MTR_MEMO_PAGE_SX_FIX, but deferred it until the moment we set the MTR_MEMO_MODIFY flag on a block. In that way, even if a mini-transaction acquired a U or X latch on a page but never modified that page, mtr_t::commit() could avoid acquiring log_sys.flush_order_mutex. We only need that mutex when the mini-transaction actually needs to add a previously clean block to buf_pool.flush_list.

            marko Marko Mäkelä added a comment - In MDEV-29383 I noted the following: Performance could be improved if we did not set mtr_t::m_made_dirty already when registering MTR_MEMO_PAGE_X_FIX or MTR_MEMO_PAGE_SX_FIX , but deferred it until the moment we set the MTR_MEMO_MODIFY flag on a block. In that way, even if a mini-transaction acquired a U or X latch on a page but never modified that page, mtr_t::commit() could avoid acquiring log_sys.flush_order_mutex . We only need that mutex when the mini-transaction actually needs to add a previously clean block to buf_pool.flush_list .

            After I rebased this on an attempted fix of MDEV-26055, I realized that a follow-up to MDEV-25113 is needed, now that page write completion no longer involves acquiring buf_pool.mutex. All use of buf_pool.flush_hp will have to be evaluated carefully. The ‘dirtiness’ of blocks will be protected by buf_page_t::lock, which needs to be acquired earlier during flushing.

            Furthermore, I realized that buf_page_t::write_complete() may not need to perform buf_pool.LRU eviction at all; starting with a fix of MDEV-26055 that would be done in larger batches by buf_pool_page_cleaner(). This would make IORequest::is_LRU() redundant.

            marko Marko Mäkelä added a comment - After I rebased this on an attempted fix of MDEV-26055 , I realized that a follow-up to MDEV-25113 is needed, now that page write completion no longer involves acquiring buf_pool.mutex . All use of buf_pool.flush_hp will have to be evaluated carefully. The ‘dirtiness’ of blocks will be protected by buf_page_t::lock , which needs to be acquired earlier during flushing. Furthermore, I realized that buf_page_t::write_complete() may not need to perform buf_pool.LRU eviction at all; starting with a fix of MDEV-26055 that would be done in larger batches by buf_pool_page_cleaner() . This would make IORequest::is_LRU() redundant.

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              1 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.