MDEV-25113

Reduce effect of parallel background flush on select workload

Details

    Description

      Consider a workload pattern in which a read-write phase is followed by a read-only phase.

      The read-write phase modifies pages, which causes background flushing to become active and flush dirty pages. Flushing continues until dirty_pct < innodb_max_dirty_pages_pct_lwm, so background flushing can remain active even after the read-write phase has ended and the read-only workload has started.
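
      For reference, a minimal sketch of this activation check, using illustrative names (buf_pool_stats, background_flush_needed and the variable holding innodb_max_dirty_pages_pct_lwm are stand-ins, not the actual InnoDB identifiers):

      // Illustrative sketch only: background flushing keeps running while the
      // dirty-page percentage is at or above the configured low-water mark,
      // regardless of whether the foreground workload is still writing.
      #include <cstdint>

      struct buf_pool_stats
      {
        std::uint64_t n_dirty_pages;  // pages currently on the flush list
        std::uint64_t n_total_pages;  // buffer pool size in pages
      };

      // stand-in for the innodb_max_dirty_pages_pct_lwm setting (example value)
      static double max_dirty_pages_pct_lwm = 10.0;

      static bool background_flush_needed(const buf_pool_stats &stats)
      {
        const double dirty_pct = 100.0 * double(stats.n_dirty_pages) /
                                 double(stats.n_total_pages);
        return dirty_pct >= max_dirty_pages_pct_lwm;
      }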

      This could affect the overall throughput of the select workload.

      This task is meant to explore whether the effect of background flushing on an active select workload can be reduced (in the example below, qps drops from roughly 94K to about 91K while flushing is active).

      -----------

      In the example below, adaptive flushing is active for the first 14 seconds, and after that no flushing takes place. As can be seen, tps is a bit noisy and slightly lower during the first part.

      Example:

      [ 1s ] thds: 8 tps: 5864.27 qps: 93922.16 (r/w/o: 82185.64/0.00/11736.53) lat (ms,95%): 1.44 err/s: 0.00 reconn/s: 0.00
      [ 2s ] thds: 8 tps: 5878.06 qps: 94032.93 (r/w/o: 82276.81/0.00/11756.12) lat (ms,95%): 1.44 err/s: 0.00 reconn/s: 0.00
      [ 3s ] thds: 8 tps: 5901.05 qps: 94402.87 (r/w/o: 82601.76/0.00/11801.11) lat (ms,95%): 1.44 err/s: 0.00 reconn/s: 0.00
      [ 4s ] thds: 8 tps: 5822.01 qps: 93179.09 (r/w/o: 81534.08/0.00/11645.01) lat (ms,95%): 1.44 err/s: 0.00 reconn/s: 0.00
      [ 5s ] thds: 8 tps: 5754.02 qps: 92043.27 (r/w/o: 80535.24/0.00/11508.03) lat (ms,95%): 1.47 err/s: 0.00 reconn/s: 0.00
      [ 6s ] thds: 8 tps: 5682.98 qps: 90906.72 (r/w/o: 79540.76/0.00/11365.97) lat (ms,95%): 1.52 err/s: 0.00 reconn/s: 0.00
      [ 7s ] thds: 8 tps: 5698.98 qps: 91228.70 (r/w/o: 79830.74/0.00/11397.96) lat (ms,95%): 1.52 err/s: 0.00 reconn/s: 0.00
      [ 8s ] thds: 8 tps: 5688.00 qps: 90974.05 (r/w/o: 79598.04/0.00/11376.01) lat (ms,95%): 1.52 err/s: 0.00 reconn/s: 0.00
      [ 9s ] thds: 8 tps: 5641.97 qps: 90312.57 (r/w/o: 79028.62/0.00/11283.95) lat (ms,95%): 1.52 err/s: 0.00 reconn/s: 0.00
      [ 10s ] thds: 8 tps: 5695.03 qps: 91104.54 (r/w/o: 79714.48/0.00/11390.07) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
      [ 11s ] thds: 8 tps: 5736.92 qps: 91757.79 (r/w/o: 80283.94/0.00/11473.85) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
      [ 12s ] thds: 8 tps: 5692.11 qps: 91080.74 (r/w/o: 79696.52/0.00/11384.22) lat (ms,95%): 1.52 err/s: 0.00 reconn/s: 0.00
      [ 13s ] thds: 8 tps: 5691.95 qps: 91085.15 (r/w/o: 79701.26/0.00/11383.89) lat (ms,95%): 1.55 err/s: 0.00 reconn/s: 0.00
      [ 14s ] thds: 8 tps: 5692.04 qps: 91082.60 (r/w/o: 79698.53/0.00/11384.08) lat (ms,95%): 1.55 err/s: 0.00 reconn/s: 0.00

      [ 15s ] thds: 8 tps: 5775.99 qps: 92398.80 (r/w/o: 80846.82/0.00/11551.97) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
      [ 16s ] thds: 8 tps: 5828.96 qps: 93266.41 (r/w/o: 81608.49/0.00/11657.93) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
      [ 17s ] thds: 8 tps: 5835.04 qps: 93376.59 (r/w/o: 81706.52/0.00/11670.07) lat (ms,95%): 1.47 err/s: 0.00 reconn/s: 0.00
      [ 18s ] thds: 8 tps: 5810.96 qps: 92941.38 (r/w/o: 81320.46/0.00/11620.92) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
      [ 19s ] thds: 8 tps: 5834.05 qps: 93358.84 (r/w/o: 81689.73/0.00/11669.10) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
      [ 20s ] thds: 8 tps: 5836.97 qps: 93387.55 (r/w/o: 81713.60/0.00/11673.94) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
      [ 21s ] thds: 8 tps: 5836.97 qps: 93425.60 (r/w/o: 81751.65/0.00/11673.95) lat (ms,95%): 1.47 err/s: 0.00 reconn/s: 0.00
      [ 22s ] thds: 8 tps: 5839.98 qps: 93424.69 (r/w/o: 81745.73/0.00/11678.96) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
      [ 23s ] thds: 8 tps: 5838.04 qps: 93412.72 (r/w/o: 81735.63/0.00/11677.09) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00

          Activity

            krizhanovsky Alexander Krizhanovsky added a comment (edited)

            These statistics were collected with the default values of innodb_flushing_avg_loops, innodb_max_dirty_pages_pct, and innodb_max_dirty_pages_pct_lwm, with Marko's patch MDEV-25113.patch and the following change:

            diff --git a/storage/innobase/log/log0log.cc b/storage/innobase/log/log0log.cc
            index 83036d81658..e294f41f0fd 100644
            --- a/storage/innobase/log/log0log.cc
            +++ b/storage/innobase/log/log0log.cc
            @@ -163,7 +163,7 @@ log_set_capacity(ulonglong file_size)
             
                    log_sys.log_capacity = smallest_capacity;
             
            -	log_sys.max_modified_age_async = margin - margin / 8;
            +	log_sys.max_modified_age_async = margin - margin / 4;
             	log_sys.max_checkpoint_age = margin;
             
                    mysql_mutex_unlock(&log_sys.mutex);
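
            For reference, a minimal sketch of the arithmetic behind this one-line change (the margin value is only an example): it lowers the threshold at which asynchronous flushing starts from 87.5% to 75% of the checkpoint margin, so background flushing kicks in earlier.

            // Illustrative sketch only, not the surrounding log_set_capacity() code.
            #include <cstdio>

            int main()
            {
              const unsigned long long margin = 1000000000ULL;          // example value
              const unsigned long long old_async = margin - margin / 8; // 87.5% of margin
              const unsigned long long new_async = margin - margin / 4; // 75.0% of margin
              std::printf("old max_modified_age_async: %llu (%.1f%% of margin)\n",
                          old_async, 100.0 * old_async / margin);
              std::printf("new max_modified_age_async: %llu (%.1f%% of margin)\n",
                          new_async, 100.0 * new_async / margin);
              return 0;
            }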
            

            If the mutex were the problem, I would expect to see a spike on the mutex during the performance gap, but there are actually fewer mutex events during the dip.

            marko Marko Mäkelä added a comment

            krizhanovsky, I think that the buf_pool.flush_list_mutex hold time is typically rather short, with the exception of special processing of a single tablespace (MDEV-25773). In your graphs, we see a high number of events on the mutex, which I think should be due to buf_flush_note_modification() adding blocks to the start of the list. If we implement my ‘lazy removal’ idea, the numerous buf_page_write_completion() will stop interfering with concurrent buf_flush_note_modification() from user threads.
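
            For illustration, a minimal sketch of the 'lazy removal' idea under simplified assumptions (page_t, flush_list and flush_list_mutex are stand-ins, not the real buf_page_t/buf_pool declarations): write completion only marks the page clean, and the stale entry is unlinked later by a traversal that already holds the flush-list mutex, so it no longer contends with the user threads appending to the list.

            // Illustrative sketch only: lazy removal from a flush list.
            #include <atomic>
            #include <cstdint>
            #include <list>
            #include <mutex>

            struct page_t
            {
              // 0 means "clean"; otherwise the oldest modification LSN.
              std::atomic<std::uint64_t> oldest_modification{0};
            };

            std::mutex flush_list_mutex;
            std::list<page_t*> flush_list;

            // Write completion: the flush-list mutex is not taken here, so user
            // threads appending newly dirtied pages are not blocked.
            void on_write_completed(page_t *page)
            {
              page->oldest_modification.store(0, std::memory_order_release);
            }

            // Page cleaner: unlink already-clean entries while traversing the
            // list anyway, under the mutex it must hold for the traversal.
            void garbage_collect_flush_list()
            {
              std::lock_guard<std::mutex> guard(flush_list_mutex);
              for (auto it = flush_list.begin(); it != flush_list.end(); )
              {
                if ((*it)->oldest_modification.load(std::memory_order_acquire) == 0)
                  it = flush_list.erase(it);   // lazily remove the clean page
                else
                  ++it;
              }
            }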

            I have an even wilder idea: Implement buf_pool.flush_list as a lock-free singly-linked list. I would expect that a lock-free implementation of a singly-linked list is almost trivial. We can see that the addition to the list is protected by log_sys.flush_order_mutex. Declaring the ‘next’ pointers as Atomic_relaxed<buf_page_t*> could be a good starting point.
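
            A rough sketch of that starting point, with std::atomic standing in for Atomic_relaxed and illustrative names (node_t, flush_list_head): the CAS loop below is safe even for fully concurrent producers; given that insertions into buf_pool.flush_list are already serialized by log_sys.flush_order_mutex, a plain release store of the new head would be sufficient there.

            // Illustrative sketch only: lock-free push to the head of a
            // singly-linked flush list.
            #include <atomic>
            #include <cstdint>

            struct node_t
            {
              std::uint64_t oldest_modification = 0;
              std::atomic<node_t*> next{nullptr};   // cf. Atomic_relaxed<buf_page_t*>
            };

            std::atomic<node_t*> flush_list_head{nullptr};

            void flush_list_push(node_t *page)
            {
              node_t *old_head = flush_list_head.load(std::memory_order_relaxed);
              do
                // Link the new node in front of the current head; the CAS below
                // retries if another producer changed the head concurrently.
                page->next.store(old_head, std::memory_order_relaxed);
              while (!flush_list_head.compare_exchange_weak(old_head, page,
                                                            std::memory_order_release,
                                                            std::memory_order_relaxed));
            }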

            Moving to a singly-linked list would require a rewrite of buf_flush_LRU_list_batch(). Currently it iterates the buf_pool.LRU list, and then invokes buf_flush_discard_page() for individual blocks. When removing from a singly-linked list, we need to know the address of the preceding block. We get it for free when traversing the buf_pool.flush_list, but in buf_flush_LRU_list_batch() an extra traversal of the list would be needed.
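
            To make the extra cost concrete, a sketch of removing one node from a singly-linked list (illustrative, self-contained names unrelated to the real buf_page_t): the link pointing at the victim must be located first, so discarding a page found via the LRU list implies an O(n) walk of the flush list instead of the O(1) unlink a doubly-linked list allows.

            // Illustrative sketch only: unlinking a node from a singly-linked list.
            struct slist_node_t { slist_node_t *next; };

            // Walks the list from *head to find the link pointing at 'victim' and
            // splices the victim out.  Returns false if the victim is not found.
            bool unlink_node(slist_node_t **head, slist_node_t *victim)
            {
              for (slist_node_t **prev = head; *prev != nullptr; prev = &(*prev)->next)
              {
                if (*prev == victim)
                {
                  *prev = victim->next;
                  victim->next = nullptr;
                  return true;
                }
              }
              return false;
            }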

            Also the function buf_flush_relocate_on_flush_list() (which is only used with ROW_FORMAT=COMPRESSED tables) could take a significant performance hit from moving to a singly-linked list.

            I will not be able to implement any of this before I have made some significant progress with MDEV-25506 and MDEV-25783.

            marko Marko Mäkelä added a comment

            I worked on a prototype that implements the lazy deletion from buf_pool.flush_list. The test innodb.ibuf_not_empty is failing, presumably due to a missed page write, so it is not ready for stress testing or performance testing yet.

            I noticed that buf_page_write_complete() is also acquiring buf_pool.mutex. Some other flush list traversal is also covered by both buf_pool.mutex and buf_pool.flush_list_mutex. This is likely to be somewhat of a bottleneck, and we should try to reduce the use of buf_pool.mutex.

            marko Marko Mäkelä added a comment

            After some experiments, I do not think that we can easily remove buf_pool.mutex from buf_page_write_complete().

            marko Marko Mäkelä added a comment

            Some follow-up work was postponed to MDEV-26004 due to mixed results.


            People

              marko Marko Mäkelä
              krunalbauskar Krunal Bauskar
              Votes: 1
              Watchers: 8

