[MDEV-25113] Reduce effect of parallel background flush on select workload Created: 2021-03-11  Updated: 2023-08-28  Resolved: 2021-06-23

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: 10.5.12, 10.6.3

Type: Task Priority: Major
Reporter: Krunal Bauskar Assignee: Marko Mäkelä
Resolution: Fixed Votes: 1
Labels: performance

Attachments: PNG File 25113-fix2.png     File MDEV-25113.patch     PNG File tpcc_10.6_1h_multi.png     PNG File tpcc_10.6_1h_pct_lwn10ppct30_io1020x88_new.png     PNG File tpcc_10.6_bp25g_io.png    
Issue Links:
Blocks
blocks MDEV-26827 Make page flushing even faster Closed
Problem/Incident
causes MDEV-26010 Assertion `lsn > 2' failed in buf_poo... Closed
causes MDEV-26017 Assertion `stat.flush_list_bytes <= c... Closed
causes MDEV-26200 buf_pool.flush_list corruption in buf... Closed
Relates
relates to MDEV-26004 Excessive wait times in buf_LRU_get_f... Closed
relates to MDEV-24949 Enabling idle flushing (possible regr... Closed
relates to MDEV-25093 Adaptive flushing fails to kick in ev... Closed
relates to MDEV-27868 buf_pool.flush_list is in the wrong o... Closed
relates to MDEV-31354 SIGSEGV in log_sort_flush_list() in I... Closed
relates to MDEV-32029 Assertion failures in log_sort_flush_... Closed

 Description   

Let's say the pattern of workload involves read-write workload followed by read-only workload.

The read-write workload modifies pages, which causes background flushing to become active and flush dirty pages. This flushing continues until dirty_pct falls below innodb_max_dirty_pages_pct_lwm. This means that even after the read-write workload has ended and the read-only workload has started, background flushing could remain active.

This could affect the overall throughput of the select workload.

This task is meant to explore whether it is possible to reduce the effect of background flushing on an active select workload (94->93->92->91).

-----------

In the example below, for the first 14 secs adaptive flushing is active, and after that no flushing takes place. As we can see, the tps has a bit of noise in the first part.

Example:

[ 1s ] thds: 8 tps: 5864.27 qps: 93922.16 (r/w/o: 82185.64/0.00/11736.53) lat (ms,95%): 1.44 err/s: 0.00 reconn/s: 0.00
[ 2s ] thds: 8 tps: 5878.06 qps: 94032.93 (r/w/o: 82276.81/0.00/11756.12) lat (ms,95%): 1.44 err/s: 0.00 reconn/s: 0.00
[ 3s ] thds: 8 tps: 5901.05 qps: 94402.87 (r/w/o: 82601.76/0.00/11801.11) lat (ms,95%): 1.44 err/s: 0.00 reconn/s: 0.00
[ 4s ] thds: 8 tps: 5822.01 qps: 93179.09 (r/w/o: 81534.08/0.00/11645.01) lat (ms,95%): 1.44 err/s: 0.00 reconn/s: 0.00
[ 5s ] thds: 8 tps: 5754.02 qps: 92043.27 (r/w/o: 80535.24/0.00/11508.03) lat (ms,95%): 1.47 err/s: 0.00 reconn/s: 0.00
[ 6s ] thds: 8 tps: 5682.98 qps: 90906.72 (r/w/o: 79540.76/0.00/11365.97) lat (ms,95%): 1.52 err/s: 0.00 reconn/s: 0.00
[ 7s ] thds: 8 tps: 5698.98 qps: 91228.70 (r/w/o: 79830.74/0.00/11397.96) lat (ms,95%): 1.52 err/s: 0.00 reconn/s: 0.00
[ 8s ] thds: 8 tps: 5688.00 qps: 90974.05 (r/w/o: 79598.04/0.00/11376.01) lat (ms,95%): 1.52 err/s: 0.00 reconn/s: 0.00
[ 9s ] thds: 8 tps: 5641.97 qps: 90312.57 (r/w/o: 79028.62/0.00/11283.95) lat (ms,95%): 1.52 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 8 tps: 5695.03 qps: 91104.54 (r/w/o: 79714.48/0.00/11390.07) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
[ 11s ] thds: 8 tps: 5736.92 qps: 91757.79 (r/w/o: 80283.94/0.00/11473.85) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
[ 12s ] thds: 8 tps: 5692.11 qps: 91080.74 (r/w/o: 79696.52/0.00/11384.22) lat (ms,95%): 1.52 err/s: 0.00 reconn/s: 0.00
[ 13s ] thds: 8 tps: 5691.95 qps: 91085.15 (r/w/o: 79701.26/0.00/11383.89) lat (ms,95%): 1.55 err/s: 0.00 reconn/s: 0.00
[ 14s ] thds: 8 tps: 5692.04 qps: 91082.60 (r/w/o: 79698.53/0.00/11384.08) lat (ms,95%): 1.55 err/s: 0.00 reconn/s: 0.00

[ 15s ] thds: 8 tps: 5775.99 qps: 92398.80 (r/w/o: 80846.82/0.00/11551.97) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
[ 16s ] thds: 8 tps: 5828.96 qps: 93266.41 (r/w/o: 81608.49/0.00/11657.93) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
[ 17s ] thds: 8 tps: 5835.04 qps: 93376.59 (r/w/o: 81706.52/0.00/11670.07) lat (ms,95%): 1.47 err/s: 0.00 reconn/s: 0.00
[ 18s ] thds: 8 tps: 5810.96 qps: 92941.38 (r/w/o: 81320.46/0.00/11620.92) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
[ 19s ] thds: 8 tps: 5834.05 qps: 93358.84 (r/w/o: 81689.73/0.00/11669.10) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 8 tps: 5836.97 qps: 93387.55 (r/w/o: 81713.60/0.00/11673.94) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
[ 21s ] thds: 8 tps: 5836.97 qps: 93425.60 (r/w/o: 81751.65/0.00/11673.95) lat (ms,95%): 1.47 err/s: 0.00 reconn/s: 0.00
[ 22s ] thds: 8 tps: 5839.98 qps: 93424.69 (r/w/o: 81745.73/0.00/11678.96) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
[ 23s ] thds: 8 tps: 5838.04 qps: 93412.72 (r/w/o: 81735.63/0.00/11677.09) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00



 Comments   
Comment by Marko Mäkelä [ 2021-03-11 ]

This problem was originally observed while we tested MDEV-24949. MDEV-25093 has some similarity with this.

Comment by Marko Mäkelä [ 2021-05-11 ]

krizhanovsky ran a test that suggests that the adaptive flushing could be improved:

If stable throughput and latency are more important than minimizing write amplification (and maximizing the lifetime of the storage), that graph does show ‘bad’ behaviour. The cause of the ‘badness’ should be that we let the checkpoint age hover around the maximum. Occasionally, the page cleaner would be able to write out a larger burst of pages, presumably because the log had already been durably written up to FIL_PAGE_LSN on all of the pages. (Often, the page cleaner would have to skip such pages, hoping to be able to write them during the next batch, after more of the redo log has been written.)

While I believe that such performance may be desirable in some deployments, something definitely could be done to improve the adaptive flushing. I would like to apply control theory to this problem and implement the adaptive flushing as a PID controller. The inputs would include the checkpoint age and the length of buf_pool.flush_list, possibly relative to buf_pool.LRU (but maybe we should omit pages of temporary or dropped tables from the latter count).

Comment by Alexander Krizhanovsky (Inactive) [ 2021-05-17 ]

After several iterations of tuning the flushing process for this workload, I came to the following strange results. The bad thing about the previous graphs is that while the checkpoint age and the number of dirty pages were growing, there was almost no IO until we hit the flush storm. So I set

innodb_max_dirty_pages_pct=30
innodb_max_dirty_pages_pct_lwm=10
innodb_io_capacity=10000
innodb_io_capacity_max=20000
innodb_write_io_threads=8

to make flushing start earlier. With these settings the number of dirty pages is relatively stable, but the LSN age still hits the maximum value. This graph contains buffer_flush_sync_waits, which spikes right during the performance dip.

The strange result about adaptive flushing is that the number of flushed pages reaches its peak right before the performance dip, and IO is significantly reduced during the dip and after it.

The disk can handle up to 64K random writes of 16KB size (verified with fio), however InnoDB still doesn't reach the max IO capacity even during the flush storm.

Comment by Alexander Krizhanovsky (Inactive) [ 2021-05-24 ]

I also collected graphs for the same workload with a much smaller buffer pool and a bit higher IO capacity:

innodb_io_capacity=15000
innodb_io_capacity_max=20000
innodb_buffer_pool_size= 25G

The behaviour looks the same: underutilized IO during normal operation and sharp waves when checkpoint age reaches the maximum value.

Comment by Marko Mäkelä [ 2021-05-24 ]

What if we make mtr_t::commit() trigger a less eager flush earlier? That is, make log_close() return at least one more value to mtr_t::finish_write(), instead of the current Boolean (indicating whether buf_flush_async() will have to be invoked). It would seem reasonable to restore two checkpoint age limits and introduce another atomic variable buf_flush_async_lsn (in addition to buf_flush_sync_lsn) for the case when only a lower checkpoint age limit is exceeded.

In buf_flush_page_cleaner(), we would avoid goto unemployed as long as buf_flush_async_lsn is set. We would reset that variable as soon as we are below the checkpoint age limit again.

Comment by Marko Mäkelä [ 2021-05-27 ]

Based on some analysis of stack traces in MDEV-25773, I realized that several threads executing buf_page_write_completion() will be competing for buf_pool.flush_list_mutex. Could we employ some lock-free data structure to avoid the mutex acquisition, and let the buf_flush_page_cleaner thread remove the already flushed blocks on its next run? Maybe we could just reset the oldest_modification field to signal that the block is clean, and introduce a separate flag "exists as garbage in buf_pool.flush_list" to prevent the list from becoming corrupted?

Comment by Marko Mäkelä [ 2021-05-27 ]

Even simpler idea: Use a special value such as oldest_modification=1 to indicate that the page is actually clean but exists as garbage in buf_pool.flush_list. LSN values less than 2048 are impossible by the design of the redo log file.

Garbage would be collected (list member removed and the oldest_modification reset to 0) by any code that encounters such a member in buf_pool.flush_list while holding buf_pool.flush_list_mutex. Also, if a block is evicted and its oldest_modification is not 0, we would first remove it from buf_pool.flush_list.

Comment by Alexander Krizhanovsky (Inactive) [ 2021-05-27 ]

These statistics were collected with the default values of innodb_flushing_avg_loops, innodb_max_dirty_pages_pct, and innodb_max_dirty_pages_pct_lwm, with Marko's patch MDEV-25113.patch and

diff --git a/storage/innobase/log/log0log.cc b/storage/innobase/log/log0log.cc
index 83036d81658..e294f41f0fd 100644
--- a/storage/innobase/log/log0log.cc
+++ b/storage/innobase/log/log0log.cc
@@ -163,7 +163,7 @@ log_set_capacity(ulonglong file_size)
 
        log_sys.log_capacity = smallest_capacity;
 
-	log_sys.max_modified_age_async = margin - margin / 8;
+	log_sys.max_modified_age_async = margin - margin / 4;
 	log_sys.max_checkpoint_age = margin;
 
        mysql_mutex_unlock(&log_sys.mutex);

If the mutex is the problem, then I'd expect to see a spike on the mutex during the performance gap, but actually there is a lower number of mutex events during the dip.

Comment by Marko Mäkelä [ 2021-05-28 ]

krizhanovsky, I think that the buf_pool.flush_list_mutex hold time is typically rather short, with the exception of special processing of a single tablespace (MDEV-25773). In your graphs, we see a high number of events on the mutex, which I think should be due to buf_flush_note_modification() adding blocks to the start of the list. If we implement my ‘lazy removal’ idea, the numerous buf_page_write_completion() will stop interfering with concurrent buf_flush_note_modification() from user threads.

I have an even wilder idea: Implement buf_pool.flush_list as a lock-free singly-linked list. I would expect that a lock-free implementation of a singly-linked list is almost trivial. We can see that the addition to the list is protected by log_sys.flush_order_mutex. Declaring the ‘next’ pointers as Atomic_relaxed<buf_page_t*> could be a good starting point.

Moving to a singly-linked list would require a rewrite of buf_flush_LRU_list_batch(). Currently it iterates the buf_pool.LRU list, and then invokes buf_flush_discard_page() for individual blocks. When removing from a singly-linked list, we need to know the address of the preceding block. We get it for free when traversing the buf_pool.flush_list, but in buf_flush_LRU_list_batch() an extra traversal of the list would be needed.

Also the function buf_flush_relocate_on_flush_list() (which is only used with ROW_FORMAT=COMPRESSED tables) could take a significant performance hit from moving to a singly-linked list.

I will not be able to implement any of this before I have made some significant progress with MDEV-25506 and MDEV-25783.

Comment by Marko Mäkelä [ 2021-06-08 ]

I worked on a prototype that implements the lazy deletion from buf_pool.flush_list. The test innodb.ibuf_not_empty is failing, presumably due to a missed page write, so it is not ready for stress testing or performance testing yet.

I noticed that buf_page_write_complete() is also acquiring buf_pool.mutex. Also, some other flush list traversal is covered by both buf_pool.mutex and buf_pool.flush_list_mutex. This is likely to be somewhat of a bottleneck, and we should try to reduce the use of buf_pool.mutex.

Comment by Marko Mäkelä [ 2021-06-22 ]

After some experiments, I do not think that we can easily remove buf_pool.mutex from buf_page_write_complete().

Comment by Marko Mäkelä [ 2021-06-23 ]

Some follow-up work was postponed to MDEV-26004 due to mixed results.

Generated at Thu Feb 08 09:35:14 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.