[MDEV-25113] Reduce effect of parallel background flush on select workload Created: 2021-03-11 Updated: 2023-08-28 Resolved: 2021-06-23 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Fix Version/s: | 10.5.12, 10.6.3 |
| Type: | Task | Priority: | Major |
| Reporter: | Krunal Bauskar | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | performance | ||
| Description |
|
Let's say the workload pattern involves a read-write workload followed by a read-only workload. The read-write workload modifies pages, which causes background flushing to become active and flush dirty pages. Flushing continues until dirty_pct < innodb_max_dirty_pages_pct_lwm. This means that even after the read-write workload has ended, background flushing can still be active when the read-only workload starts, which can affect the overall throughput of the select workload. This task is meant to explore whether there is a possibility to reduce the effect of background flushing on an active select workload (e.g. while dirty_pct declines 94->93->92->91).

In the example below, adaptive flushing is active for the first 14 seconds and no flushing takes place after that. As can be seen, tps has a bit of noise in the first part.

Example:
[ 1s ] thds: 8 tps: 5864.27 qps: 93922.16 (r/w/o: 82185.64/0.00/11736.53) lat (ms,95%): 1.44 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 8 tps: 5775.99 qps: 92398.80 (r/w/o: 80846.82/0.00/11551.97) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
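The condition described above can be sketched as follows. This is a minimal illustration, not InnoDB source: the struct and function names are hypothetical, but the decision mirrors the documented behaviour of innodb_max_dirty_pages_pct_lwm, where background flushing stays active until the dirty-page percentage drops below the low-water mark.

```cpp
#include <cassert>

// Hypothetical sketch (names are illustrative, not InnoDB source).
struct buf_pool_stats {
  double n_dirty;  // pages currently in the flush list
  double n_total;  // total pages in the buffer pool
};

static double dirty_pct(const buf_pool_stats &s) {
  return s.n_total ? 100.0 * s.n_dirty / s.n_total : 0.0;
}

// Background flushing remains active until dirty_pct drops below the
// low-water mark, even if the workload has already turned read-only.
static bool keep_flushing(const buf_pool_stats &s, double lwm) {
  return dirty_pct(s) >= lwm;
}
```

This is why a select-only phase that starts with many residual dirty pages still sees flush IO: keep_flushing() stays true until the backlog is drained.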
| Comments |
| Comment by Marko Mäkelä [ 2021-03-11 ] | |||||||||||||
|
This problem was originally observed while we tested
| Comment by Marko Mäkelä [ 2021-05-11 ] | |||||||||||||
|
krizhanovsky ran a test that suggests that the adaptive flushing could be improved. If stable throughput and latency are more important than minimizing write amplification (and maximizing the lifetime of the storage), that graph does show ‘bad’ behaviour. The cause of the ‘badness’ should be that we let the checkpoint age hover around the maximum. Occasionally, the page cleaner would be able to write out a larger burst of pages, presumably because the log had already been durably written up to FIL_PAGE_LSN on all of the pages. (Often, the page cleaner would have to skip such pages, hoping to be able to write them during the next batch, after more of the redo log has been written.)

While I believe that such performance may be desirable in some deployments, something definitely could be done to improve the adaptive flushing. I would like to apply control theory to this problem and implement the adaptive flushing as a PID controller. The inputs would include the checkpoint age and the length of buf_pool.flush_list, possibly relative to buf_pool.LRU (but maybe we should omit pages of temporary or dropped tables from the latter count).
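The PID-controller idea above can be sketched in a few lines. This is a generic textbook PID loop, not an InnoDB implementation; the struct name and gains are hypothetical. The error fed in would be something like (checkpoint age - target age), and the output would be interpreted as a page-flush rate.

```cpp
#include <cassert>

// Hypothetical sketch of the PID-controller idea (illustrative only).
// error > 0 means we are above the target and must flush faster.
struct pid_controller {
  double kp, ki, kd;        // proportional, integral, derivative gains
  double integral = 0.0;
  double prev_error = 0.0;

  double update(double error, double dt) {
    integral += error * dt;
    const double derivative = (error - prev_error) / dt;
    prev_error = error;
    return kp * error + ki * integral + kd * derivative;
  }
};
```

The appeal over the current heuristic is that the integral term remembers a persistent backlog while the derivative term damps the oscillation between flush storms and idle periods that the graphs show.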
| Comment by Alexander Krizhanovsky (Inactive) [ 2021-05-17 ] | |||||||||||||
|
After several iterations of tuning the flushing process for the workload I came to the following strange results. The bad thing about the previous graphs is that while the checkpoint age and the number of dirty pages were growing, there was almost no IO until we hit the flush storm. So I set innodb_max_dirty_pages_pct=30 to make flushing start earlier. With these settings the number of dirty pages is relatively stable, but the LSN age still hits the maximum value. This graph contains buffer_flush_sync_waits, which spikes right during the performance dip.

The strange result about adaptive flushing is that the number of flushed pages reaches its peak right before the performance dip, and IO is significantly reduced during the dip and after it. The disk can handle up to 64K random writes of 16KB size (verified with fio), yet InnoDB still does not reach the maximum IO capacity even during the flush storm.
| Comment by Alexander Krizhanovsky (Inactive) [ 2021-05-24 ] | |||||||||||||
|
I also collected graphs for the same workload with a much smaller buffer pool and a slightly higher IO capacity.
The behaviour looks the same: underutilized IO during normal operation and sharp waves when the checkpoint age reaches the maximum value.
| Comment by Marko Mäkelä [ 2021-05-24 ] | |||||||||||||
|
What if we make mtr_t::commit() trigger a less eager flush earlier? That is, make log_close() return at least one more value to mtr_t::finish_write(), instead of the current Boolean (indicating whether buf_flush_async() will have to be invoked).

It would seem reasonable to restore two checkpoint age limits and introduce another atomic variable buf_flush_async_lsn (in addition to buf_flush_sync_lsn) for the case when only the lower checkpoint age limit is exceeded. In buf_flush_page_cleaner(), we would avoid goto unemployed as long as buf_flush_async_lsn is set. We would reset that variable as soon as we are below the checkpoint age limit again.
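The two-limit scheme might look roughly like this. A minimal sketch under stated assumptions: the function names and the way the limits are passed in are hypothetical; only the variable names buf_flush_async_lsn and buf_flush_sync_lsn come from the comment above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the proposed two checkpoint-age limits
// (illustrative only, not InnoDB source).
static std::atomic<uint64_t> buf_flush_async_lsn{0};
static std::atomic<uint64_t> buf_flush_sync_lsn{0};

// Called from the mini-transaction commit path: crossing the lower limit
// arms asynchronous flushing; crossing the higher limit would demand a
// synchronous flush up to the current LSN.
static void check_margins(uint64_t age, uint64_t async_limit,
                          uint64_t sync_limit, uint64_t current_lsn) {
  if (age >= sync_limit)
    buf_flush_sync_lsn.store(current_lsn);
  else if (age >= async_limit)
    buf_flush_async_lsn.store(current_lsn);
}

// In the page cleaner: skip the "unemployed" state while the async
// target is set, and clear it once the checkpoint age has recovered.
static bool page_cleaner_may_sleep(uint64_t age, uint64_t async_limit) {
  if (age < async_limit)
    buf_flush_async_lsn.store(0);
  return buf_flush_async_lsn.load() == 0;
}
```

The point of the lower threshold is to keep the page cleaner busy before the hard limit is hit, so that bursts near the maximum checkpoint age become less likely.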
| Comment by Marko Mäkelä [ 2021-05-27 ] | |||||||||||||
|
Based on some analysis of stack traces in
| Comment by Marko Mäkelä [ 2021-05-27 ] | |||||||||||||
|
Even simpler idea: use a special value such as oldest_modification=1 to indicate that the page is actually clean but exists as garbage in buf_pool.flush_list. LSN values less than 2048 are impossible by the design of the redo log file. Garbage would be collected (the list member removed and the oldest_modification reset to 0) by any code that encounters such a member in buf_pool.flush_list while holding buf_pool.flush_list_mutex. Also, if a block is evicted and its oldest_modification is not 0, we would first remove it from buf_pool.flush_list.
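The lazy-deletion idea can be sketched as follows. This is an illustration only, assuming a plain std::list stands in for buf_pool.flush_list and a stripped-down page struct stands in for buf_page_t; the helper names are hypothetical. The key invariant is from the comment above: real LSNs are never below 2048, so the value 1 can safely serve as a "clean but not yet detached" marker.

```cpp
#include <cassert>
#include <list>

// Hypothetical sketch of lazy deletion from the flush list
// (illustrative only, not InnoDB source).
struct buf_page {
  unsigned long long oldest_modification;  // 0 = clean and detached
};

// On write completion we only flip the field; the list is not touched,
// so no flush_list_mutex acquisition is needed on this hot path.
static void mark_clean_lazily(buf_page &p) { p.oldest_modification = 1; }

// Any traversal that already holds flush_list_mutex collects the garbage.
static void garbage_collect(std::list<buf_page*> &flush_list) {
  for (auto it = flush_list.begin(); it != flush_list.end(); ) {
    if ((*it)->oldest_modification == 1) {
      (*it)->oldest_modification = 0;  // detach: back to fully clean
      it = flush_list.erase(it);
    } else {
      ++it;
    }
  }
}
```

The benefit is that buf_page_write_complete() no longer has to take the list mutex for every completed write; the cost is that list traversals occasionally do the deferred removals.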
| Comment by Alexander Krizhanovsky (Inactive) [ 2021-05-27 ] | |||||||||||||
|
These statistics were collected with the default values of innodb_flushing_avg_loops, innodb_max_dirty_pages_pct, and innodb_max_dirty_pages_pct_lwm, with Marko's patch MDEV-25113.patch.
If the mutex were the problem, I would expect to see a spike on the mutex during the performance gap, but there is actually a lower number of mutex events during the dip.
| Comment by Marko Mäkelä [ 2021-05-28 ] | |||||||||||||
|
krizhanovsky, I think that the buf_pool.flush_list_mutex hold time is typically rather short, with the exception of special processing of a single tablespace (

I have an even wilder idea: implement buf_pool.flush_list as a lock-free singly-linked list. I would expect that a lock-free implementation of a singly-linked list is almost trivial. We can see that the addition to the list is protected by log_sys.flush_order_mutex. Declaring the ‘next’ pointers as Atomic_relaxed<buf_page_t*> could be a good starting point.

Moving to a singly-linked list would require a rewrite of buf_flush_LRU_list_batch(). Currently it iterates the buf_pool.LRU list and then invokes buf_flush_discard_page() for individual blocks. When removing from a singly-linked list, we need to know the address of the preceding block. We get it for free when traversing buf_pool.flush_list, but in buf_flush_LRU_list_batch() an extra traversal of the list would be needed. Also the function buf_flush_relocate_on_flush_list() (which is only used with ROW_FORMAT=COMPRESSED tables) could take a significant performance hit from moving to a singly-linked list.

I will not be able to implement any of this before I have made some significant progress with
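The "almost trivial" part of the idea above is the push; a compare-exchange loop suffices for concurrent insertion at the head. This is a generic sketch with hypothetical names (std::atomic stands in for Atomic_relaxed, a bare node for buf_page_t); it deliberately shows only insertion, since removal needs the predecessor and is exactly the hard part the comment identifies.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical sketch of a lock-free singly-linked flush list
// (illustrative only; buf_page_t is far richer than this node).
struct node {
  std::atomic<node*> next{nullptr};
};

struct lock_free_list {
  std::atomic<node*> head{nullptr};

  // Concurrent push at the head via a CAS retry loop.
  void push_front(node *n) {
    node *old = head.load(std::memory_order_relaxed);
    do {
      n->next.store(old, std::memory_order_relaxed);
    } while (!head.compare_exchange_weak(old, n,
                                         std::memory_order_release,
                                         std::memory_order_relaxed));
  }

  // Single-threaded traversal, e.g. under an exclusive phase.
  int count() const {
    int n = 0;
    for (node *p = head.load(); p; p = p->next.load()) ++n;
    return n;
  }
};
```

Note that in the proposal the insertions are still serialized by log_sys.flush_order_mutex, so the CAS loop mainly has to be safe against concurrent readers rather than concurrent writers.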
| Comment by Marko Mäkelä [ 2021-06-08 ] | |||||||||||||
|
I worked on a prototype that implements the lazy deletion from buf_pool.flush_list. The test innodb.ibuf_not_empty is failing, presumably due to a missed page write, so it is not ready for stress testing or performance testing yet.

I noticed that buf_page_write_complete() is also acquiring buf_pool.mutex. Also, some other flush list traversal is covered by both buf_pool.mutex and buf_pool.flush_list_mutex. This is likely to be somewhat of a bottleneck, and we should try to reduce the use of buf_pool.mutex.
| Comment by Marko Mäkelä [ 2021-06-22 ] | |||||||||||||
|
After some experiments, I do not think that we can easily remove buf_pool.mutex from buf_page_write_complete().
| Comment by Marko Mäkelä [ 2021-06-23 ] | |||||||||||||
|
Some follow-up work was postponed to