[MDEV-25093] Adaptive flushing fails to kick in even if innodb_adaptive_flushing_lwm is hit. (possible regression) Created: 2021-03-09  Updated: 2021-06-30  Resolved: 2021-04-28

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5.7, 10.5.8, 10.5.9
Fix Version/s: 10.5.10

Type: Bug Priority: Blocker
Reporter: Krunal Bauskar Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: regression

Issue Links:
Problem/Incident
is caused by MDEV-23855 InnoDB log checkpointing causes regre... Closed
Relates
relates to MDEV-24949 Enabling idle flushing (possible regr... Closed
relates to MDEV-25113 Reduce effect of parallel background ... Closed
relates to MDEV-25557 Document that innodb_adaptive_flushin... Closed
relates to MDEV-26055 Adaptive flushing is still not gettin... Closed

 Description   
  • InnoDB flushing should happen if either of the following conditions is true:

  a. dirty_pct (percentage of dirty pages in the buffer pool) > innodb_max_dirty_pages_pct_lwm
  b. or the innodb_adaptive_flushing_lwm limit is reached (default 10%)

  • Condition (b) represents pressure on the redo log. Even if (a) is not reached,
      (b) should cause flushing to start in order to reduce that pressure.
  • The investigation so far shows that condition (b)
      is not causing adaptive flushing to kick in.
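
The two triggers above can be written as a single predicate. The following is a minimal sketch, not InnoDB code: the function name should_flush, its parameter names, and the use of the redo log capacity as the base for the 10% threshold are assumptions made for illustration.

```python
def should_flush(dirty_pct, max_dirty_pages_pct_lwm,
                 checkpoint_age, redo_log_capacity,
                 adaptive_flushing_lwm_pct=10.0):
    """Return True if either trigger from the description holds.

    Hypothetical helper: names and units are illustrative only.
    """
    # (a) dirty-page pressure in the buffer pool
    dirty_trigger = dirty_pct > max_dirty_pages_pct_lwm
    # (b) redo log pressure: checkpoint age past the low water mark
    lwm_bytes = redo_log_capacity * adaptive_flushing_lwm_pct / 100.0
    redo_trigger = checkpoint_age > lwm_bytes
    return dirty_trigger or redo_trigger
```

For example, should_flush(30.0, 70.0, 4 * 2**30, 20 * 2**30) is true even though only 30% of the pages are dirty, because 4 GiB of checkpoint age exceeds 10% of a 20 GiB log.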

---------------------------------------------------------------------------

Let's understand this with a quick experiment.

Let's say we have a very large buffer pool (10M pages = 160 GB).
Also, let's set innodb_max_dirty_pages_pct_lwm = 70%, which means flushing
will not happen until we reach that limit unless adaptive flushing kicks in
(with 69 GB of data the limit is never hit).

Adaptive flushing should kick in when pressure builds up on the redo log;
this is controlled by innodb_adaptive_flushing_lwm (default 10%, left unchanged for the experiment).

I am running an update-index workload in parallel, and as can be seen below, despite the redo log
crossing the 10% (innodb_adaptive_flushing_lwm) limit, flushing fails to kick in
[condition (b) is true].

Ideally, on crossing 10% of the redo-log size (20GB * 10% = 2GB) it should start flushing.
Max-checkpoint age is correctly set to 85% of the redo-log size (I recall it should be 80-85%).

MariaDB [(none)]> show status like 'Innodb_buffer_pool_pages%'; show status like 'Innodb_checkpoint_%';
+----------------------------------+-------------+
| Variable_name                    | Value       |
+----------------------------------+-------------+
| Innodb_buffer_pool_pages_data    | 4496537     |
| Innodb_buffer_pool_pages_dirty   | 3100258     |
| Innodb_buffer_pool_pages_flushed | 0           |
| Innodb_buffer_pool_pages_free    | 5826663     |
.....
| Innodb_checkpoint_age            | 4260770018  |
| Innodb_checkpoint_max_age        | 17393908102 |
+----------------------------------+-------------+

MariaDB [(none)]> show status like 'Innodb_buffer_pool_pages%'; show status like 'Innodb_checkpoint_%';
+----------------------------------+-------------+
| Variable_name                    | Value       |
+----------------------------------+-------------+
| Innodb_buffer_pool_pages_data    | 4523411     |
| Innodb_buffer_pool_pages_dirty   | 4483055     |
| Innodb_buffer_pool_pages_flushed | 0           |
| Innodb_buffer_pool_pages_free    | 5799789     |
.....
| Innodb_checkpoint_age            | 15647589898 |
| Innodb_checkpoint_max_age        | 17393908102 |
+----------------------------------+-------------+
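
For reference, the two snapshots above can be reduced to the ratios that matter. This sketch only does arithmetic on the reported counters. It assumes that data plus free pages approximate the buffer pool size and that "20GB" means 20 GiB; neither is stated in the report, and the helper name summarize is hypothetical.

```python
# Counters copied from the two SHOW STATUS snapshots above.
snapshots = [
    {"data": 4496537, "dirty": 3100258, "free": 5826663,
     "checkpoint_age": 4260770018, "checkpoint_max_age": 17393908102},
    {"data": 4523411, "dirty": 4483055, "free": 5799789,
     "checkpoint_age": 15647589898, "checkpoint_max_age": 17393908102},
]

REDO_LOG_BYTES = 20 * 2**30        # assumption: "20GB" read as 20 GiB
LWM_BYTES = REDO_LOG_BYTES * 0.10  # innodb_adaptive_flushing_lwm = 10%


def summarize(s):
    """Derive dirty-page and redo-log ratios from one snapshot."""
    pool_pages = s["data"] + s["free"]  # assumption: approximates pool size
    return {
        "dirty_pct": 100.0 * s["dirty"] / pool_pages,
        "age_over_lwm": s["checkpoint_age"] > LWM_BYTES,
        "age_pct_of_max": 100.0 * s["checkpoint_age"] / s["checkpoint_max_age"],
    }


for s in snapshots:
    print(summarize(s))
```

Under these assumptions both snapshots stay well below the 70% dirty-page mark (about 30% and 43%), while the checkpoint age is already past the 2 GiB low water mark and reaches roughly 90% of Innodb_checkpoint_max_age in the second snapshot, with Innodb_buffer_pool_pages_flushed still at 0.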

Version tested on: 10.5 (#4498714)

And of course, a sudden drop in tps is seen once the redo log hits the max checkpoint age (84K -> 34K):

[ 255s ] thds: 1024 tps: 84861.92 qps: 84862.12 (r/w/o: 0.00/84862.12/0.00) lat (ms,95%): 12.75 err/s: 0.00 reconn/s: 0.00
[ 260s ] thds: 1024 tps: 78755.87 qps: 78755.87 (r/w/o: 0.00/78755.87/0.00) lat (ms,95%): 12.30 err/s: 0.00 reconn/s: 0.00
[ 265s ] thds: 1024 tps: 34419.32 qps: 34419.32 (r/w/o: 0.00/34419.32/0.00) lat (ms,95%): 27.17 err/s: 0.00 reconn/s: 0.00
[ 270s ] thds: 1024 tps: 53913.70 qps: 53913.70 (r/w/o: 0.00/53913.70/0.00) lat (ms,95%): 13.70 err/s: 0.00 reconn/s: 0.00
[ 275s ] thds: 1024 tps: 59043.41 qps: 59043.41 (r/w/o: 0.00/59043.41/0.00) lat (ms,95%): 14.73 err/s: 0.00 reconn/s: 0.00
[ 280s ] thds: 1024 tps: 73390.11 qps: 73390.11 (r/w/o: 0.00/73390.11/0.00) lat (ms,95%): 13.70 err/s: 0.00 reconn/s: 0.00

---------------

This issue looks like a regression. Older versions should be studied to find out when it started, but it is likely present from 10.5 onwards only.



 Comments   
Comment by Marko Mäkelä [ 2021-04-28 ]

It seems to me that this regression was introduced in MDEV-23855. Already in 10.5.7 we would skip any flushing (including adaptive flushing) if the following condition holds:

    if (dirty_pct < srv_max_dirty_pages_pct_lwm)
      continue;

That condition was later revised in MDEV-24537, but this regression remained.
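
To illustrate why that guard masks the redo log condition, here is a small Python model of the check quoted above. It is a sketch, not the server code: skipped_by_guard is a hypothetical name, and the checkpoint age is passed in only to show that the guard never looks at it. The 43% and 70% figures are illustrative values in the range discussed in the description.

```python
def skipped_by_guard(dirty_pct, srv_max_dirty_pages_pct_lwm, checkpoint_age):
    """Model of the quoted early 'continue' (hypothetical helper).

    checkpoint_age is accepted but deliberately unused, mirroring the
    fact that the quoted condition only compares dirty_pct.
    """
    return dirty_pct < srv_max_dirty_pages_pct_lwm


# Dirty pages at 43% with a 70% low water mark: the iteration is
# skipped no matter how large the checkpoint age has grown.
for age in (0, 4260770018, 15647589898):
    print(age, skipped_by_guard(43.0, 70.0, age))
```

Every line of the loop reports True, so in this model no amount of redo log pressure can get past the guard while dirty_pct stays below the low water mark.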

Comment by Krunal Bauskar [ 2021-04-28 ]

As per the original semantics (please refer to the MySQL documentation): https://dev.mysql.com/doc/refman/8.0/en/innodb-buffer-pool-flushing.html

The innodb_adaptive_flushing_lwm variable defines a low water mark for redo log capacity. When that threshold is crossed, adaptive flushing is enabled, even if the innodb_adaptive_flushing variable is disabled.

So srv_adaptive_flushing doesn't make a difference once the threshold is crossed.

Comment by Marko Mäkelä [ 2021-04-28 ]

Thank you, krunalbauskar! I was not aware that innodb_adaptive_flushing=OFF does not necessarily mean "no". greenman, please update our documentation on that.

Comment by Ian Gilfillan [ 2021-04-28 ]

Thanks, added MDEV-25557 to track this.

Generated at Thu Feb 08 09:35:05 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.