[MDEV-31227] innodb_flush_method=O_DIRECT causes 3x regression in workload Created: 2023-05-09  Updated: 2023-08-15  Resolved: 2023-07-06

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6.12, 10.6.13
Fix Version/s: N/A

Type: Bug Priority: Minor
Reporter: Rob Schwyzer Assignee: Aleksey Midenkov
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

CentOS 7, kernel 3.10


Issue Links:
Blocks
is blocked by MDEV-29967 innodb_read_ahead_threshold (linear r... Closed
Duplicate
duplicates MDEV-27772 Performance regression with default c... Open
Relates
relates to MDEV-30567 rec_get_offsets() is not optimal Closed
relates to MDEV-30986 Slow full index scan in 10.6 vs 10.5 ... Closed

 Description   

The repro procedure is the same as in MDEV-30567. Additional details are in the comments.

In short, this is the performance discrepancy we're seeing in 10.6.13 (10.6.12 performs similarly, but slightly worse):

# Default (O_DIRECT)
# * Q0: real: 14.21  min: 14.21  max: 14.21
 
# innodb-flush-method = fsync
# * Q0: real: 4.94  min: 4.94  max: 4.94
 
# innodb-flush-method = O_DSYNC
# * Q0: real: 4.92  min: 4.92  max: 4.92
 
# innodb-flush-method = littlesync
# * Q0: real: 4.91  min: 4.91  max: 4.91
 
# innodb-flush-method = nosync
# * Q0: real: 4.95  min: 4.95  max: 4.95
 
# innodb-flush-method = O_DIRECT
# * Q0: real: 14.21  min: 14.21  max: 14.21
 
# innodb-flush-method = O_DIRECT_NO_FSYNC
# * Q0: real: 12.81  min: 12.81  max: 12.81

The customer for whom we are concerned about this issue is currently using 10.4, where innodb_flush_method=fsync is the default. In 10.6, however, the default changes to O_DIRECT. For now, we're advising the customer to explicitly set innodb_flush_method=fsync on 10.6. However, we know that O_DIRECT was made the default in 10.6 because it generally improves over fsync, so we'd like to know whether the regression we're seeing here is due to a bug that, if fixed, would make O_DIRECT the all-around ideal it is intended to be.
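For reference, the workaround we are advising amounts to pinning the old default in the server option file; a minimal sketch (the [mariadb] group name is an assumption — any option group the server reads would do):

```ini
# Revert to the pre-10.6 default flush method until the regression is understood
[mariadb]
innodb_flush_method = fsync
```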



 Comments   
Comment by Marko Mäkelä [ 2023-05-10 ]

As per MDEV-29967, the linear read-ahead is broken already in 10.5.

Also MDEV-30986 has been filed for something similar.

Comment by Marko Mäkelä [ 2023-05-24 ]

rob.schwyzer@mariadb.com, can you please test the MDEV-30986 branch, which implements logical read-ahead?

Comment by Marko Mäkelä [ 2023-06-07 ]

If I understood correctly, the workload consists solely of queries like the following:

SELECT c1 FROM `table` ORDER BY c1 LIMIT 1;

This looks like random I/O to me. Would setting innodb_random_read_ahead=ON help here? It is disabled by default.
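For what it's worth, innodb_random_read_ahead is a dynamic variable, so it could be tried without a server restart; a sketch (the status counter name is the standard InnoDB read-ahead counter):

```sql
-- Enable random read-ahead on the running server
SET GLOBAL innodb_random_read_ahead = ON;
-- Re-run the workload, then check whether random read-ahead actually triggered
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_ahead_rnd';
```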

Comment by Marko Mäkelä [ 2023-06-29 ]

Would the performance of innodb_flush_method=fsync degrade to the level of innodb_flush_method=O_DIRECT if the Linux kernel’s file system cache is emptied before starting the workload? I read somewhere that the command for that should be sudo sysctl vm.drop_caches=3, but I did not try that.

If the performance of innodb_flush_method=fsync will remain better than innodb_flush_method=O_DIRECT even with this, then I would think that the file system cache will act as a cache between the InnoDB buffer pool and the storage.

How large would the file system cache be at the end of the benchmark run? To have a fair comparison, I think that for the innodb_flush_method=O_DIRECT case, the innodb_buffer_pool_size must be increased by that amount, so that both cases will have access to the same amount of memory.

Comment by Aleksey Midenkov [ 2023-07-04 ]

After dropping caches, fsync is slightly worse (7.33 is a good number now):

{10.6} midenok@snsvr01:cmake$ sudo sysctl vm.drop_caches=3
 
vm.drop_caches = 3
{10.6} midenok@snsvr01:cmake$ free -h
              total        used        free      shared  buff/cache   available
Mem:           565G         10G        550G        4.0G        4.7G        549G
Swap:          127G        115M        127G
 
{10.6} midenok@snsvr01:cmake$ mp -q
! Started: 127393
* Q0: real: 9.50  min: 9.50  max: 9.50
! Stopped: 127393
* Server dirty stats:
faults: 155  rfaults: 17424  fsi: 1279376  fso: 184  rss_max: 226332  cpu: 88  sys: 0.56  usr: 9.19  real: 11.04  ctx0: 8  ctx1: 15353
 
{10.6} midenok@snsvr01:cmake$ free -h
              total        used        free      shared  buff/cache   available
Mem:           565G         10G        549G        4.0G        5.3G        549G
Swap:          127G        115M        127G
 
{10.6} midenok@snsvr01:cmake$ mp -q
! Started: 127686
* Q0: real: 7.33  min: 7.33  max: 7.33
! Stopped: 127686
* Server dirty stats:
rfaults: 16757  fsi: 160  fso: 184  rss_max: 229176  cpu: 95  sys: 0.43  usr: 8.00  real: 8.84  ctx0: 19  ctx1: 7081
 
{10.6} midenok@snsvr01:cmake$ free -h
              total        used        free      shared  buff/cache   available
Mem:           565G         10G        549G        4.0G        5.3G        549G
Swap:          127G        115M        127G

O_DIRECT is just as bad:

{10.6} midenok@snsvr01:cmake$ mp -q
! Started: 128117
* Q0: real: 14.46  min: 14.46  max: 14.46
! Stopped: 128117
* Server dirty stats:
rfaults: 20930  fsi: 1300064  fso: 184  rss_max: 245112  cpu: 68  sys: 0.48  usr: 10.52  real: 15.96  ctx0: 43  ctx1: 44265
 
{10.6} midenok@snsvr01:cmake$ free -h
              total        used        free      shared  buff/cache   available
Mem:           565G         10G        549G        4.0G        5.3G        549G
Swap:          127G        115M        127G

Marko, I'd like to draw your attention to the fsi and ctx1 numbers. For O_DIRECT they are much larger.

fsi: Number of file system inputs by the process.
ctx1: Number of times that the program was context-switched voluntarily, for instance while waiting for an I/O operation to complete.

Comment by Marko Mäkelä [ 2023-07-05 ]

midenok, thank you for the update. If I interpreted the output correctly, the file system cache grew by 0.6 GiB (from 4.7GiB to 5.3GiB) during the test with innodb_flush_method=fsync (file system cache enabled). If that is the case, for the innodb_flush_method=O_DIRECT test to be fair, the innodb_buffer_pool_size should have been increased by the same amount (and the file system cache emptied before running the test).
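As a back-of-the-envelope sketch of that adjustment (the 0.6 GiB growth is taken from the free -h output above; the 128 MiB default pool size is an assumption at this point in the thread):

```shell
# File system cache grew ~0.6 GiB during the fsync run; to make the
# O_DIRECT run comparable, add that growth to the buffer pool size.
growth_bytes=$(( 6 * 1024 * 1024 * 1024 / 10 ))   # 0.6 GiB in bytes
default_pool=$(( 128 * 1024 * 1024 ))             # 128 MiB default pool
echo $(( default_pool + growth_bytes ))           # 778462822, i.e. ~742 MiB
```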

Which server configuration parameters had been specified? I do not see any server parameters mentioned either in this ticket or in the test script itself.

In particular, had innodb_random_read_ahead=ON been set? Would the prototype of logical read-ahead (MDEV-30986) improve the O_DIRECT performance? I suppose that when the Linux kernel’s file system cache is being used, the kernel can submit some read-ahead requests on its own.

The most important one would be innodb_buffer_pool_size. The output rss_max is from time(1):

%M  Maximum resident set size of the process during its lifetime, in Kbytes.

Based on this and the amount of available memory, I would assume innodb_buffer_pool_size to be over 100 gigabytes, and hence the 0.6 GiB increase of the file system cache might not make much difference. However, I do not think that the kernel's file system cache would be covered in this number. We would need some monitoring of the global system state.

The ctx1 and fsi are from the following time(1) attributes:

%w  Number of waits: times that the program was context-switched voluntarily, for instance while waiting for an I/O operation to complete.
%I  Number of filesystem inputs by the process.

Supposedly, copying data from the kernel’s file system cache to the memory of the process would not affect either metric.
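For reference, these counters can be reproduced outside the benchmark harness with GNU time (a sketch; it assumes /usr/bin/time is GNU time rather than the shell builtin, and note that its report goes to stderr):

```shell
# %M = peak RSS in KiB, %I = file system inputs, %w = voluntary context switches
/usr/bin/time -f "rss_max=%M  fsi=%I  ctx1=%w" cat /proc/cpuinfo > /dev/null
```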

Read latency would be tracked better by the instrumentation that MDEV-31558 is adding.

I would expect there to be a number of performance improvements in the Linux kernel, especially in 5.x and 6.x.

Comment by Marko Mäkelä [ 2023-07-05 ]

There are two more sources of nondeterminism that should be accounted for, to prevent any unexpected "warm-up" of the buffer pool:

  • If the file ib_buffer_pool exists in the data directory, it must be deleted before starting the server under test.
  • Crash recovery, or purging the history of committed transactions: The server should previously have been shut down with innodb_fast_shutdown=0.
Comment by Aleksey Midenkov [ 2023-07-05 ]

innodb_buffer_pool_size did the trick.

O_DIRECT

* Q0: real: 7.52  min: 7.52  max: 7.52
! Stopped: 220861
* Server dirty stats:
rfaults: 18686  fsi: 1238336  fso: 464  rss_max: 748692  cpu: 110  sys: 1.17  usr: 8.78  real: 9.04  ctx0: 20  ctx1: 128360

fsync

* Q0: real: 7.55  min: 7.55  max: 7.55
! Stopped: 221140
* Server dirty stats:
rfaults: 19576  fsi: 2144  fso: 456  rss_max: 744456  cpu: 110  sys: 1.24  usr: 8.77  real: 9.06  ctx0: 43  ctx1: 11396

InnoDB config:

innodb_random_read_ahead=ON
innodb_buffer_pool_size=1342177280
innodb-flush-method = fsync
# innodb-flush-method = O_DSYNC
# innodb-flush-method = littlesync
# innodb-flush-method = nosync
# innodb-flush-method = O_DIRECT
# innodb-flush-method = O_DIRECT_NO_FSYNC
innodb_stats_persistent=OFF
innodb_stats_auto_recalc=OFF

Comment by Aleksey Midenkov [ 2023-07-05 ]

After dropping FS cache, with the above config:

O_DIRECT

{10.6} midenok@snsvr01:cmake$ sudo sysctl vm.drop_caches=3
vm.drop_caches = 3
{10.6} midenok@snsvr01:cmake$ mp -q
! Started: 222519
* Q0: real: 8.31  min: 8.31  max: 8.31
! Stopped: 222519
* Server dirty stats:
faults: 149  rfaults: 21529  fsi: 1275536  fso: 456  rss_max: 752284  cpu: 104  sys: 1.30  usr: 8.98  real: 9.80  ctx0: 50  ctx1: 117028

fsync

{10.6} midenok@snsvr01:cmake$ sudo sysctl vm.drop_caches=3
vm.drop_caches = 3
{10.6} midenok@snsvr01:cmake$ mp -q
! Started: 223108
* Q0: real: 7.86  min: 7.86  max: 7.86
! Stopped: 223108
* Server dirty stats:
faults: 162  rfaults: 23074  fsi: 1280512  fso: 456  rss_max: 761780  cpu: 107  sys: 1.17  usr: 8.80  real: 9.32  ctx0: 45  ctx1: 103667

Comment by Aleksey Midenkov [ 2023-07-05 ]

O_DIRECT innodb_random_read_ahead=OFF vs ON

innodb_random_read_ahead=OFF

innodb_random_read_ahead=OFF
innodb_buffer_pool_size=1342177280
innodb-flush-method = O_DIRECT
innodb_stats_persistent=OFF
innodb_stats_auto_recalc=OFF

{10.6} midenok@snsvr01:cmake$ sudo sysctl vm.drop_caches=3 && mp -q3
vm.drop_caches = 3
! Started: 245416
* Q0: real: 10.33  min: 10.33  max: 10.33
* Q1: real: 7.23  min: 7.23  max: 10.33
* Q2: real: 7.35  min: 7.23  max: 10.33
* Average for 3 queries:
real: 8.30  min: 7.23  max: 10.33
! Stopped: 245416
* Server dirty stats:
faults: 164  rfaults: 21545  fsi: 1278528  fso: 456  rss_max: 741840  cpu: 91  sys: 0.93  usr: 23.69  real: 26.91  ctx0: 60  ctx1: 87688

innodb_random_read_ahead=ON

innodb_random_read_ahead=ON
innodb_buffer_pool_size=1342177280
innodb-flush-method = O_DIRECT
innodb_stats_persistent=OFF
innodb_stats_auto_recalc=OFF

{10.6} midenok@snsvr01:cmake$ sudo sysctl vm.drop_caches=3 && mp -q3
vm.drop_caches = 3
! Started: 245958
* Q0: real: 8.08  min: 8.08  max: 8.08
* Q1: real: 7.40  min: 7.40  max: 8.08
* Q2: real: 7.31  min: 7.31  max: 8.08
* Average for 3 queries:
real: 7.59  min: 7.31  max: 8.08
! Stopped: 245958
* Server dirty stats:
faults: 164  rfaults: 21209  fsi: 1278112  fso: 472  rss_max: 745548  cpu: 100  sys: 1.24  usr: 23.55  real: 24.68  ctx0: 54  ctx1: 118957

Comment by Aleksey Midenkov [ 2023-07-05 ]

fsync innodb_random_read_ahead=OFF vs ON

innodb_random_read_ahead=OFF

innodb_random_read_ahead=OFF
innodb_buffer_pool_size=1342177280
innodb-flush-method = fsync
innodb_stats_persistent=OFF
innodb_stats_auto_recalc=OFF

{10.6} midenok@snsvr01:cmake$ sudo sysctl vm.drop_caches=3 && mp -q3
vm.drop_caches = 3
! Started: 246520
* Q0: real: 8.53  min: 8.53  max: 8.53
* Q1: real: 7.19  min: 7.19  max: 8.53
* Q2: real: 7.13  min: 7.13  max: 8.53
* Average for 3 queries:
real: 7.61  min: 7.13  max: 8.53
! Stopped: 246520
* Server dirty stats:
faults: 164  rfaults: 24188  fsi: 1280376  fso: 456  rss_max: 762416  cpu: 96  sys: 0.96  usr: 23.11  real: 24.84  ctx0: 25  ctx1: 65197

innodb_random_read_ahead=ON

innodb_random_read_ahead=ON
innodb_buffer_pool_size=1342177280
innodb-flush-method = fsync
innodb_stats_persistent=OFF
innodb_stats_auto_recalc=OFF

{10.6} midenok@snsvr01:cmake$ sudo sysctl vm.drop_caches=3 && mp -q3
vm.drop_caches = 3
! Started: 247049
* Q0: real: 8.32  min: 8.32  max: 8.32
* Q1: real: 7.59  min: 7.59  max: 8.32
* Q2: real: 7.48  min: 7.48  max: 8.32
* Average for 3 queries:
real: 7.79  min: 7.48  max: 8.32
! Stopped: 247049
* Server dirty stats:
faults: 164  rfaults: 31906  fsi: 1281136  fso: 472  rss_max: 765080  cpu: 101  sys: 1.29  usr: 24.53  real: 25.34  ctx0: 44  ctx1: 103060

Comment by Marko Mäkelä [ 2023-07-05 ]

Thank you! One of my suspicions was that this was a case of the working set exceeding innodb_buffer_pool_size, with all "excess" RAM then acting as a cache between the durable storage and InnoDB (fsync), or with no kernel file system cache available at all (O_DIRECT).

Here is one more resource that I just came across: the README.md file of the nocache tool explains how you might be able to limit the size of the Linux kernel’s file system cache while a process is running. It might be useful for a valid comparison when using a smaller InnoDB buffer pool.

Is there anything that needs to be changed in the code for this particular ticket? Out of curiosity, what was the original innodb_buffer_pool_size? The default 128MiB (vs. 1280MiB for the updated results)?

Comment by Aleksey Midenkov [ 2023-07-06 ]

> The default 128MiB (vs. 1280MiB for the updated results)?

Right, it was the default.

Comment by Marko Mäkelä [ 2023-07-07 ]

In MDEV-30986 the following approach worked for me. The statement under test would access about 400MiB of data.

mkdir /var/tmp/var && rm -fr var && ln -s /var/tmp/var var
./mtr --mysqld=--innodb-buffer-pool-dump-at-shutdown=0 name_of_test
./mtr --mysqld=--innodb-buffer-pool-size=400m --mysqld=--innodb-buffer-pool-dump-at-shutdown=0 name_of_test
sudo sysctl vm.drop_caches=3
systemd-run --scope --property=MemoryLimit=250M -- ./mtr --mysqld=--innodb-flush-method=fsync --mysqld=--innodb-buffer-pool-dump-at-shutdown=0 name_of_test
sudo sysctl vm.drop_caches=3
systemd-run --scope --property=MemoryLimit=550M -- ./mtr --mysqld=--innodb-flush-method=fsync --mysqld=--innodb-buffer-pool-size=400m --mysqld=--innodb-buffer-pool-dump-at-shutdown=0 name_of_test

For that test, I got the best read performance (for either buffer pool size) by using the default innodb_flush_method=O_DIRECT and enabling innodb_random_read_ahead=ON.

Generated at Thu Feb 08 10:22:14 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.