[MDEV-37692] Huge performance decrease when using innodb_flush_method=O_DIRECT vs fsync - Jira

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 10.6, 10.11
Fix Version/s: N/A
Component/s: Server
Labels:
- cache
- innodb
- performance
- regression
Environment:
Linux (AWS AL2023); EBS gp3 with default settings (IOPS=3000; Throughput=125 MiB/s); MariaDB 10.11.13; 16 GB RAM; 4 CPU cores.
All default MariaDB settings.

Bug Category:
Not for Release Notes

Description

After upgrading from 10.5 to 10.11, I noticed a massive increase in I/O, resulting in over 80% wa (I/O wait time) reported by top and rendering the application (a website) unusable under normal peak load.

After investigating, we found that the change to the default innodb_flush_method was the problem: changing back to fsync (from O_DIRECT) resolved it.

This report is to provide more information to the developers to consider for future changes in environments that may differ from what they're testing with.

The attached image shows the volume metrics with 10.5 using fsync, 10.11 using O_DIRECT, and 10.11 using fsync.

The issue seems to be that with O_DIRECT the server issues over 1,000x more read I/O operations per second, and each read seems to be exactly 16 KiB. AWS limits the number of I/O operations per second and throttles or queues anything above that limit. Raising the limit helps alleviate the situation, but issuing so many calls is ALWAYS slower when the disk is not directly attached to the same machine (typical of cloud environments).
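To illustrate why many small reads run into the IOPS limit long before the throughput limit, here is a rough back-of-the-envelope sketch. It uses the gp3 figures from the Environment field and assumes the throughput figure is in MiB/s; the function names are made up for this example, and it ignores latency, bursting, and any merging of requests.

```python
# Back-of-the-envelope check: which gp3 limit does a stream of 16 KiB reads hit first?
# The limits come from the Environment field; the MiB/s unit is an assumption.

KIB = 1024
MIB = 1024 * KIB

IOPS_LIMIT = 3000              # gp3 default IOPS from the report
THROUGHPUT_LIMIT_MIB_S = 125   # gp3 default throughput, assumed to be MiB/s
READ_SIZE = 16 * KIB           # read size observed in the report


def max_throughput_mib_s(io_size_bytes, iops_limit, throughput_limit_mib_s):
    """Return the achievable MiB/s for a fixed I/O size under both volume limits."""
    iops_bound = io_size_bytes * iops_limit / MIB
    return min(iops_bound, throughput_limit_mib_s)


def break_even_io_size_kib(iops_limit, throughput_limit_mib_s):
    """Return the I/O size (KiB) at which both limits are reached together."""
    return throughput_limit_mib_s * MIB / iops_limit / KIB


small_reads = max_throughput_mib_s(READ_SIZE, IOPS_LIMIT, THROUGHPUT_LIMIT_MIB_S)
break_even = break_even_io_size_kib(IOPS_LIMIT, THROUGHPUT_LIMIT_MIB_S)

print(f"16 KiB reads: at most {small_reads:.1f} MiB/s")
print(f"Break-even I/O size: {break_even:.1f} KiB")
```

Under these assumptions, 16 KiB reads top out at roughly 46.9 MiB/s, well below the 125 MiB/s throughput limit, and requests would need to be about 42.7 KiB each before throughput rather than IOPS becomes the constraint.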

fsync seems to be much more efficient in how often it issues I/O, and it makes larger requests (not 16 KiB each), which results in better throughput.

This needs to be communicated very clearly, as it only becomes apparent under heavy load and is not the result of any particular query type or complexity, which makes it difficult to debug.
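Since the symptom is hard to tie to any query, one option is to track the I/O wait share directly. The sketch below shows how the wa figure in top can be approximated from two samples of the aggregate cpu line in /proc/stat on Linux. The helper names and the sample lines are hypothetical and are not taken from this report.

```python
# Sketch: derive the iowait share ("wa" in top) from two samples of the
# aggregate "cpu" line in /proc/stat on Linux.

def parse_cpu_line(line):
    """Return the first eight counters of a /proc/stat 'cpu' line as ints.

    Order: user, nice, system, idle, iowait, irq, softirq, steal.
    Guest counters are skipped because they are already included in user/nice.
    """
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:9]]


def iowait_percent(before_line, after_line):
    """Return the percentage of CPU time spent in iowait between two samples."""
    before = parse_cpu_line(before_line)
    after = parse_cpu_line(after_line)
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


# Hypothetical samples, not measurements from this report.
sample_before = "cpu 1000 0 500 8000 500 0 0 0 0 0"
sample_after = "cpu 1100 0 550 8050 1300 0 0 0 0 0"
print(f"iowait: {iowait_percent(sample_before, sample_after):.1f}%")
```

In practice the two lines would be read from /proc/stat a few seconds apart; a sustained high value during peak load, together with volume metrics showing the IOPS limit being reached, would point at storage rather than at a specific query.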

This also needs to be taken into consideration with the plans to deprecate innodb_flush_method in v11.0: will it still be possible to get exactly the same fsync behavior using the four new dynamic boolean variables? If not, that will be a clear blocker for upgrading unless there is some other solution.

This issue is extremely risky because it only shows up during heavy/peak load, which is normal in production but not easy to replicate in a test environment. Downgrading to the previous major version is not supported and is risky in itself, so can we be sure this won't happen again in the future?

Hope this information helps.