[MDEV-27772] Performance regression with default configuration in 10.6 Created: 2022-02-08  Updated: 2023-08-15

Status: Open
Project: MariaDB Server
Component/s: Server
Affects Version/s: 10.6
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Danilo Spinella Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Environment:

SUSE Enterprise Linux 15 SP4, openSUSE Tumbleweed


Issue Links:
Duplicate
is duplicated by MDEV-31227 innodb_flush_method=O_DIRECT causes 3... Closed
Relates
relates to MDEV-19895 Support "autoset" in SET GLOBAL for A... Open
relates to MDEV-24670 avoid OOM by linux kernel co-operativ... Closed
relates to MDEV-27812 Allow innodb_log_file_size to change ... Closed
relates to MDEV-24854 Change innodb_flush_method=O_DIRECT b... Closed

 Description   

MariaDB 10.6 uses O_DIRECT as the default value for innodb_flush_method instead of fsync. This leads to a performance regression in the default configuration, regardless of the filesystem used.
To benchmark the performance, I used the sysbench 0.4.8 oltp read-only test on btrfs (the most affected), ext4, and xfs.

Filesystem  Threads  innodb_flush_method=fsync  innodb_flush_method=O_DIRECT
ext4        1        6.5s                       7.0s
ext4        16       0.97s                      1.15s
btrfs       1        6.5s                       77s
btrfs       16       0.8s                       3.1s
xfs         1        6.4s                       7.2s
xfs         16       0.8s                       1.1s

I have used a mount point in /abuild and the following configuration in /etc/my.cnf.d/sysbench_testing.cnf:

[mysqld]
port = 3306
datadir=/abuild/mysql-sysbench
socket=/abuild/mysql-sysbench/sysbench-mysql.sock
#innodb_flush_method=fsync

To reproduce, run the following commands:

$ # prepare the db
$ sysbench --test=oltp --mysql-user=root \
       --mysql-socket=/abuild/mysql-sysbench/sysbench-mysql.sock \
       --mysql-db=test --mysql-port=3306 \
       --oltp-dist-type=uniform --oltp-read-only --oltp-order-ranges=0 \
       --oltp-table-size=1000000 --oltp-range-size=1000 prepare
$ # run the benchmark
$ sysbench --test=oltp --mysql-user=root \
       --mysql-socket=/abuild/mysql-sysbench/sysbench-mysql.sock \
       --mysql-db=test --mysql-port=3306 \
       --oltp-dist-type=uniform --oltp-read-only --oltp-order-ranges=0 \
       --oltp-table-size=1000000 --oltp-range-size=1000 \
       --num-threads=$NTHREADS run
$ # clean up the db
$ sysbench --test=oltp --mysql-user=root \
       --mysql-socket=/abuild/mysql-sysbench/sysbench-mysql.sock \
       --mysql-db=test --mysql-port=3306 cleanup >/dev/null 2>&1

I am also aware of testing by a colleague that showed a similar regression on xfs (353.5476s in 10.6 vs 16.9560s in 10.5).



 Comments   
Comment by Vladislav Vaintroub [ 2022-02-08 ]

but if you benchmark a table with 1000 rows, instead of 1 million, it will probably not result in a performance regression, right?

Comment by Marko Mäkelä [ 2022-02-08 ]

If this is with the default innodb_buffer_pool_size=128M and innodb_log_file_size=96M, then this is as expected. Due to O_DIRECT, any page reads due to the (by today’s standards) tiny buffer pool cannot be served from the Linux file system cache.

The InnoDB buffer pool itself is supposed to be the data file cache. A tiny buffer pool will also cause more frequent page writes and page checksum calculations on both reads and writes.

I expect MDEV-19895 to address this.

Comment by Danilo Spinella [ 2022-02-08 ]

Yeah, with 1000 rows there is almost no difference (I just tried with btrfs).

Comment by Vladislav Vaintroub [ 2022-02-09 ]

OK, for danyspin97, once again, the explanation:

  • if innodb_flush_method=fsync, the filesystem cache serves as a not very efficient level-2 cache for the database
  • if innodb_flush_method=O_DIRECT, there is no level-2 cache, so to speak, but you can make your level-1 cache much bigger with innodb_buffer_pool_size=<size>

There is a certain appeal to innodb_flush_method=fsync: you can misconfigure your database instance and it will still perform OK-ish, though of course not ideally. Unlike SQL Server, InnoDB unfortunately does not yet have a buffer pool that automatically grows or shrinks depending on the load.

As marko says, the 128M default is tiny; in fact, this would be SQL Server's minimum. Maybe 1G should be the new default.
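As a minimal sketch of the tuning being described, the override could go in a file under /etc/my.cnf.d/ (the 1G figure is the suggestion from this comment, not an official recommendation): keep the O_DIRECT default but give the level-1 cache an explicit size.

```ini
# Illustrative only: size the buffer pool to your working set and RAM.
[mysqld]
innodb_flush_method = O_DIRECT
innodb_buffer_pool_size = 1G
```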

Comment by Danilo Spinella [ 2022-02-09 ]

Thank you for the detailed explanation.

However, in the default configuration, fsync outperforms O_DIRECT (20 times faster in some cases). Would it make sense to have fsync as the default until MDEV-19895 has been fixed? That way, O_DIRECT would be used by people who also tweak innodb_buffer_pool_size according to their needs, and it wouldn't lead to a performance regression in the default configuration. I see that the auto-shrink feature will be added in 10.9, which, from my understanding, won't come out any time soon.

Comment by Vladislav Vaintroub [ 2022-02-09 ]

In the default configuration, on your benchmark, whose working set does not fit into the buffer pool, fsync outperforms. That would be the more accurate statement, as you confirmed that when your benchmark fits into the buffer pool, the 20x difference does not happen.

We could check whether the buffer pool size was not set, and in that case choose fsync over O_DIRECT, perhaps writing:
[Note] Using fsync for innodb_flush_method, because innodb_buffer_pool_size was not configured
marko?

In other news, I would not like InnoDB memory being auto-set to 80% of my RAM as per MDEV-19895, going from the previous default of 128MB.
I'm more for intelligent, SQL Server-style buffer pool autoallocation: growing and shrinking as needed, taking into account read and write rates, the amount of RAM available, and the maximum buffer pool size, if configured.

Comment by Marko Mäkelä [ 2022-02-09 ]

wlad, I think that someone is always guaranteed to be surprised or disappointed, whatever we do.

danyspin97, we try to follow a policy that the default values of parameters are only changed before a server is released as generally available. Sometimes we do deviate from it when there are strong reasons (see MDEV-27734 for a recent example). I don’t think that this is a strong enough reason to do anything about innodb_flush_method. The current default setting O_DIRECT should work on properly configured systems: if the working set fits in the default-size buffer pool, or if you have configured the buffer pool size to match the available RAM, it works well.

If MDEV-19895 made the default innodb_buffer_pool_size 80% of the available RAM, and the default innodb_log_file_size the same as innodb_buffer_pool_size, someone could be upset that their tiny database is hogging all RAM for no reason, or their mostly read-only database is suddenly using a huge redo log file. Or they never cared about poor write performance when using the current default innodb_log_file_size=96MiB.

I do not think that there can ever be a default configuration that really works for everyone in all cases. It is hard to say what is a ‘typical’ or ‘common’ use case.

When it comes to the innodb_buffer_pool_size, there are some ideas for improving it in MDEV-24670. I do not know how much of that could be implemented on other operating systems than Linux.

Theoretically, we might also resize the log file based on workload and some limits configured by the DBA.

Comment by Danilo Spinella [ 2022-02-09 ]

I think 20x slower on some basic read queries in the default configuration is a strong enough reason to revert this change, especially since the benchmark doesn't show any difference for small datasets (i.e. when innodb_buffer_pool_size is big enough).

Saying that the default configuration requires the user to configure the system is totally counterintuitive. If I am configuring the system properly, I can set O_DIRECT manually. Indeed, no default configuration can ever work for everyone in all cases, but I can't understand why fsync should not be the default, since, in the default configuration, it worked for more use cases than O_DIRECT does.

Comment by Vladislav Vaintroub [ 2022-02-09 ]

Note that while I do support fsync (if innodb_buffer_pool_size is not configured), I think you're pushing this 20x a little too much. First, why would anyone consider btrfs at all? It's not the default, and its read speed, as shown by your benchmark, is questionable. For a database, I'd probably stay away from btrfs.

Second, asking how much buffer pool to dedicate to the database might be considered the installer's, or if you want, the packager's responsibility. MariaDB's Windows MSI by default suggests 12.5% of RAM, and also lets the user override this, in both UI and silent mode. The MSI has been doing this for 10 years, so maybe zypper can also consider asking the user that one question.

Comment by Otto Kekäläinen [ 2022-02-09 ]

Can't we make MariaDB/InnoDB emit a warning or suggestion to users if it detects that the below is not the case?

If the working set fits in the default-size buffer pool, or if you have configured the buffer pool size to match the available RAM, it works well.

We can't rely on deb/rpm/msi packaging here. Many systems run installs/upgrades automatically and unattended; there is no human to tell things to, and even less chance to ask them to decide on something. In any case, workloads may change over time, long after the installation was done. The best place to decide on good defaults is MariaDB itself, and the best place to tell users something is wrong or suboptimal is in MariaDB's error log at runtime.

Comment by Danilo Spinella [ 2022-02-10 ]

First, why would anyone consider btrfs at all

The 20x difference is on xfs. My benchmark shows a regression on btrfs as well. I don't think making assumptions about the filesystem used is a good idea either.

read speed as shown by your benchmark, is questionable

Why is it questionable? Could you please explain further?

Comment by Marko Mäkelä [ 2022-02-11 ]

There exist monitoring tools that can build on existing instrumentation in the server to highlight problems. I do not think that such logic should be duplicated in the database server itself. The server can provide raw data. It cannot guess the intentions of the user.
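As an illustration of what an external tool could build from the server's raw data, here is a hedged Python sketch (the 1% threshold and the alerting heuristic are my assumptions, not any existing tool's logic). It derives a buffer pool miss rate from the standard InnoDB status counters Innodb_buffer_pool_reads (pages read from disk) and Innodb_buffer_pool_read_requests (logical read requests), which a monitoring tool could poll via SHOW GLOBAL STATUS.

```python
# Sketch of external monitoring logic built on InnoDB's raw counters.
# The 1% threshold below is an illustrative assumption, not an official value.

def buffer_pool_miss_rate(status: dict) -> float:
    """Fraction of logical reads that had to go to disk. With O_DIRECT these
    are real disk reads, with no filesystem cache to absorb them."""
    requests = status["Innodb_buffer_pool_read_requests"]
    if requests == 0:
        return 0.0
    return status["Innodb_buffer_pool_reads"] / requests

def working_set_fits(status: dict, threshold: float = 0.01) -> bool:
    """Heuristic: the working set fits if under 1% of reads miss the pool."""
    return buffer_pool_miss_rate(status) < threshold

# Example counters, as a tool might collect from SHOW GLOBAL STATUS:
sample = {
    "Innodb_buffer_pool_read_requests": 1_000_000,
    "Innodb_buffer_pool_reads": 50_000,  # 5% of reads went to disk
}
print(buffer_pool_miss_rate(sample))  # 0.05
print(working_set_fits(sample))       # False
```

Such a check would live in a monitoring agent, not in the server, which matches the point above: the server provides the counters, and the tool interprets them against the user's intent.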

My intuition suggests that any copy-on-write file system, such as ZFS or btrfs, could be a bad match for a write-heavy InnoDB setup, or for any database that is based on B-trees and a circular write-ahead log.

Comment by Timo Jyrinki [ 2022-05-27 ]

For a future MariaDB LTS release, it could be beneficial to have fallback scenarios in case MDEV-19895 does not get implemented by then. The pool size default could be bumped to the mentioned 1GB, or the server could revert to fsync when innodb_buffer_pool_size is not configured, maybe with an informational message suggesting that the value be set. This would likely improve the out-of-the-box experience for many.

With O_DIRECT we generally see good results given a 1GB pool size (or higher). The plea for a warning message comes only from the fact that the default does not seem like an optimal compromise, and that the new behavior needs manual intervention compared to the more "forgiving" fsync.

All in all, learning about these things is an acceptable part of the MariaDB 10.6 learning curve, and good documentation helps. O_DIRECT has been the correct default choice in general. Good (or at least OK) defaults would always be useful too, though.

Generated at Thu Feb 08 09:55:28 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.