Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 10.8.3, 10.9.1
- Environment: Linux
Description
MDEV-28111, and specifically commit c4c88307091, breaks the meaning of innodb_flush_log_at_trx_commit=2 by enabling O_DIRECT for the redo log.
O_DIRECT should not be used when innodb_flush_log_at_trx_commit=2 is set. Since this variable is dynamic, it must be checked not only at server startup, but also every time the variable is changed.
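Purely as an illustration (not part of the original report), this is how the variable is changed at runtime; any decision to open the redo log with O_DIRECT that is based on its value would therefore have to be re-evaluated on every such SET, not only at startup:
-- relax durability; redo log writes are expected to remain buffered
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
-- restore full durability
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
-- check the current value
SELECT @@GLOBAL.innodb_flush_log_at_trx_commit;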
Attachments
Issue Links
- is caused by:
  - MDEV-28111 Redo log writes are being buffered on Linux for no good reason (Closed)
- relates to:
  - MDEV-30136 Map innodb_flush_method to new settable Booleans innodb_{log,data}_file_{buffering,write_through} (Closed)
  - MDEV-31642 Upgrade from 10.7 or earlier may crash if innodb_log_file_buffering=OFF (Closed)
  - MDEV-24854 Change innodb_flush_method=O_DIRECT by default (Closed)
  - MDEV-33894 MariaDB does unexpected storage read IO for the redo log (Closed)
Activity
I think what makes this critical is the fact that it breaks existing behavior of the server that might be important for some (many?) customers.
Some may have legacy applications running (written for MyISAM) and thus enable innodb_flush_log_at_trx_commit=2. Or they do that only temporarily to speed up certain operations (mass inserts, mass deletes).
When you run the server as part of an HA setup (Galera, replication), you can run it permanently with innodb_flush_log_at_trx_commit=2.
Which operating system kernel version are you using, which file system, and what type of storage? MDEV-28111 enabled O_DIRECT for the redo log writes.
On my system (currently running the Linux 5.17.11 kernel), with libaio or io_uring on ext4fs on an Intel Optane 960 PCIe NVMe drive, I see a significant difference between the innodb_flush_log_at_trx_commit values 0 and 1. The difference will be smaller for larger numbers of concurrent users, thanks to the MDEV-21534 group commit mechanism, but it is there.
threads | 0+ | 1+ | 1 |
---|---|---|---|
20 | 178254.86tps | 42795.70tps | 47291.61tps |
40 | 215179.33tps | 85515.37tps | 95972.42tps |
80 | 244187.58tps | 146439.29tps | 148455.56tps |
160 | 235432.55tps | 178445.29tps | 178190.07tps |
320 | 221834.57tps | 146592.81tps | 145885.36tps |
640 | 230340.60tps | 137114.63tps | 137960.68tps |
This was a 10.9 branch with MDEV-28708 and with/without MDEV-26603.
OS: Linux cheetah02 4.15.0-45-generic #48-Ubuntu
FS: /dev/md127 on /mnt/ssd-raid type xfs (rw,relatime,attr2,inode64,sunit=2048,swidth=4096,noquota)

$ cat /proc/mdstat
md127 : active raid0 sdd1[1] sdc1[0]
      586070016 blocks super 1.0 1024k chunks

$ sudo hdparm -I /dev/sdc
ATA device, with non-removable media
    Model Number:       INTEL SSDSC2BB300H4
    Serial Number:      BTWH5054050L300AGN
    Firmware Revision:  D2010380
    Media Serial Num:
    Media Manufacturer:
    Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6
As noted in MDEV-28111, I had conducted tests on 3 types of storage: SATA 3.0 HDD, SATA SSD, and PCIe NVMe. All were with ext4fs, no RAID. None of the drives support FUA. Those tests were conducted on a 5.16 kernel.
Would it be possible to try a newer kernel? Linux 4.15 was released in 2018. I see that it was used in Ubuntu 18.04 LTS, which will reach its normal end of life in April 2023, less than a year from now. According to Wikipedia, 4.15 is not a LTS release; 4.14 was one.
Linux 5.4 (LTS) was packaged in Ubuntu 20.04 LTS, which is supported until April 2025.
I found reports that Linux I/O performance has been steadily improved in the past year. I have used kernel versions 5.4, 5.10 (LTS), 5.14, 5.15, 5.16, and 5.17, but I do not remember observing any significant performance difference between them, or between libaio and io_uring. That could be because there are many CPU bottlenecks in our code. The Intel Optane 960 drive in my system can achieve 2300 MB/s writes or 2600 MB/s reads, but I have never seen even half of that being achieved in any test with MariaDB Server.
I conducted some more tests, again Sysbench oltp_update_index, but this time with innodb_flush_log_at_trx_commit=0, on 10.10 with (merging) and without MDEV-26603, as well as with and without (reverting) MDEV-28111. Previously I had mainly tested these changes with innodb_flush_log_at_trx_commit=1.
variant | 20 | 40 | 80 | 160 | 320 | 640 |
---|---|---|---|---|---|---|
+26603 +28111 | 177076.88 | 218881.81 | 248083.36 | 235197.21 | 223951.18 | 229256.12 |
+26603 -28111 | 173345.30 | 218769.01 | 245545.64 | 231961.31 | 215493.93 | 226815.86 |
-26603 +28111 | 174355.62 | 218934.21 | 247682.87 | 234316.09 | 218591.32 | 231514.93 |
-26603 -28111 | 174834.69 | 218338.09 | 244947.61 | 242759.96 | 209059.92 | 222771.75 |
We can see that even with the non-durable setting, O_DIRECT writes are faster on my hardware and on the 5.17.11 kernel. Any variant that reverts MDEV-28111 (that is, disables O_DIRECT on the redo log) is slightly slower.
There are some glitches at 160 concurrent connections. It could be that a log checkpoint occurred during that time.
Furthermore, MDEV-26603 seems to improve things slightly, or the difference fits within the margin of error. More extensive testing will be needed to say anything conclusive about it.
I think that apart from the Linux kernel version, the file system and the type of drive could matter. I have not encountered any drive that supports FUA; that could be yet another factor. Perhaps instead of trying to automatically guess whether to enable O_DIRECT on the redo log, we should introduce a new value for innodb_flush_method to control it? At the same time, we could allow that parameter to be changed without a server restart.
This graph shows the results for innodb_flush_log_at_trx_commit=1 (the previous one covered only innodb_flush_log_at_trx_commit=2). We can see that switching to O_DIRECT improves performance for innodb_flush_log_at_trx_commit=1 somewhat. Not much, but that is probably due to the hardware/kernel/file system combination.
And finally, the result for innodb_flush_log_at_trx_commit=0.
It turns out that the negative effect exists only for innodb_flush_log_at_trx_commit=2. For all other settings of innodb_flush_log_at_trx_commit, MDEV-28111 is beneficial.
I will change the issue description accordingly.
I extended the tests that I conducted yesterday on 10.10, running them for each value of innodb_flush_log_at_trx_commit and all 3 storage devices that I have easy access to.
Like yesterday, I tested both the impact of disabling O_DIRECT for the redo log (reverting MDEV-28111) and of enabling asynchronous redo log writes (applying MDEV-26603).
INTEL SSDPED1D960GAY (Intel Optane 960 NVMe PCIe, 1 TiB):
variant | 20 | 40 | 80 | 160 | 320 | 640 |
---|---|---|---|---|---|---|
-26603 -28111 0 | 174877.25 | 218600.14 | 243620.69 | 235452.35 | 221782.77 | 230218.65 |
-26603 +28111 0 | 177523.85 | 220885.63 | 250248.56 | 222407.88 | 228020.70 | 231261.60 |
-26603 -28111 1 | 40154.01 | 83822.27 | 146941.02 | 169106.35 | 159355.62 | 148137.08 |
-26603 +28111 1 | 44393.10 | 87522.12 | 154563.69 | 175664.35 | 149363.81 | 140018.75 |
-26603 -28111 2 | 116627.95 | 110871.80 | 93141.31 | 90384.32 | 95568.63 | 100567.27 |
-26603 +28111 2 | 77489.69 | 126257.11 | 170294.80 | 153718.74 | 146924.24 | 139760.85 |
+26603 -28111 0 | 175585.31 | 219198.76 | 246444.31 | 234371.68 | 223168.61 | 230352.08 |
+26603 +28111 0 | 175449.09 | 218696.74 | 245015.51 | 232257.35 | 223078.96 | 231192.25 |
+26603 -28111 1 | 38898.41 | 77419.98 | 143587.57 | 171343.71 | 160899.60 | 147169.91 |
+26603 +28111 1 | 39056.53 | 80030.78 | 152584.63 | 178147.98 | 149525.98 | 141598.37 |
+26603 -28111 2 | 116206.19 | 122350.90 | 102117.73 | 94406.60 | 106100.66 | 112945.75 |
+26603 +28111 2 | 61655.26 | 119870.79 | 170313.32 | 155449.44 | 149877.05 | 141699.37 |
Samsung SSD 850 EVO 500GB, SATA 3.1, 6.0 Gb/s (SSD):
variant | 20 | 40 | 80 | 160 | 320 | 640 |
---|---|---|---|---|---|---|
-26603 -28111 0 | 173837.73 | 196463.82 | 181172.24 | 248655.80 | 121209.19 | 185363.02 |
-26603 +28111 0 | 173456.74 | 227876.18 | 140814.39 | 247316.21 | 147288.87 | 195739.68 |
-26603 -28111 1 | 1637.31 | 3293.68 | 6268.56 | 28426.46 | 29659.73 | 37819.90 |
-26603 +28111 1 | 1651.81 | 3117.78 | 5865.57 | 11192.76 | 21408.23 | 42071.74 |
-26603 -28111 2 | 116561.43 | 111697.04 | 92891.91 | 77973.84 | 83535.76 | 102135.74 |
-26603 +28111 2 | 50282.03 | 134429.60 | 115351.79 | 120100.46 | 151187.50 | 145336.60 |
+26603 -28111 0 | 172827.09 | 195303.67 | 175215.87 | 247712.12 | 112698.70 | 181610.86 |
+26603 +28111 0 | 172311.93 | 196527.93 | 187143.44 | 249095.29 | 117095.25 | 185992.18 |
+26603 -28111 1 | 4859.92 | 2997.25 | 5666.87 | 10431.58 | 19529.83 | 37454.78 |
+26603 +28111 1 | 1581.06 | 3132.87 | 5732.18 | 10837.41 | 20893.13 | 41192.27 |
+26603 -28111 2 | 113534.45 | 120130.84 | 101364.19 | 77589.95 | 90102.00 | 113379.56 |
+26603 +28111 2 | 45810.35 | 128985.45 | 139551.91 | 103551.01 | 153063.98 | 144727.32 |
Western Digital Blue WDC WD20EZRZ-00Z5HB0, SATA 3.0, 6.0 Gb/s (2TiB HDD):
variant | 20 | 40 | 80 | 160 | 320 | 640 |
---|---|---|---|---|---|---|
-26603 -28111 0 | 151030.35 | 156442.72 | 59875.14 | 40854.54 | 32574.75 | 66260.11 |
-26603 +28111 0 | 152691.94 | 198770.81 | 45308.39 | 36743.35 | 38068.41 | 38420.25 |
-26603 -28111 1 | 442.81 | 1029.68 | 1922.92 | 3391.98 | 5221.32 | 8323.79 |
-26603 +28111 1 | 483.88 | 1027.51 | 1845.69 | 3305.48 | 5473.87 | 8721.01 |
-26603 -28111 2 | 107780.99 | 104655.24 | 86786.08 | 86942.45 | 14004.30 | 37662.52 |
-26603 +28111 2 | 38057.16 | 81649.51 | 124179.58 | 66601.85 | 4696.54 | 8492.77 |
+26603 -28111 0 | 147208.47 | 184330.29 | 48711.60 | 46878.98 | 44284.67 | 45194.95 |
+26603 +28111 0 | 153007.97 | 197770.54 | 36163.94 | 35519.65 | 42716.21 | 30158.95 |
+26603 -28111 1 | 463.67 | 935.19 | 1665.83 | 3219.84 | 5437.44 | 8793.70 |
+26603 +28111 1 | 457.01 | 935.99 | 1551.90 | 1922.09 | 3504.11 | 8506.51 |
+26603 -28111 2 | 107788.64 | 115177.97 | 91923.53 | 49215.16 | 42875.74 | 38165.49 |
+26603 +28111 2 | 33210.59 | 80173.39 | 127686.29 | 62756.50 | 4504.81 | 8561.01 |
Some log checkpoints (page writes) occasionally occur and affect the throughput. This is particularly visible in the HDD benchmark, where the throughput drops to 0 for 20 seconds of the 30-second test at 80 concurrent connections, and the average never recovers in the following 90 seconds of testing (with 160, 320, and 640 concurrent connections). I guess that this happened because I was running the workload back-to-back, without triggering any log checkpoint or waiting for the history of committed transactions to be purged.
We can observe that on the slower storage (or on the NVMe at low numbers of concurrent connections), keeping the file system cache enabled on the redo log (not using O_DIRECT) does seem to help in this simple benchmark, where all data fits in the buffer pool and most writes are for the redo log. We can observe this even for the safe setting innodb_flush_log_at_trx_commit=1 in some cases.
However, we should also keep in mind that not using O_DIRECT will pollute the file system cache with useless data.
Because my results do not show any clear pattern, I think that we must introduce a settable Boolean configuration parameter for enabling or disabling the file system cache for the redo log. I maintain that the caching should be disabled by default. mariadb-backup should also benefit from enabling the file system cache while it is running.
I developed a patch that introduces a new Boolean parameter
SET GLOBAL innodb_log_file_buffering=ON;
to explicitly enable or disable the file system cache on the InnoDB redo log. This parameter is only available on Linux and Microsoft Windows. On other platforms, we do not implement any interface for enabling or disabling a file system cache for the redo log.
On Linux, if the physical block size cannot be determined, this parameter will be hard-wired to ON. If the log file is on persistent memory (MDEV-25090), the parameter will be hard-wired to OFF.
On Microsoft Windows, we will retain the feature that if the server is started up with innodb_flush_log_at_trx_commit=2, the file system cache on the redo log file will be disabled.
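A minimal usage sketch, assuming a build that includes the patch described above (these statements are not from the original comment): the effective setting can be inspected and toggled at runtime.
-- show whether the redo log is currently using the file system cache
SHOW GLOBAL VARIABLES LIKE 'innodb_log_file_buffering';
-- request buffered (cached) redo log writes
SET GLOBAL innodb_log_file_buffering = ON;
-- request unbuffered (O_DIRECT) redo log writes
SET GLOBAL innodb_log_file_buffering = OFF;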
It is generally a bad idea to add any configuration parameters. For this change, there seems to be a consensus that the file system cache must not be disabled whenever innodb_flush_log_at_trx_commit=2 is set, either when the server is being started up, or while it is running. I will revise the fix accordingly.
For the sake of completeness, I tested on the Intel Optane 960 the durable setting innodb_flush_log_at_trx_commit=1 at lower numbers of concurrent connections, with the file system cache for the log enabled and disabled. Direct unbuffered writes were always faster.
log | 1 | 2 | 4 | 8 | 16 | 20 |
---|---|---|---|---|---|---|
unbuffered | 7722.51 | 14888.17 | 23673.12 | 35600.87 | 46511.59 | 51967.62 |
buffered | 7133.68 | 12955.10 | 20733.74 | 34113.07 | 45751.61 | 49746.02 |
I got some feedback that in an environment based on an Ubuntu 20.04 kernel (Linux 5.4.something), on a 3TiB NVMe drive, innodb_flush_log_at_trx_commit=1 was significantly faster when the file system cache was used (the log file was opened without O_DIRECT). Based on this, it looks like we must introduce the Boolean parameter
SET GLOBAL innodb_log_file_buffering=OFF;
SET GLOBAL innodb_log_file_buffering=ON;
so that the file system cache can be enabled or disabled independently of innodb_flush_log_at_trx_commit or innodb_flush_method. For compatibility with old releases, we could set the default value to ON when innodb_flush_log_at_trx_commit=2 has been specified at startup.
On Linux and Microsoft Windows, the Boolean parameter innodb_log_file_buffering will indicate whether the file system cache on the redo log file is enabled. Its default value is OFF. If innodb_flush_log_at_trx_commit=2 is set on startup, the parameter will be set to ON.
The parameter innodb_flush_method no longer determines whether file system caching is enabled on the redo log; it will only affect the buffering of InnoDB data files.
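A hedged example of the behavior described above, assuming a server version that ships this parameter (the reported values depend on the startup options):
-- if the server was started with innodb_flush_log_at_trx_commit=2,
-- innodb_log_file_buffering is expected to report ON; otherwise OFF
SELECT @@GLOBAL.innodb_flush_log_at_trx_commit, @@GLOBAL.innodb_log_file_buffering;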
Hi axel
What is the reason O_DIRECT should not be used when innodb_flush_log_at_trx_commit = 2 is set?
I have actually had both of these enabled (O_DIRECT and innodb_flush_log_at_trx_commit = 2) for many years.
I have never seen this combination cause an issue, but I'm wondering if the problems I'm having since I upgraded from 10.5 to 10.8.3 could be related to these.
Thank you!
Hello nunop,
this has nothing to do with innodb_flush_method, which is the only way you could "use O_DIRECT". That setting, however, only affects writes to the tablespaces (checkpoints).
MDEV-28111 is about using O_DIRECT for the redo log. This is especially bad with innodb_flush_log_at_trx_commit=2. To put it in other words: it stops buffering log writes and makes the server behave as if innodb_flush_log_at_trx_commit=1 were set.
That's why we introduced the variable innodb_log_file_buffering. It defaults to a setting that makes innodb_flush_log_at_trx_commit=2 work again. So with this fix (MDEV-28766) in place, everything should be as before.
Hi axel
Thank you very much for clarifying. I understand now, and it makes sense!
So when I update to 10.8.4, I'll put back O_DIRECT and innodb_flush_log_at_trx_commit=2, as I had disabled these two due to problems I was having with mariabackup.
Thanks! Have a great day.