I extended the tests that I conducted yesterday on 10.10, running them for
each value of innodb_flush_log_at_trx_commit on all three storage devices that I have easy access to.
Like yesterday, I tested both the impact of disabling O_DIRECT on the redo log (reverting MDEV-28111) and of enabling the asynchronous redo log (applying MDEV-26603).
In the tables below, each variant is identified by whether MDEV-26603 and MDEV-28111 are applied (+) or absent/reverted (-), followed by the innodb_flush_log_at_trx_commit value; the columns show the throughput at 20 to 640 concurrent connections.
INTEL SSDPED1D960GAY (Intel Optane 960 NVMe PCIe, 1 TiB):
variant         |        20 |        40 |        80 |       160 |       320 |       640
-26603 -28111 0 | 174877.25 | 218600.14 | 243620.69 | 235452.35 | 221782.77 | 230218.65
-26603 +28111 0 | 177523.85 | 220885.63 | 250248.56 | 222407.88 | 228020.70 | 231261.60
-26603 -28111 1 |  40154.01 |  83822.27 | 146941.02 | 169106.35 | 159355.62 | 148137.08
-26603 +28111 1 |  44393.10 |  87522.12 | 154563.69 | 175664.35 | 149363.81 | 140018.75
-26603 -28111 2 | 116627.95 | 110871.80 |  93141.31 |  90384.32 |  95568.63 | 100567.27
-26603 +28111 2 |  77489.69 | 126257.11 | 170294.80 | 153718.74 | 146924.24 | 139760.85
+26603 -28111 0 | 175585.31 | 219198.76 | 246444.31 | 234371.68 | 223168.61 | 230352.08
+26603 +28111 0 | 175449.09 | 218696.74 | 245015.51 | 232257.35 | 223078.96 | 231192.25
+26603 -28111 1 |  38898.41 |  77419.98 | 143587.57 | 171343.71 | 160899.60 | 147169.91
+26603 +28111 1 |  39056.53 |  80030.78 | 152584.63 | 178147.98 | 149525.98 | 141598.37
+26603 -28111 2 | 116206.19 | 122350.90 | 102117.73 |  94406.60 | 106100.66 | 112945.75
+26603 +28111 2 |  61655.26 | 119870.79 | 170313.32 | 155449.44 | 149877.05 | 141699.37
Samsung SSD 850 EVO 500GB, SATA 3.1, 6.0 Gb/s (SSD):
variant         |        20 |        40 |        80 |       160 |       320 |       640
-26603 -28111 0 | 173837.73 | 196463.82 | 181172.24 | 248655.80 | 121209.19 | 185363.02
-26603 +28111 0 | 173456.74 | 227876.18 | 140814.39 | 247316.21 | 147288.87 | 195739.68
-26603 -28111 1 |   1637.31 |   3293.68 |   6268.56 |  28426.46 |  29659.73 |  37819.90
-26603 +28111 1 |   1651.81 |   3117.78 |   5865.57 |  11192.76 |  21408.23 |  42071.74
-26603 -28111 2 | 116561.43 | 111697.04 |  92891.91 |  77973.84 |  83535.76 | 102135.74
-26603 +28111 2 |  50282.03 | 134429.60 | 115351.79 | 120100.46 | 151187.50 | 145336.60
+26603 -28111 0 | 172827.09 | 195303.67 | 175215.87 | 247712.12 | 112698.70 | 181610.86
+26603 +28111 0 | 172311.93 | 196527.93 | 187143.44 | 249095.29 | 117095.25 | 185992.18
+26603 -28111 1 |   4859.92 |   2997.25 |   5666.87 |  10431.58 |  19529.83 |  37454.78
+26603 +28111 1 |   1581.06 |   3132.87 |   5732.18 |  10837.41 |  20893.13 |  41192.27
+26603 -28111 2 | 113534.45 | 120130.84 | 101364.19 |  77589.95 |  90102.00 | 113379.56
+26603 +28111 2 |  45810.35 | 128985.45 | 139551.91 | 103551.01 | 153063.98 | 144727.32
Western Digital Blue WDC WD20EZRZ-00Z5HB0, SATA 3.0, 6.0 Gb/s (2TiB HDD):
variant         |        20 |        40 |        80 |       160 |       320 |       640
-26603 -28111 0 | 151030.35 | 156442.72 |  59875.14 |  40854.54 |  32574.75 |  66260.11
-26603 +28111 0 | 152691.94 | 198770.81 |  45308.39 |  36743.35 |  38068.41 |  38420.25
-26603 -28111 1 |    442.81 |   1029.68 |   1922.92 |   3391.98 |   5221.32 |   8323.79
-26603 +28111 1 |    483.88 |   1027.51 |   1845.69 |   3305.48 |   5473.87 |   8721.01
-26603 -28111 2 | 107780.99 | 104655.24 |  86786.08 |  86942.45 |  14004.30 |  37662.52
-26603 +28111 2 |  38057.16 |  81649.51 | 124179.58 |  66601.85 |   4696.54 |   8492.77
+26603 -28111 0 | 147208.47 | 184330.29 |  48711.60 |  46878.98 |  44284.67 |  45194.95
+26603 +28111 0 | 153007.97 | 197770.54 |  36163.94 |  35519.65 |  42716.21 |  30158.95
+26603 -28111 1 |    463.67 |    935.19 |   1665.83 |   3219.84 |   5437.44 |   8793.70
+26603 +28111 1 |    457.01 |    935.99 |   1551.90 |   1922.09 |   3504.11 |   8506.51
+26603 -28111 2 | 107788.64 | 115177.97 |  91923.53 |  49215.16 |  42875.74 |  38165.49
+26603 +28111 2 |  33210.59 |  80173.39 | 127686.29 |  62756.50 |   4504.81 |   8561.01
Occasionally, log checkpoints (page writes) occur and affect the throughput. This is particularly visible in the HDD benchmark, where the throughput drops to 0 for 20 seconds of the 30-second test at 80 concurrent connections and never recovers to the earlier level during the following 90 seconds of testing (with 160, 320, and 640 concurrent connections). I suspect that this happened because I was running the workloads back-to-back, without forcing a log checkpoint between runs or waiting for the history of committed transactions to be purged.
We can observe that on the slower storage (or on the NVMe device at low numbers of concurrent connections), enabling O_DIRECT on the redo log does seem to help in this simple benchmark, where all data fits in the buffer pool and most writes go to the redo log. In some cases we can observe this even with the safe setting innodb_flush_log_at_trx_commit=1.
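For reference, the difference that these variants exercise boils down to whether the redo log file descriptor is opened with O_DIRECT. The following is only a minimal stand-alone sketch, not MariaDB code (the file name and the 4096-byte block size are assumptions for illustration); it shows both the cache bypass and the alignment constraints that O_DIRECT imposes on the buffer address, file offset, and transfer size:

/* Minimal stand-alone sketch of buffered vs. direct I/O on a log-like
   file.  Not MariaDB code; the file name and the 4096-byte block size
   are placeholder assumptions. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
  /* With O_DIRECT the kernel bypasses the page cache; without the flag,
     every write would also populate the file system cache. */
  int fd = open("ib_logfile0.test", O_WRONLY | O_CREAT | O_DIRECT, 0640);
  if (fd < 0) { perror("open"); return 1; }

  /* O_DIRECT requires the buffer address, the file offset and the
     transfer size to be aligned, typically to the logical block size. */
  const size_t block = 4096;
  void *buf;
  if (posix_memalign(&buf, block, block)) { close(fd); return 1; }
  memset(buf, 0, block);

  if (pwrite(fd, buf, block, 0) != (ssize_t) block)
    perror("pwrite");

  free(buf);
  return close(fd);
}

These alignment constraints are also why, with O_DIRECT enabled, writes that are not a multiple of the block size have to be padded or deferred by the log writer.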
However, we should also keep in mind that not using O_DIRECT will pollute the file system cache with useless data.
Because my results do not show any clear pattern, I think that we must introduce a settable Boolean configuration parameter for enabling or disabling the file system cache on the redo log. I maintain that the caching should be disabled by default. Also, mariadb-backup should benefit from enabling the file system cache while it is running.
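The name and exact semantics of such a parameter remain to be decided. As a sketch only (a hypothetical helper, not a decided design): on Linux, fcntl(F_SETFL) can change the O_DIRECT flag of an already-open file descriptor, so a SET GLOBAL toggle would not even require reopening the log file:

/* Sketch of a run-time toggle for the proposed parameter.  Hypothetical
   helper, not MariaDB code; on Linux, fcntl(F_SETFL) can change the
   O_DIRECT flag of an already-open file descriptor. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int set_log_file_buffering(int fd, int buffered)
{
  int flags = fcntl(fd, F_GETFL);
  if (flags < 0)
    return -1;
  if (buffered)
    flags &= ~O_DIRECT;  /* writes also populate the file system cache */
  else
    flags |= O_DIRECT;   /* writes bypass the file system cache */
  return fcntl(fd, F_SETFL, flags);
}

int main(void)
{
  int fd = open("ib_logfile0.test", O_WRONLY | O_CREAT, 0640);
  if (fd < 0) { perror("open"); return 1; }
  /* A SET GLOBAL toggle could map to a call like this one. */
  if (set_log_file_buffering(fd, 0))
    perror("fcntl");
  return close(fd);
}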
Uploaded: