  1. MariaDB Server
  2. MDEV-28766

MDEV-28111 breaks innodb_flush_log_at_trx_commit=2

Details

    Description

      MDEV-28111 and specifically commit c4c88307091 breaks the meaning of innodb_flush_log_at_trx_commit = 2 by enabling O_DIRECT for the redo log.

      O_DIRECT should not be used when innodb_flush_log_at_trx_commit = 2 is set. Since this variable is dynamic, it must be checked not only at server startup, but also every time the variable is changed.
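      The dynamic-variable requirement can be sketched as follows. This is a hypothetical illustration in Python (not InnoDB's actual C++ code); the names `redo_log_wants_o_direct` and `RedoLog` are invented for the sketch:

```python
# Hypothetical sketch (not InnoDB source): the O_DIRECT decision must be
# re-evaluated whenever innodb_flush_log_at_trx_commit changes, because
# the variable is dynamic, not startup-only.

def redo_log_wants_o_direct(flush_log_at_trx_commit: int) -> bool:
    """With =2 the log is written at commit but only flushed once per
    second, so writes should stay in the file system cache; O_DIRECT
    defeats that buffering."""
    return flush_log_at_trx_commit != 2

class RedoLog:
    def __init__(self, flush_log_at_trx_commit: int):
        # checked once at server startup
        self.o_direct = redo_log_wants_o_direct(flush_log_at_trx_commit)

    def set_flush_log_at_trx_commit(self, value: int) -> None:
        # update hook: must also run on every SET GLOBAL
        self.o_direct = redo_log_wants_o_direct(value)

log = RedoLog(flush_log_at_trx_commit=1)
assert log.o_direct            # =1: durable commits, O_DIRECT is fine
log.set_flush_log_at_trx_commit(2)
assert not log.o_direct        # =2: keep the file system cache
```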

      Attachments

        Issue Links

          Activity

            axel Axel Schwenke added a comment -

            Uploaded:

            axel Axel Schwenke added a comment -

            I think what makes this critical is the fact that it breaks existing behavior of the server that might be important for some (many?) customers.

            Some may have legacy applications running (written for MyISAM) and thus enable innodb_flush_log_at_trx_commit=2. Or they do that only temporarily, to speed up certain operations (mass inserts, mass deletes).

            When you run the server as part of an HA setup (Galera, replication), you can run it permanently with innodb_flush_log_at_trx_commit=2.


            marko Marko Mäkelä added a comment -

            Which operating system kernel version are you using, which file system, and what type of storage? MDEV-28111 enabled O_DIRECT for the redo log writes.

            On my system (currently running the Linux 5.17.11 kernel), with libaio or io_uring on ext4fs on an Intel Optane 960 PCIe NVMe drive, I see a significant difference between the innodb_flush_log_at_trx_commit values 0 and 1. The difference will be smaller for larger numbers of concurrent users, thanks to the MDEV-21534 group commit mechanism, but it is there.

            threads   0 + MDEV-26603    1 + MDEV-26603    1 (baseline)
            20         178254.86 tps     42795.70 tps      47291.61 tps
            40         215179.33 tps     85515.37 tps      95972.42 tps
            80         244187.58 tps    146439.29 tps     148455.56 tps
            160        235432.55 tps    178445.29 tps     178190.07 tps
            320        221834.57 tps    146592.81 tps     145885.36 tps
            640        230340.60 tps    137114.63 tps     137960.68 tps

            This was a 10.9 branch with MDEV-28708 and with/without MDEV-26603.

            axel Axel Schwenke added a comment -

            OS: Linux cheetah02 4.15.0-45-generic #48-Ubuntu
             
            FS: /dev/md127 on /mnt/ssd-raid type xfs (rw,relatime,attr2,inode64,sunit=2048,swidth=4096,noquota)
             
            $cat /proc/mdstat 
            md127 : active raid0 sdd1[1] sdc1[0]
                  586070016 blocks super 1.0 1024k chunks
             
            $sudo hdparm -I /dev/sdc
             
            ATA device, with non-removable media
                    Model Number:       INTEL SSDSC2BB300H4
                    Serial Number:      BTWH5054050L300AGN
                    Firmware Revision:  D2010380
                    Media Serial Num:
                    Media Manufacturer:
                    Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6
            


            marko Marko Mäkelä added a comment -

            As noted in MDEV-28111, I had conducted tests on 3 types of storage: SATA 3.0 HDD, SATA SSD, and PCIe NVMe. All were with ext4fs, no RAID. None of the drives support FUA. Those tests were conducted on a 5.16 kernel.

            Would it be possible to try a newer kernel? Linux 4.15 was released in 2018. I see that it was used in Ubuntu 18.04 LTS, which will reach its normal end of life in April 2023, less than a year from now. According to Wikipedia, 4.15 is not an LTS release; 4.14 was one.

            Linux 5.4 (LTS) was packaged in Ubuntu 20.04 LTS, which is supported until April 2025.

            I found reports that Linux I/O performance has steadily improved in the past years. I have used kernel versions 5.4, 5.10 (LTS), 5.14, 5.15, 5.16, 5.17, but I do not remember observing any significant performance difference between them, or between libaio and io_uring. That could be because there are many CPU bottlenecks in our code. The Intel Optane 960 drive in my system can achieve 2300 MB/s writes or 2600 MB/s reads, but I have never seen even half of that being achieved in any test with MariaDB Server.


            marko Marko Mäkelä added a comment -

            I conducted some more tests, again Sysbench oltp_update_index, but this time with innodb_flush_log_at_trx_commit=0 on 10.10, with (merging) and without MDEV-26603, as well as with and without (reverting) MDEV-28111. Previously I mainly tested these changes with innodb_flush_log_at_trx_commit=1.

            variant            20         40         80        160        320        640  (threads; tps)
            +26603 +28111  177076.88  218881.81  248083.36  235197.21  223951.18  229256.12
            +26603 -28111  173345.30  218769.01  245545.64  231961.31  215493.93  226815.86
            -26603 +28111  174355.62  218934.21  247682.87  234316.09  218591.32  231514.93
            -26603 -28111  174834.69  218338.09  244947.61  242759.96  209059.92  222771.75

            We can see that even with the non-durable setting, O_DIRECT writes are faster on my hardware and on the 5.17.11 kernel. Any variant that reverts MDEV-28111 (that is, disables O_DIRECT on the redo log) is slightly slower.

            There are some glitches at 160 concurrent connections. It could be that a log checkpoint occurred during that time.

            Furthermore, MDEV-26603 seems to improve things slightly, or the difference fits within the margin of error. More extensive testing will be needed to say anything conclusive about it.

            I think that apart from the Linux kernel version, the file system and the type of drive could matter. I have not encountered any drive that supports FUA; that could be yet another factor. Perhaps instead of trying to automatically guess whether to enable O_DIRECT on the redo log, we should introduce a new value for innodb_flush_method to control it? At the same time, we could allow that parameter to be changed without a server restart.

            axel Axel Schwenke added a comment -

            This graph shows the results for innodb_flush_log_at_trx_commit=1 (before, only innodb_flush_log_at_trx_commit=2 was shown). We can see that switching to O_DIRECT improves performance for innodb_flush_log_at_trx_commit=1 somewhat. Not much, but that is probably due to the hardware/kernel/file system combination.

            axel Axel Schwenke added a comment -

            And finally, the result for innodb_flush_log_at_trx_commit=0.

            It turns out that the negative effect exists only for innodb_flush_log_at_trx_commit=2. For all other settings of innodb_flush_log_at_trx_commit, MDEV-28111 is beneficial.

            I will change the issue description accordingly.

            marko Marko Mäkelä added a comment - edited

            I extended the tests that I conducted yesterday on 10.10, running them for each value of innodb_flush_log_at_trx_commit and all 3 storage devices that I have easy access to.

            Like yesterday, I tested both the impact of disabling O_DIRECT for the redo log (reverting MDEV-28111) and enabling asynchronous redo log (applying MDEV-26603).

            INTEL SSDPED1D960GAY (Intel Optane 960 NVMe PCIe, 1 TiB):

            variant (26603 / 28111 / trx_commit)  20  40  80  160  320  640  (threads; tps)
            -26603 -28111 0 174877.25 218600.14 243620.69 235452.35 221782.77 230218.65
            -26603 +28111 0 177523.85 220885.63 250248.56 222407.88 228020.70 231261.60
            -26603 -28111 1 40154.01 83822.27 146941.02 169106.35 159355.62 148137.08
            -26603 +28111 1 44393.10 87522.12 154563.69 175664.35 149363.81 140018.75
            -26603 -28111 2 116627.95 110871.80 93141.31 90384.32 95568.63 100567.27
            -26603 +28111 2 77489.69 126257.11 170294.80 153718.74 146924.24 139760.85
            +26603 -28111 0 175585.31 219198.76 246444.31 234371.68 223168.61 230352.08
            +26603 +28111 0 175449.09 218696.74 245015.51 232257.35 223078.96 231192.25
            +26603 -28111 1 38898.41 77419.98 143587.57 171343.71 160899.60 147169.91
            +26603 +28111 1 39056.53 80030.78 152584.63 178147.98 149525.98 141598.37
            +26603 -28111 2 116206.19 122350.90 102117.73 94406.60 106100.66 112945.75
            +26603 +28111 2 61655.26 119870.79 170313.32 155449.44 149877.05 141699.37

            Samsung SSD 850 EVO 500GB, SATA 3.1, 6.0 Gb/s (SSD):

            variant (26603 / 28111 / trx_commit)  20  40  80  160  320  640  (threads; tps)
            -26603 -28111 0 173837.73 196463.82 181172.24 248655.80 121209.19 185363.02
            -26603 +28111 0 173456.74 227876.18 140814.39 247316.21 147288.87 195739.68
            -26603 -28111 1 1637.31 3293.68 6268.56 28426.46 29659.73 37819.90
            -26603 +28111 1 1651.81 3117.78 5865.57 11192.76 21408.23 42071.74
            -26603 -28111 2 116561.43 111697.04 92891.91 77973.84 83535.76 102135.74
            -26603 +28111 2 50282.03 134429.60 115351.79 120100.46 151187.50 145336.60
            +26603 -28111 0 172827.09 195303.67 175215.87 247712.12 112698.70 181610.86
            +26603 +28111 0 172311.93 196527.93 187143.44 249095.29 117095.25 185992.18
            +26603 -28111 1 4859.92 2997.25 5666.87 10431.58 19529.83 37454.78
            +26603 +28111 1 1581.06 3132.87 5732.18 10837.41 20893.13 41192.27
            +26603 -28111 2 113534.45 120130.84 101364.19 77589.95 90102.00 113379.56
            +26603 +28111 2 45810.35 128985.45 139551.91 103551.01 153063.98 144727.32

            Western Digital Blue WDC WD20EZRZ-00Z5HB0, SATA 3.0, 6.0 Gb/s (2TiB HDD):

            variant (26603 / 28111 / trx_commit)  20  40  80  160  320  640  (threads; tps)
            -26603 -28111 0 151030.35 156442.72 59875.14 40854.54 32574.75 66260.11
            -26603 +28111 0 152691.94 198770.81 45308.39 36743.35 38068.41 38420.25
            -26603 -28111 1 442.81 1029.68 1922.92 3391.98 5221.32 8323.79
            -26603 +28111 1 483.88 1027.51 1845.69 3305.48 5473.87 8721.01
            -26603 -28111 2 107780.99 104655.24 86786.08 86942.45 14004.30 37662.52
            -26603 +28111 2 38057.16 81649.51 124179.58 66601.85 4696.54 8492.77
            +26603 -28111 0 147208.47 184330.29 48711.60 46878.98 44284.67 45194.95
            +26603 +28111 0 153007.97 197770.54 36163.94 35519.65 42716.21 30158.95
            +26603 -28111 1 463.67 935.19 1665.83 3219.84 5437.44 8793.70
            +26603 +28111 1 457.01 935.99 1551.90 1922.09 3504.11 8506.51
            +26603 -28111 2 107788.64 115177.97 91923.53 49215.16 42875.74 38165.49
            +26603 +28111 2 33210.59 80173.39 127686.29 62756.50 4504.81 8561.01

            Log checkpoints (page writes) occasionally occur and affect the throughput. This is particularly visible in the HDD benchmark, where the throughput drops to 0 for 20 seconds of the 30-second test at 80 concurrent connections, and the average never recovers in the following 90 seconds of testing (with 160, 320, and 640 concurrent connections). I guess that this happened because I was running the workloads back-to-back, without triggering a log checkpoint or waiting for the history of committed transactions to be purged.

            We can observe that on the slower storage (or on the NVMe at low numbers of concurrent connections), enabling O_DIRECT on the redo log does seem to help in this simple benchmark where all data fits in the buffer pool, and most writes are for the redo log. We can observe this even for the safe setting innodb_flush_log_at_trx_commit=1 in some cases.

            However, we should also keep in mind that not using O_DIRECT will pollute the file system cache with useless data.

            Because my results do not show any clear pattern, I think that we must introduce a settable Boolean configuration parameter for enabling or disabling the file system cache on the redo log. I maintain that caching should be disabled by default. mariadb-backup should also benefit from enabling the file system cache while it is running.


            marko Marko Mäkelä added a comment -

            I developed a patch that introduces a new Boolean parameter

            SET GLOBAL innodb_log_file_buffering=ON;
            

            to explicitly enable or disable the file system cache on the InnoDB redo log (ON enables buffered writes; OFF disables them). This parameter is only available on Linux and Microsoft Windows. On other platforms, we do not implement any interface for enabling or disabling the file system cache for the redo log.

            On Linux, if the physical block size cannot be determined, this parameter will be hard-wired to ON. If the log file is on persistent memory (MDEV-25090), the parameter will be hard-wired to OFF.

            On Microsoft Windows, we will retain the feature that if the server is started up with innodb_flush_log_at_trx_commit=2, the file system cache on the redo log file will be disabled.
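            The Linux-side rules described above can be sketched as follows. This is hypothetical Python mirroring the prose (the function name and signature are invented for illustration; the real patch is C++ inside InnoDB):

```python
# Hypothetical sketch of the described patch behaviour (not the actual
# patch). Returns the effective value of innodb_log_file_buffering
# (True = file system cache enabled) given the platform constraints.

def effective_log_file_buffering(requested: bool,
                                 platform: str,
                                 block_size_known: bool = True,
                                 on_pmem: bool = False) -> bool:
    if platform not in ("linux", "windows"):
        # No interface for controlling the redo log cache elsewhere.
        raise ValueError("innodb_log_file_buffering is Linux/Windows only")
    if platform == "linux":
        if not block_size_known:
            # O_DIRECT requires a known physical block size for alignment,
            # so buffering is hard-wired to ON.
            return True
        if on_pmem:
            # Persistent memory (MDEV-25090): hard-wired to OFF.
            return False
    return requested

assert effective_log_file_buffering(False, "linux", block_size_known=False)
assert not effective_log_file_buffering(True, "linux", on_pmem=True)
assert not effective_log_file_buffering(False, "linux")
```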


            marko Marko Mäkelä added a comment -

            It is generally a bad idea to add configuration parameters. For this change, there seems to be a consensus that the file system cache must not be disabled whenever innodb_flush_log_at_trx_commit=2 is set, either when the server is being started up or while it is running. I will revise the fix accordingly.


            marko Marko Mäkelä added a comment -

            For the sake of completeness, I tested on the Intel Optane 960 the durable setting innodb_flush_log_at_trx_commit=1 at lower numbers of concurrent connections, with the file system cache for the log enabled and disabled. Direct unbuffered writes were always faster.

            log             1         2         4         8        16        20  (connections; tps)
            unbuffered   7722.51  14888.17  23673.12  35600.87  46511.59  51967.62
            buffered     7133.68  12955.10  20733.74  34113.07  45751.61  49746.02

            marko Marko Mäkelä added a comment -

            I got some feedback that in an environment based on an Ubuntu 20.04 kernel (Linux 5.4.something), on a 3 TiB NVMe drive, innodb_flush_log_at_trx_commit=1 was significantly faster when the file system cache was enabled (the log file was opened without O_DIRECT). Based on this, it looks like we must introduce the Boolean parameter

            SET GLOBAL innodb_log_file_buffering=OFF;
            SET GLOBAL innodb_log_file_buffering=ON;
            

            so that the file system cache can be enabled or disabled independently of innodb_flush_log_at_trx_commit or innodb_flush_method. For compatibility with old releases, we could set the default value to ON when innodb_flush_log_at_trx_commit=2 has been specified at startup.


            marko Marko Mäkelä added a comment -

            On Linux and Microsoft Windows, the Boolean parameter innodb_log_file_buffering will indicate whether the file system cache on the redo log file is enabled. Its default value is OFF. If innodb_flush_log_at_trx_commit=2 is set on startup, the parameter will be set to ON.

            The parameter innodb_flush_method no longer determines whether file system caching is enabled on the redo log; it will only affect the buffering of InnoDB data files.
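            The final startup behaviour can be summarised in a small sketch (hypothetical Python mirroring the prose above; the function name is invented for illustration):

```python
# Hypothetical sketch of the final startup behaviour described above
# (illustration only, not InnoDB source code).

def default_log_file_buffering(trx_commit_at_startup: int) -> bool:
    """innodb_log_file_buffering defaults to OFF (O_DIRECT on the redo
    log), but is switched ON when innodb_flush_log_at_trx_commit=2 is
    specified at startup, preserving the pre-MDEV-28111 buffered
    behaviour of that setting."""
    return trx_commit_at_startup == 2

assert default_log_file_buffering(1) is False  # default: cache disabled
assert default_log_file_buffering(2) is True   # =2 at startup: cache on
```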

            nunop Nuno added a comment -

            Hi axel

            What is the reason O_DIRECT should not be used when innodb_flush_log_at_trx_commit=2 is set?

            I actually always had both of these enabled (O_DIRECT and "innodb_flush_log_at_trx_commit = 2"), for many years.

            I've never seen this combination cause an issue, but I'm wondering if the problems I've been having since I upgraded from 10.5 to 10.8.3 could be related to it.

            Thank you!

            axel Axel Schwenke added a comment -

            Hello nunop,

            this has nothing to do with innodb_flush_method, which is the only way you could "use O_DIRECT". It is, however, only used for writes to tablespaces (checkpoints).

            MDEV-28111 is about using O_DIRECT for the redo log. This is especially bad with innodb_flush_log_at_trx_commit=2. In other words: it stops buffering log writes and makes the server behave as if innodb_flush_log_at_trx_commit=1 were set.

            That's why we introduced the variable innodb_log_file_buffering. It defaults to a setting that makes innodb_flush_log_at_trx_commit=2 work again. So with this MDEV-28766 fix in place, everything should be as before.
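            The behaviour described above can be sketched as follows. This is a hypothetical Python illustration of the write/flush semantics (the class and counters are invented for the sketch; it is not server code):

```python
# Hypothetical illustration of redo-log commit semantics (not server
# code). With innodb_flush_log_at_trx_commit=2 a commit only writes to
# the OS cache; a background task flushes once per second. O_DIRECT
# bypasses the OS cache, so every commit-time write effectively hits
# the device, which is why MDEV-28111 made =2 behave much like =1.

class SimulatedRedoLog:
    def __init__(self, trx_commit: int, o_direct: bool):
        self.trx_commit = trx_commit
        self.o_direct = o_direct
        self.device_writes = 0   # writes that reach the storage device
        self.cached_writes = 0   # writes absorbed by the OS cache

    def commit(self) -> None:
        if self.trx_commit == 1 or self.o_direct:
            self.device_writes += 1   # write reaches the device now
        elif self.trx_commit == 2:
            self.cached_writes += 1   # buffered; flushed by the 1 s task

buffered = SimulatedRedoLog(trx_commit=2, o_direct=False)
direct = SimulatedRedoLog(trx_commit=2, o_direct=True)
for _ in range(100):
    buffered.commit()
    direct.commit()
assert buffered.device_writes == 0 and buffered.cached_writes == 100
assert direct.device_writes == 100   # behaves like trx_commit=1
```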

            nunop Nuno added a comment -

            Hi axel

            Thank you very much for clarifying. I understand now, and it makes sense!

            So when I update to 10.8.4, I'll put back O_DIRECT and innodb_flush_log_at_trx_commit=2, as I had disabled these two due to problems I was having with mariabackup.

            Thanks! Have a great day.


            People

              marko Marko Mäkelä
              axel Axel Schwenke