[MDEV-33894] MariaDB does unexpected storage read IO for the redo log - Jira

Marko Mäkelä added a comment - 2024-04-12 10:21

The description of ~~MDEV-30136~~ starts with the following:

Starting with ~~MDEV-28766~~ in 10.8, O_DIRECT can be enabled on the redo log:
SET GLOBAL innodb_log_file_buffering=OFF;

MySQL at least up to version 5.7 structured the write-ahead log in 512-byte blocks. In ~~MDEV-14425~~ this was changed, and arbitrary size logical log block size was implemented, that is, each mini-transaction is a log block on its own. We will make an effort to detect the underlying physical block size. If it can be determined, then innodb_log_file_buffering=OFF should be enabled by default.

I believe that the anomaly that you are observing may go away by executing the following:

SET GLOBAL innodb_log_file_buffering=ON;

As you can see in ~~MDEV-33367~~, this also affects users who try to make backups, until we have MDEV-14992 or some form of server-driven copying of the log.

In any case, this logic may need to be revised, or the parameter innodb_log_write_ahead_size that was removed in ~~MDEV-14425~~ may need to be resurrected in some form. I recently learned that in ZFS there is a recordsize parameter that could be megabytes. Even for copy-on-write file systems that do not support O_DIRECT, it would make sense to write log aligned to the underlying buffers, to avoid read-before-write in the operating system. Currently, I think that we would typically write log aligned to 512 or 4096 byte blocks.
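The alignment idea in the preceding comment can be illustrated with a small sketch. This is not InnoDB code: `aligned_write_range` and its parameters are hypothetical names. It widens a write so that both its start offset and its length are multiples of a power-of-two block size, which is the property that would let the operating system overwrite whole blocks without reading them first.

```python
def aligned_write_range(offset: int, length: int, block_size: int) -> tuple:
    """Return (start, size) of the smallest block-aligned range covering a write.

    Hypothetical helper for illustration only; block_size must be a power of 2.
    """
    if block_size <= 0 or (block_size & (block_size - 1)) != 0:
        raise ValueError("block_size must be a power of 2")
    mask = block_size - 1
    start = offset & ~mask                     # round the start offset down
    end = (offset + length + mask) & ~mask     # round the end offset up
    return start, end - start
```

With a 512-byte block size, a 100-byte write at offset 1000 would be widened to 1024 bytes starting at offset 512. With a 4096-byte block size the same write would cover the whole first 4096-byte block.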

Mark Callaghan added a comment - 2024-04-12 18:37

I have not been setting innodb_log_file_buffering=ON, but with the my.cnf I use I still see =ON for it after startup via SHOW GLOBAL VARIABLES. And then in the mariadbd error log I see:
2024-04-12 18:29:37 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)

When I add innodb_log_file_buffering=OFF to my.cnf, and then start MariaDB, nothing changes from what I described above, I still see =ON for it in SHOW GLOBAL VARIABLES output.

The storage devices have:
$ cat /sys/block/nvme0n1/queue/hw_sector_size
512

$ cat /sys/block/sda//queue/hw_sector_size
512
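For anyone who wants to collect these values from a script, here is a minimal sketch that reads the same sysfs files. The function name `read_queue_attr` and the `sysfs_root` parameter are assumptions made for this example. The attribute names `hw_sector_size` and `physical_block_size` are the ones mentioned in this thread.

```python
from pathlib import Path
from typing import Optional


def read_queue_attr(device: str, attr: str, sysfs_root: str = "/sys/block") -> Optional[int]:
    """Read an integer queue attribute (e.g. hw_sector_size) of a block device.

    Returns None if the file does not exist or does not contain an integer.
    """
    path = Path(sysfs_root) / device / "queue" / attr
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None
```

Calling `read_queue_attr("nvme0n1", "hw_sector_size")` on the machine above should correspond to the first `cat` command. The `sysfs_root` parameter exists only so that the function can be tried against a directory other than `/sys/block`.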

Sergei Golubchik added a comment - 2024-04-17 17:22

I see that in some cases InnoDB can force innodb_log_file_buffering=ON. marko, could you elaborate on that, please?

Marko Mäkelä added a comment - 2024-05-27 06:02

If the ib_logfile0 can’t be opened in O_DIRECT mode, or if the log is being opened in memory-mapped mode (currently, by so called "fake PMEM" on /dev/shm), then innodb_log_file_buffering will be ON.
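The rule stated above can be written down as a tiny model. This is only a restatement of the comment, not the actual InnoDB logic. The function and parameter names are hypothetical.

```python
def effective_log_file_buffering(requested: bool, o_direct_ok: bool, memory_mapped: bool) -> bool:
    """Model of the rule described in the comment (not the real InnoDB code).

    requested     -- the configured value of innodb_log_file_buffering
    o_direct_ok   -- whether ib_logfile0 could be opened in O_DIRECT mode
    memory_mapped -- whether the log is opened in memory-mapped mode
    """
    if memory_mapped or not o_direct_ok:
        return True  # buffering is forced ON regardless of the request
    return requested
```

Under this model, a request for `OFF` only takes effect when the file can be opened with `O_DIRECT` and the log is not memory-mapped, which would match the behavior that Mark reported.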

Related to ~~MDEV-34062~~, I plan to experiment with memory-mapped access to the log file, no matter what the underlying file system or storage is. The read-before-write problem does exist also there, in the form of a page fault when starting to write to a new 4096-byte page. I am hoping that invoking fallocate(2) with FALLOC_FL_ZERO_RANGE would solve that problem.

Marko Mäkelä added a comment - 2024-05-29 13:04

mdcallag, I guess that the issue could be that your file system’s block allocation size is larger than the physical block size of 512 bytes. I probably had the wrong impression that the file system would only access 512 bytes in this case. So, we should probably consider the maximum of the physical block size and the allocation block size, or resurrect the parameter innodb_log_write_ahead_size.
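A rough sketch of the "maximum of the two sizes" idea follows. It assumes that the `f_bsize` field returned by `os.statvfs` is a usable stand-in for the file system's allocation block size, which may not hold on every file system. Both function names are hypothetical.

```python
import os


def file_system_block_size(path: str) -> int:
    """Block size that statvfs reports for the file system containing path."""
    return os.statvfs(path).f_bsize


def suggested_write_size(physical_block_size: int, fs_block_size: int) -> int:
    """Take the larger of the physical and the file system block size."""
    return max(physical_block_size, fs_block_size)
```

For a device that reports 512 bytes under a file system that reports 4096 bytes, `suggested_write_size(512, 4096)` yields 4096.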

Why the O_DIRECT is not being allowed on your system, I can’t tell. Possibly you are using a copy-on-write file system that does not allow O_DIRECT. Can you post some strace output that would help identify the reason? You can also try to execute SET GLOBAL innodb_log_file_buffering=ON while the server is running.

Related to this, my current fix of ~~MDEV-34062~~ would introduce another settable Boolean configuration parameter innodb_log_file_mmap=ON. If we allow this parameter to affect not only log reads but also log writes, the read-before-write phenomenon would look different, namely something that one could catch with the following:

sudo perf record -g -e major-faults -p $(pgrep mariadbd)

In my special benchmark (large buffer pool, tiny log file), all the faults would be attributed to writes to log_sys.buf, just like I expected. That is, the kernel would read old garbage (that was part of an earlier checkpoint) so that we can start writing new meaningful records. Except writing the CRC-32C of the mini-transaction, the fault would look something like this (captured when using the default event):

   - 5.13% mtr_t::commit()
      - 1.72% __memmove_avx_unaligned_erms_rtm
         - 1.62% asm_exc_page_fault
            - 1.58% exc_page_fault
               - 1.56% do_user_addr_fault
                  - 1.39% handle_mm_fault
                     - 1.30% __handle_mm_fault
                        - 1.18% do_fault
                           - 0.59% __do_fault
                                0.59% filemap_fault

This was recorded with an experimental patch that removed the FALLOC_FL_ZERO_RANGE calls that I would hope would allow the ‘read’ to be fulfilled by clear_page_erms or a similar Linux kernel function that would zerofill a page. Unfortunately I do not see any difference whether or not I disable those fallocate() calls in log_t::write_checkpoint() in my current fix. But, so far I was only testing on a rather fast NVMe. I checked that this commit applies cleanly on the current 11.4 branch. For your convenience, I created the branch bb-11.4-MDEV-34062.

Marko Mäkelä added a comment - 2024-05-30 09:38

I posted to ~~MDEV-34062~~ some results on a SATA 3.0 HDD (Western Digital Blue WDC WD20EZRZ-00Z5HB0) with the following parameters:

cat /sys/block/sdb/queue/hw_sector_size /sys/block/sdb/queue/physical_block_size
512
4096

InnoDB is only referring to the latter, not to the hw_sector_size. My conclusion is that for the memory-mapped log write interface that ~~MDEV-34062~~ could introduce, this read-before-write phenomenon (in the form of page faults) is unavoidable (but not necessarily too bad), and an attempt to fix it with FALLOC_FL_ZERO_RANGE would only make things worse.

Independent of that, the parameter innodb_log_write_ahead_size will likely have to be resurrected. That parameter would be independent of ~~MDEV-34062~~; it can’t have any impact on memory-mapped log writes.

mdcallag, what is reported as the physical_block_size on your devices? Do you know if there is a parameter that would expose any shingled magnetic recording block size?

Mark Callaghan added a comment - 2024-06-06 22:27

While working on another bug, I encountered this again for a sysbench microbenchmark that gets much slower starting in MariaDB 10.11. My comment on the other bug is here

The comment was ...

From the gists I shared above, if you scroll to the end of the link for the socket2 server (see here) then you will see that results are much worse for x.ma101107_rel_withdbg.z11a_c24r64.pk1 (MariaDB 10.11.7) and x.ma110401_rel_withdbg.z11b_c24r64.pk1 (MariaDB 11.4.1) and the obvious change is that r/o and rKB/o were 0 in MariaDB 10.6 and earlier releases but they are non-zero starting in 10.11.7. The r/o column is iostat reads per operation (r/s divided by IPS) and rKB/o is iostat KB read per operation (read KB/s divided by IPS).
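The two derived columns are plain ratios, so they can be reproduced from raw iostat output with a sketch like the one below. The function name is hypothetical and the numbers in the note after it are made up for illustration; they are not taken from the benchmark results.

```python
def per_operation(rate_per_second: float, operations_per_second: float) -> float:
    """Normalize an iostat rate (r/s or rKB/s) by the throughput (e.g. IPS)."""
    if operations_per_second <= 0:
        raise ValueError("operations_per_second must be positive")
    return rate_per_second / operations_per_second
```

For example, an invented r/s of 500 at 10000 inserts per second would give an r/o of 0.05. A value of 0 for both columns means that the workload caused no storage reads at all.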

Marko Mäkelä added a comment - 2024-06-11 12:19

I see that innodb_log_write_ahead_size used to be constrained to a power of 2, which is nice. An additional constraint would seem to be that innodb_log_file_size must be an integer multiple of innodb_log_write_ahead_size. I would disallow SET GLOBAL innodb_log_write_ahead_size while SET GLOBAL innodb_log_file_size (~~MDEV-27812~~) is in progress.
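The two constraints named above (power of 2, and the log file size being an integer multiple) can be checked with a few lines. This is a sketch of the constraints as described in the comment, not the server's actual validation code. The function names are hypothetical.

```python
def is_power_of_two(n: int) -> bool:
    """True if n is a positive power of 2."""
    return n > 0 and (n & (n - 1)) == 0


def is_valid_write_ahead_size(write_ahead_size: int, log_file_size: int) -> bool:
    """Check both constraints discussed in the comment (illustrative only)."""
    return is_power_of_two(write_ahead_size) and log_file_size % write_ahead_size == 0
```

For instance, 4096 passes for a 96 MiB log file, while 3000 is rejected because it is not a power of 2.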

I think that we can rather easily implement this in log_t::write_buf(). A special case is when we are writing to the ib_logfile0 near the start of its payload area (offsets between 12288 and innodb_log_write_ahead_size). In that case, I would disregard the parameter and write in multiples of the physical block size (typically 512 or 4096 bytes), just like we currently do.
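The special case can be sketched as a selection function. This only models the plan described in the comment and is not taken from `log_t::write_buf()`. The function name and the constant name are hypothetical; the value 12288 is the payload start offset given above.

```python
LOG_PAYLOAD_START = 12288  # start of the ib_logfile0 payload area, per the comment


def write_block_size(file_offset: int, write_ahead_size: int, physical_block_size: int) -> int:
    """Pick the write granularity for a log write at file_offset.

    For offsets between the payload start and write_ahead_size, the configured
    size is disregarded and the physical block size is used instead.
    """
    if LOG_PAYLOAD_START <= file_offset < write_ahead_size:
        return physical_block_size
    return write_ahead_size
```

With a 64 KiB write-ahead size and 4096-byte physical blocks, a write at offset 12288 would use 4096 bytes and a write at offset 65536 would use the full 64 KiB. When the write-ahead size is 4096 or less, the special range is empty.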

Marko Mäkelä added a comment - 2024-06-11 14:57

https://github.com/MariaDB/server/pull/3327

Marko Mäkelä added a comment - 2024-06-12 08:53

The innodb_log_write_ahead_size can be at most innodb_log_buffer_size. I think that for maximum usability with some copy-on-write file systems that perform transparent compression, the size should be settable to at least a few megabytes.

Marko Mäkelä added a comment - 2024-06-13 17:58

I have been struggling a bit with the logic. If I change the parameter on the fly, the log would be written from the wrong offset in the buffer, or to the wrong file offset, so that crash recovery would occasionally fail. It is a little unfortunate that in ~~MDEV-14425~~ I decided that the ib_logfile0 payload area would start at 12,288 bytes, which is only divisible by up to 4096.
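The remark about 12,288 can be verified with a short calculation: 12288 = 3 × 4096, so 4096 is the largest power of 2 that divides it. The helper below is a hypothetical name used only to show the arithmetic.

```python
def largest_power_of_two_divisor(n: int) -> int:
    """Largest power of 2 that divides n (n > 0), found by isolating the lowest set bit."""
    if n <= 0:
        raise ValueError("n must be positive")
    return n & -n
```

`largest_power_of_two_divisor(12288)` evaluates to 4096, which is why a payload area starting at that offset cannot be aligned to any larger power-of-2 block size.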

The good news is that I can rather easily reproduce the recovery failures with https://rr-project.org. My current plan is to shrink the requested innodb_log_write_ahead_size if needed, when we are writing near the start or end of the log record payload area. The goal is to ensure that normally, both the file offsets and write lengths are integer multiples of innodb_log_write_ahead_size, which would be a power of 2.

Marko Mäkelä added a comment - 2024-06-15 11:28 - edited

It looks like I finally figured out the solution to the intermittent recovery problem: Whenever the requested block size is too large (we are writing close to the start of ib_logfile0; the record payload area follows a 12 KiB header), and we had previously used a larger write block size, then we must shift (memmove) the contents of the write buffers so that no data will be rewritten at the wrong offset.

While SET GLOBAL innodb_log_file_size is in progress (the log is being resized), it will be possible to execute SET GLOBAL innodb_log_write_ahead_size with an immediate effect.

Some stress testing with SET GLOBAL of innodb_log_file_size and innodb_log_write_ahead_size would be very useful. The server should run a write heavy workload and be killed and restarted.

Additionally, mariadb-backup --backup should be tested while SET GLOBAL innodb_log_write_ahead_size is being executed. (Remember from ~~MDEV-27812~~ that backup is expected to hang if SET GLOBAL innodb_log_file_size is executed, because it would fail to switch to track the resized log file.)

All testing should be conducted while the InnoDB redo log interface is not memory-mapped (such as by building with cmake -DWITH_INNODB_PMEM=OFF), because innodb_log_write_ahead_size has no effect on memory-mapped log.

Mark Callaghan added a comment - 2024-06-19 17:25

I am repeating tests (cached sysbench, cached Insert Benchmark, IO-bound Insert Benchmark) to compare 10.11 with 10.11-MDEV-33894 as of:

commit 6aaf61836f6a07bb2d3d851b10cb5b3485522be7 (HEAD -> 10.11-MDEV-33894, origin/10.11-MDEV-33894)
Merge: 5f33b5eaaa2 34813c1aa07
Author: Marko Mäkelä <marko.makela@mariadb.com>
Date: Wed Jun 19 15:18:49 2024 +0300

Merge 10.11

Debarun Banerjee added a comment - 2024-06-20 13:02

marko Good to see the feature back. Perhaps it was too early to remove it with 4k/8k sector storage still in use. The “read-on-write” avoidance is crucial and I agree with the decision.

I have reviewed the spec and high level changes and here are some initial comments.
This MDEV is more like a feature as opposed to a bug. I think both QA and Doc should be signalled.

I have now started reviewing the core implementation in log0log.cc and will finish soon.

One thing I observed is that with a 4k sector size, DB creation itself hits an assert in debug mode. Any mtr test would fail on my HDD for the same reason. It is not repeatable on SSD or RAM disk with a 512-byte physical sector size.

sudo fdisk -l /dev/sda
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

debanerj@deb-7010:/home/hdd/deb/maria-src5/bld_install$ mariadb-install-db --srcdir=/home/hdd/deb/maria-src5/ --datadir=./maria_data
Installing MariaDB/MySQL system tables in './maria_data' ...
mariadbd: /home/hdd/deb/maria-src5/storage/innobase/log/log0log.cc:852: lsn_t log_t::write_buf() [with bool release_latch = true; lsn_t = long unsigned int]: Assertion `write_size_1 >= block_size_1' failed.
240620 18:12:48 [ERROR] mysqld got signal 6 ;

log/log0log.cc:853(unsigned long log_t::write_buf<true>())[0x57899ef7e046]
log/log0log.cc:1040(log_write_up_to(unsigned long, bool, completion_callback const*))[0x57899ef7b1a3]
buf/buf0flu.cc:1954(log_checkpoint_low(unsigned long, unsigned long))[0x57899f22b9e8]
buf/buf0flu.cc:1998(log_checkpoint())[0x57899f22bc34]
buf/buf0flu.cc:2098(buf_flush_wait_flushed(unsigned long))[0x57899f22c396]
buf/buf0flu.cc:2004(log_make_checkpoint())[0x57899f22bc5a]
srv/srv0start.cc:246(create_log_file(bool, unsigned long))[0x57899f12bd28]
srv/srv0start.cc:1356(srv_start(bool))[0x57899f12fccd]
handler/ha_innodb.cc:4251(innodb_init(void*))[0x57899ee78325]
sql/handler.cc:655(ha_initialize_handlerton(st_plugin_int*))[0x57899ea4a51f]
sql/sql_plugin.cc:1454(plugin_do_initialize(st_plugin_int*, unsigned int&))[0x57899e6b002f]
sql/sql_plugin.cc:1507(plugin_initialize(st_mem_root*, st_plugin_int*, int*, char**, bool))[0x57899e6b0376]
sql/sql_plugin.cc:1765(plugin_init(int*, char**, int))[0x57899e6b10e8]
sql/mysqld.cc:5256(init_server_components())[0x57899e4f1e31]
sql/mysqld.cc:5882(mysqld_main(int, char**))[0x57899e4f317f]
sql/main.cc:34(main)[0x57899e4e799d]

Mark Callaghan added a comment - 2024-06-20 16:39 - edited

Marko - do I need to change any options? The problem (unexpected read IO) remains with 10.11-MDEV-33894 when I use the same options I used when first reporting this problem.

Options are here because the jira markup isn't working for me

Debarun Banerjee added a comment - 2024-06-20 17:51

Hi mdcallag

The patch brings back the configuration "innodb_log_write_ahead_size" but the default size is still 512 bytes. So, the default behaviour would remain the same. For a 4k file system sector size the configuration needs to be adjusted. Can you please see if setting the option to the following values helps?

1. innodb_log_write_ahead_size=8192 [pre 10.8 default, also MySQL 5.7/8.0 default]
2. innodb_log_write_ahead_size=4096 [matching 4k sector size]

Vladislav Vaintroub added a comment - 2024-06-20 18:17

So, for me, without the patch, log size for write was matching 4K sector size, and I did not have to do anything for that. And there was no unexpected read IO, because I do not use XFS.
The life was good already, for me. Did something change now?

Mark Callaghan added a comment - 2024-06-21 15:37

tl;dr - problem is fixed with the patch if I set innodb_log_write_ahead_size to 4k or 8k. By default it is =512. Some results are here and by "fixed" I mean the r/o and rKB/o columns drop to 0. Both are (value from iostat / QPS) and the value from iostat is r/s for r/o and rKB/s for rKB/o.

I thought that =512 might be a function of the storage device I use – see the lsblk -t results below, but from looking at the code it looks like it is hardwired to 512 – see the code at this point in time

Using this patch

On a PN53 (see v7 here) with Ubuntu 22.04, XFS on 1 NVMe device (Samsung SSD 980 PRO 1TB)

$ lsblk -t /dev/nvme1n1
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE RA WSAME
nvme1n1 0 512 0 512 512 0 none 1023 128 0B

Marko Mäkelä added a comment - 2024-06-25 10:27

Thank you! I realized last week that on Linux we only check the /sys/block/*/queue/physical_block_size when O_DIRECT access (innodb_log_file_buffering=OFF) is being used. Maybe XFS does not support O_DIRECT at all, and therefore we would end up using the hard-wired 512-byte default size, without bothering to check it.
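The behavior described in this paragraph can be summarized in a small model. It is a restatement of the comment, not the actual server code. The names are hypothetical, and falling back to 512 when the physical block size is unknown is an assumption of this sketch.

```python
from typing import Optional

DEFAULT_BLOCK_SIZE = 512  # the hard-wired default mentioned in the comment


def log_block_size(o_direct: bool, physical_block_size: Optional[int]) -> int:
    """Model: the physical block size is consulted only when O_DIRECT is in use."""
    if o_direct and physical_block_size is not None:
        return physical_block_size
    return DEFAULT_BLOCK_SIZE
```

Under this model, a device reporting 4096 bytes would still end up with 512-byte log blocks whenever the log is opened in buffered mode.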

I did not find any mention of an XFS logical block size. For ZFS, there would be volblocksize and recordsize, which default to 16 KiB and 128 KiB, respectively. On ZFS, I would assume that we would want to set innodb_log_write_ahead_size to match one of these parameters. The bcachefs documentation mentions block_size and btree_node_size being 4 KiB and 256 KiB by default, respectively. I would imagine that copy-on-write file systems that implement transparent compression would benefit from using a larger size. That is why I would allow the innodb_log_write_ahead_size to be set to up to 16 MiB or the current value of innodb_log_buffer_size.
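The defaults quoted above can be collected into a lookup that proposes a candidate setting. The sizes come from the comment and should be checked against your own file system. Reading the two upper limits as "whichever is smaller" is an assumption of this sketch, and all names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB

# Default sizes as quoted in the comment; verify them on your own system.
FS_DEFAULTS = {
    ("ZFS", "volblocksize"): 16 * KIB,
    ("ZFS", "recordsize"): 128 * KIB,
    ("bcachefs", "block_size"): 4 * KIB,
    ("bcachefs", "btree_node_size"): 256 * KIB,
}

MAX_WRITE_AHEAD = 16 * MIB  # upper limit proposed in the comment


def candidate_write_ahead_size(fs: str, parameter: str, log_buffer_size: int) -> int:
    """Return the quoted file system default, capped by both proposed limits."""
    return min(FS_DEFAULTS[(fs, parameter)], MAX_WRITE_AHEAD, log_buffer_size)
```

For ZFS `recordsize` with a 16 MiB log buffer this gives 131072 bytes (128 KiB). All of the quoted defaults are powers of 2, so they would satisfy the power-of-2 constraint discussed earlier in the thread.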

I am still analyzing an rr replay trace of a recovery error that occurs when innodb_log_file_size and innodb_log_write_ahead_size are being changed concurrently. In an earlier revision, the innodb_log_write_ahead_size was ‘frozen’ during log resizing and this problem might not exist. I hope to find the exact root cause soon.

Vladislav Vaintroub added a comment - 2024-06-25 11:11 - edited

On Windows, so far the block size was correctly determined by GetFileInformationByHandleEx / FileStorageInfo . I got 4096, in 10.11

2024-06-25 10:28:30 0 [Note] InnoDB: File system buffers for log disabled (block size=4096 bytes)

It remains 4096, in non-default case, if buffering is used

Marko Mäkelä added a comment - 2024-06-25 11:37

As far as I can tell, in the rr replay trace that I have been analyzing, we are writing the same garbage also to the "main" ib_logfile0 file that was later replaced when the being-resized ib_logfile101 was renamed. I plan to revise this so that any requested change of innodb_log_write_ahead_size will be buffered until the next log checkpoint. In this way, we can make SET GLOBAL innodb_log_write_ahead_size trigger a log checkpoint as well. Until now, we have been missing a convenient way of doing that outside debug-instrumented builds.

Marko Mäkelä added a comment - 2024-06-25 14:32

I am also facing recovery errors if I statically set innodb_log_write_ahead_size to 8192 bytes. An easy way out would seem to be to limit the innodb_log_write_ahead_size to be a power of 2 and between 512 and 4096 bytes. This is almost as ‘good’ as the parameter used to be; it previously was allowed to be up to 16384 bytes. The ~~MDEV-14425~~ log file format was designed with that in mind. As far as I can tell, this should still be compatible with most storage. Hopefully, ZFS, btrfs, bcachefs will have some cache in the kernel on their own for any larger block size that they may use.

Marko Mäkelä added a comment - 2024-06-26 12:31

For now, I gave up with the more ambitious fix https://github.com/MariaDB/server/pull/3327 and implemented something much simpler in https://github.com/MariaDB/server/pull/3363:

innodb_log_write_ahead_size is read-only, a power of 2 between 512 and 4096 bytes.

This will trivially work, because the log file format had been designed with those block sizes in mind.

I don’t know how common it would be to store the ib_logfile0 on a RAID system. In that scenario, I can imagine that a significantly larger innodb_log_write_ahead_size would be helpful.

Mark Callaghan added a comment - 2024-06-26 17:03

Is the ~~MDEV-33894~~ branch still valid?
Perhaps the comment ("does not work") really means it does not work.

Because I can't start InnoDB from a build from the latest commit – 4444436e

—
240626 16:59:49 mysqld_safe Starting mariadbd daemon with databases from /data/m/my/data
2024-06-26 16:59:49 0 [Note] Starting MariaDB 10.11.9-MariaDB-log source revision 4444436eeab577e85f79ebdf818c9a73284d719d as process 804342
2024-06-26 16:59:49 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2024-06-26 16:59:49 0 [Note] InnoDB: Number of transaction pools: 1
2024-06-26 16:59:49 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2024-06-26 16:59:49 0 [Note] InnoDB: Using Linux native AIO
2024-06-26 16:59:49 0 [Note] InnoDB: Initializing buffer pool, total size = 23.000GiB, chunk size = 368.000MiB
2024-06-26 16:59:49 0 [Note] InnoDB: Completed initialization of buffer pool
2024-06-26 16:59:49 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
2024-06-26 16:59:49 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT(57992) at 57992
2024-06-26 16:59:49 0 [ERROR] InnoDB: Log scan aborted at LSN 57992
2024-06-26 16:59:49 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
2024-06-26 16:59:49 0 [Note] InnoDB: Starting shutdown...
2024-06-26 16:59:50 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2024-06-26 16:59:50 0 [Note] Plugin 'FEEDBACK' is disabled.
2024-06-26 16:59:50 0 [Warning] 'default-authentication-plugin' is MySQL 5.6 / 5.7 compatible option. To be implemented in later versions.
2024-06-26 16:59:50 0 [ERROR] Unknown/unsupported storage engine: InnoDB
2024-06-26 16:59:50 0 [ERROR] Aborting
240626 16:59:50 mysqld_safe mysqld from pid file /data/m/my/data/pn53-3.pid ended
—

commit 4444436eeab577e85f79ebdf818c9a73284d719d (HEAD -> 10.11-MDEV-33894.jun26, origin/10.11-MDEV-33894)
Author: Marko Mäkelä <marko.makela@mariadb.com>
Date: Wed Jun 26 14:11:22 2024 +0300

WIP: Try to fix things (does not work)

Let us see if it would help to apply the changes of
innodb_log_write_ahead_size on log checkpoint completion.

TODO: Many things would be easier if we make innodb_log_write_ahead_size
a read-only parameter, with a maximum of 4096 bytes.

Marko Mäkelä added a comment - 2024-06-27 06:06

For future reference, I had pushed the current broken state to https://github.com/MariaDB/server/pull/3327 and it really does not work, like the commit message says. I explicitly aborted the CI run for it, as you can see in the grid view.

I just checked https://github.com/MariaDB/server/pull/3363 and at the top of the page it does mention the branch name: "wants to merge 1 commit into 10.11 from …." In the "Commits" tab you can also find a link to the commit. You could download it in two different forms with

wget https://github.com/MariaDB/server/pull/3363/commits/4ae9e8ef058d472f92ab605d8741ade234544b8b.diff

wget https://github.com/MariaDB/server/pull/3363/commits/4ae9e8ef058d472f92ab605d8741ade234544b8b.patch

One of these forms is usable with git am.

Debarun Banerjee added a comment - 2024-06-27 09:53

I am now done with the review.

Dynamic Configuration Patch
I had shared my comments earlier. I understand we would like to defer this patch for now.

Read Only Configuration Patch
I agree with the patch. Please check my comments.

Marko Mäkelä added a comment - 2024-06-28 06:21

The read-only parameter innodb_log_write_ahead_size with the default value 512 and the allowed values 512, 1024, 2048, 4096 was introduced. Up to MariaDB Server 10.6, additional allowed values were 8192, 16384, and the default value was 8192. On Linux and Microsoft Windows, the default or the specified innodb_log_write_ahead_size will be automatically adjusted to not be less than the physical block size (if it can be determined).
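
The rule stated above can be sketched roughly as follows. This is an illustrative Python model, not the server's code; the names `ALLOWED_SIZES` and `effective_write_ahead_size` are hypothetical. The thread does not say what happens when the physical block size exceeds 4096 bytes, so the sketch simply takes the larger of the two values.

```python
from typing import Optional

# Values accepted for innodb_log_write_ahead_size according to the comment above.
ALLOWED_SIZES = (512, 1024, 2048, 4096)


def effective_write_ahead_size(requested: int = 512,
                               physical_block_size: Optional[int] = None) -> int:
    """Model of the startup-time adjustment described in the comment.

    `requested` is the configured (or default) value. `physical_block_size`
    is None when the block size could not be determined.
    """
    if requested not in ALLOWED_SIZES:
        raise ValueError(f"innodb_log_write_ahead_size must be one of {ALLOWED_SIZES}")
    if physical_block_size is None:
        return requested
    # Never use a write-ahead size smaller than the physical block size.
    return max(requested, physical_block_size)


print(effective_write_ahead_size())            # 512
print(effective_write_ahead_size(512, 4096))   # 4096
print(effective_write_ahead_size(4096, 512))   # 4096
```

Because the parameter is read-only, a value like this can only be chosen in the configuration file or on the command line at startup, not with SET GLOBAL.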

The more ambitious fix (to make the parameter settable at runtime and to allow larger values) in https://github.com/MariaDB/server/pull/3327 might be revisited later.

Mark Callaghan added a comment - 2024-07-26 23:13

The ~~MDEV-33894~~ fix makes a big difference. I was able to show that MariaDB is ~10% faster than MySQL on a medium server.
https://smalldatum.blogspot.com/2024/07/sysbench-on-medium-server-mariadb-is.html

Mark Callaghan added a comment - 2024-09-05 22:06 - edited

This is still a problem for me. Or perhaps I don't understand what was fixed.

I think that I got the numbers listed below (512 vs 8192) backwards. See my next comment.
When I don't set innodb_log_write_ahead_size in etc/my.cnf the value of innodb_log_write_ahead_size is:

512 in MariaDB versions 10.4.33, 10.4.34, 10.5.25, 10.5.26, 10.6.18, 10.6.19, 10.11.7, 10.11.9
8192 in MariaDB 11.x

And those default values (512 prior to 11.x, 8192 in 11.x) are documented
https://mariadb.com/kb/en/innodb-system-variables/#innodb_log_write_ahead_size

When I repeat sysbench with a cached database using the latest point release versions then I still see a regression in throughput for update-index (see here). The numbers are throughput relative to 10.2.44 and the relative throughput is 1.08 in column 4 (10.6.19) and drops to 0.90 in column 5 (10.11.9).

From results for more versions, focus on columns 8, 9 and 10 which have ...

column 8 -> MariaDB 10.11.8 with default for innodb_log_write_ahead_size
column 9 -> MariaDB 10.11.9 with default value for innodb_log_write_ahead_size (8192)
column 10 -> MariaDB 10.11.9 with innodb_log_write_ahead_size=4096 in etc/my.cnf

Results are here and the performance improves in column 10. A similar pattern occurs for MariaDB 11.4, 11.5 and 11.6. Entries with "z11a_lwas4k_c8r32" set innodb_log_write_ahead_size to 4096, otherwise I don't set it.

Mark Callaghan added a comment - 2024-12-20 21:30

When I don't set innodb_log_write_ahead_size in my.cnf, it is 8192 for MariaDB 10.6.20 vs 512 for MariaDB 10.11.10, 11.4.4, 11.5.2, 11.6.2 and 11.7.1.

For 10.6, 10.11, 11.4, 11.5, 11.6 and 11.7 I repeated sysbench with two my.cnf files:

my.cnf.cz11a_c8r32 or my.cnf.cz11b_c8r32 - these do not set innodb_log_write_ahead_size
my.cnf.cz11a_lwas4k_c8r32 or my.cnf.cz11b_lwas4k_c8r32 - these set innodb_log_write_ahead_size = 4096

And with sysbench I see a big improvement in QPS for 10.11, 11.4, 11.5, 11.6 and 11.7 when I set innodb_log_write_ahead_size on the write-heavy benchmark steps. The largest improvement might be on the update-index microbenchmark (see here). The numbers are the relative QPS (rQPS) which is (QPS for my version) / (QPS for MariaDB 10.5.27) and on update-index the rQPS in MariaDB 11.7.1 is 0.76 when I don't set innodb_log_write_ahead_size vs 1.08 when I set it to 4096. Similar results occur for versions 10.11.10, 11.4.4, 11.5.2 and 11.6.2.
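
For readers unfamiliar with the metric, the relative QPS defined above is a plain ratio. The sketch below uses a hypothetical helper name, `relative_qps`, and made-up QPS figures chosen only so that the ratios come out as 0.76 and 1.08; they are not the measured values from the benchmark.

```python
def relative_qps(qps: float, base_qps: float) -> float:
    """rQPS = (QPS for the version under test) / (QPS for the base version)."""
    if base_qps <= 0:
        raise ValueError("base_qps must be positive")
    return qps / base_qps


# Made-up absolute numbers, for illustration only.
base = 10000.0    # stands in for the MariaDB 10.5.27 result
print(round(relative_qps(7600.0, base), 2))    # 0.76
print(round(relative_qps(10800.0, base), 2))   # 1.08
```

An rQPS below 1.0 means the version is slower than the base version on that microbenchmark, and a value above 1.0 means it is faster.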
