MariaDB Server / MDEV-34062

mariadb-backup --backup is extremely slow at copying ib_logfile0

Details

    Description

      While preparing to test MDEV-33817 with a recovery-type workload (MDEV-29911), I noticed that in MariaDB Server 10.11 (but not 10.6), copying the InnoDB write-ahead log (ib_logfile0) takes an extremely long time, proceeding at no more than 3 MiB/s from a RAM disk. Here are some perf record samples from my local system:

      10.11 9dfef3fb4178e44b3f3da029995ef11df8360235 on Debian linux-image-6.7.12-amd64 (2024-04-24)

        44,19%  mariadb-backup  mariadb-backup     [.] recv_sys_t::parse_mtr_result recv_sys_t::parse<recv_buf, false>(recv_buf&, bool)
        12,35%  mariadb-backup  [kernel.kallsyms]  [k] rep_movs_alternative
         8,14%  mariadb-backup  mariadb-backup     [.] crc32c_3way
         7,12%  mariadb-backup  [kernel.kallsyms]  [k] clear_page_erms
         1,81%  mariadb-backup  [kernel.kallsyms]  [k] filemap_get_entry
         1,78%  mariadb-backup  [kernel.kallsyms]  [k] shmem_file_read_iter
         1,60%  mariadb-backup  [kernel.kallsyms]  [k] mem_cgroup_commit_charge
         1,10%  mariadb-backup  [kernel.kallsyms]  [k] get_mem_cgroup_from_mm
         1,10%  mariadb-backup  [kernel.kallsyms]  [k] shmem_get_folio_gfp
         0,96%  mariadb-backup  [kernel.kallsyms]  [k] xas_descend
         0,92%  mariadb-backup  [kernel.kallsyms]  [k] get_page_from_freelist
         0,92%  mariadb-backup  mariadb-backup     [.] recv_sys_t::parse_mtr_result recv_sys_t::parse_mtr<false>(bool)
         0,86%  mariadb-backup  [kernel.kallsyms]  [k] try_charge_memcg
         0,80%  mariadb-backup  mariadb-backup     [.] my_crc32c
         0,74%  mariadb-backup  [kernel.kallsyms]  [k] __list_del_entry_valid_or_report
         0,61%  mariadb-backup  [kernel.kallsyms]  [k] __rmqueue_pcplist
         0,55%  mariadb-backup  [kernel.kallsyms]  [k] lru_gen_add_folio
         0,55%  mariadb-backup  [kernel.kallsyms]  [k] _copy_to_iter
         0,50%  mariadb-backup  [kernel.kallsyms]  [k] __rcu_read_unlock
         0,49%  mariadb-backup  [kernel.kallsyms]  [k] alloc_pages_mpol
         0,49%  mariadb-backup  mariadb-backup     [.] xtrabackup_copy_logfile()
      

      The problem appears to be the default value of innodb_log_buffer_size (log_sys.buf_size), which is only 2 MiB. If I specify --innodb-log-buffer-size=1g, then xtrabackup_copy_logfile() will process much more data per system call, the backup will finish in reasonable time during my Sysbench workload, and the top of the perf report will look more reasonable as well:

      10.11 9dfef3fb4178e44b3f3da029995ef11df8360235

        31,06%  mariadb-backup  mariadb-backup        [.] recv_sys_t::parse_mtr_result recv_sys_t::parse<recv_buf, false>(recv_buf&, bool)
        20,14%  mariadb-backup  [kernel.kallsyms]     [k] rep_movs_alternative
        10,01%  mariadb-backup  [kernel.kallsyms]     [k] clear_page_erms
         5,72%  mariadb-backup  mariadb-backup        [.] crc32c_3way
         3,86%  mariadb-backup  libc.so.6             [.] __memmove_avx_unaligned_erms_rtm
         1,58%  mariadb-backup  [kernel.kallsyms]     [k] filemap_get_entry
         1,57%  mariadb-backup  libc.so.6             [.] __memcmp_avx2_movbe_rtm
         1,50%  mariadb-backup  [kernel.kallsyms]     [k] shmem_get_folio_gfp
         1,40%  mariadb-backup  [kernel.kallsyms]     [k] xas_descend
         1,04%  mariadb-backup  [kernel.kallsyms]     [k] get_mem_cgroup_from_mm
         1,02%  mariadb-backup  [kernel.kallsyms]     [k] __mem_cgroup_charge
         0,98%  mariadb-backup  [kernel.kallsyms]     [k] __rmqueue_pcplist
         0,96%  mariadb-backup  [kernel.kallsyms]     [k] shmem_file_read_iter
         0,93%  mariadb-backup  [kernel.kallsyms]     [k] fault_in_readable
         0,92%  mariadb-backup  [kernel.kallsyms]     [k] get_page_from_freelist
         0,78%  mariadb-backup  mariadb-backup        [.] recv_sys_t::parse_mtr_result recv_sys_t::parse_mtr<false>(bool)
         0,68%  mariadb-backup  [kernel.kallsyms]     [k] xas_load
         0,59%  mariadb-backup  [kernel.kallsyms]     [k] xas_start
         0,58%  mariadb-backup  [kernel.kallsyms]     [k] try_charge_memcg
         0,57%  mariadb-backup  [kernel.kallsyms]     [k] _copy_to_iter
         0,52%  mariadb-backup  [kernel.kallsyms]     [k] copy_page_from_iter_atomic
         0,52%  mariadb-backup  [kernel.kallsyms]     [k] lru_gen_add_folio
         0,50%  mariadb-backup  [kernel.kallsyms]     [k] __alloc_pages
         0,44%  mariadb-backup  [kernel.kallsyms]     [k] shmem_add_to_page_cache
         0,40%  mariadb-backup  [kernel.kallsyms]     [k] generic_perform_write
         0,39%  mariadb-backup  [kernel.kallsyms]     [k] __mod_memcg_lruvec_state
         0,38%  mariadb-backup  [kernel.kallsyms]     [k] copy_page_to_iter
         0,35%  mariadb-backup  [kernel.kallsyms]     [k] lru_add_fn
         0,34%  mariadb-backup  [kernel.kallsyms]     [k] __rcu_read_lock
         0,34%  mariadb-backup  mariadb-backup        [.] xtrabackup_copy_logfile()
      

      The large share of kernel functions should be due to file I/O. This was on /dev/shm. The function xtrabackup_copy_logfile() always invokes pread system calls to read ib_logfile0.

      Before MDEV-14425, the default value of innodb_log_buffer_size in mariadb-backup was 1 MiB, but as far as I understand, the parameter was being ignored and the log was apparently being read RECV_SCAN_SIZE bytes at a time, that is, 4*innodb_page_size (by default, 64 KiB). This must have been acceptable when the log file was organized in blocks of 512 bytes. With the variable-size logical block format that was introduced in MDEV-14425, we would apparently be better off reading larger chunks at a time.

      I retested this on the 72-thread test system where I originally hit the problem yesterday. At innodb_log_buffer_size=128m the backup would manage to copy the log, but the operation would take longer than 40 seconds, at which point the server had been shut down by the benchmark script.

      [00] FATAL ERROR: 2024-05-02 23:10:26 failed to execute query BACKUP STAGE BLOCK_COMMIT: Server has gone away
      

      At innodb_log_buffer_size=256m it is better:

      [00] 2024-05-02 23:12:55 Finished backing up non-InnoDB tables and files
      [00] 2024-05-02 23:12:55 Waiting for log copy thread to read lsn 2039243060
      …
      [00] 2024-05-02 23:13:03 Redo log (from LSN 46894 to 2645660131) was copied.
      [00] 2024-05-02 23:13:03 completed OK!
      

      On my local system (dual socket Xeon E5-2630 v4, 40 concurrent Sysbench connections to match the 2×2×10 hardware threads), the backup can barely keep up with innodb_log_buffer_size=32m, finishing in 42 seconds. With innodb_log_buffer_size=128m it finishes in 11 seconds. Note that this is an inaccurate test setup; I was manually starting the backup roughly 10 seconds into the write workload, and the InnoDB data files are growing during this workload.

      Possibly we should increase the default value of innodb_log_buffer_size, based on some more testing in various environments (NVMe, SSD, HDD of different speeds). My test setting is completely artificial; the /dev/shm/*/ib_logfile0 is written by the server via the memory-mapped log_sys.buf while the backup uses regular file system calls. The original motivation of this test was to measure the performance of a subsequent test (mariadb-backup --prepare).

      This observation is related to MDEV-30000 and MDEV-31410 too. It could be a good idea to SET GLOBAL innodb_log_file_buffering=ON in the server during the backup, and specify a larger innodb_log_buffer_size.

      Ultimately, we should implement the log copying in the server process (MDEV-14992).

      Attachments

        Issue Links

          Activity

            marko Marko Mäkelä added a comment - - edited

             This may also affect the execution time of mariadb-backup --prepare. In a test that I just conducted on my local system, specifying a log buffer of 1 GiB instead of the default 2 MiB would increase the time needed to process a 6.2 GiB ib_logfile0 from /dev/shm by 3 seconds. It is hard to tell if this was due to the log parsing or some other bottlenecks (see MDEV-34070).

            It occurred to me that maybe mariadb-backup should simply attempt to mmap with MAP_SHARED and PROT_READ to read the log file. In that way, the parameter innodb_log_buffer_size would only be applicable to 32-bit systems or cases where mmap fails with an error such as ENODEV or ENOMEM. A memory-mapped interface could allow mariadb-backup to access the log file contents straight from the operating system’s block cache, without invoking any system calls.
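            The mmap idea could be sketched roughly as follows; the helper names are my own, and the error handling is reduced to the fallback decision:

```cpp
// Sketch of the idea above (an illustration, not the actual patch):
// try to map ib_logfile0 read-only with MAP_SHARED, and fall back to
// pread() into a buffer of innodb_log_buffer_size bytes when mmap()
// fails, e.g. with ENODEV on filesystems that do not support mapping,
// or ENOMEM in a 32-bit address space.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

// Returns a read-only mapping of the whole file, or MAP_FAILED.
void *map_log(int fd, size_t *size)
{
  struct stat st;
  if (fstat(fd, &st) != 0)
    return MAP_FAILED;
  *size = size_t(st.st_size);
  return mmap(nullptr, *size, PROT_READ, MAP_SHARED, fd, 0);
}

void read_log(int fd)
{
  size_t size;
  void *log = map_log(fd, &size);
  if (log == MAP_FAILED)
  {
    // fall back to the pread() loop over a log_sys.buf_size buffer
    return;
  }
  // parse the log straight from the kernel's block cache,
  // without any further system calls
  munmap(log, size);
}
```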


            As far as I understand, each time when mariadb-backup --backup reaches the end of the memory-mapped log, at an LSN that is less than the backup target LSN, it must invoke msync() with MS_INVALIDATE in order to observe the latest writes to the ib_logfile0 by the mariadbd process.

            Related to this, I plan to experiment with an innodb_log_file_buffering=mmap option in the server. This new option value would make us use mmap for writing the log. Unlike the PMEM-based log, this would require explicit msync() calls, both with MS_ASYNC (roughly equivalent to an asynchronous pwrite(), kind of MDEV-26603) and with MS_SYNC (roughly equivalent to a combination of pwrite() and fdatasync()).
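            A rough sketch of the msync() protocol described above; the function names and the offset arithmetic are my own illustration under these assumptions, not the server code:

```cpp
// Illustrative sketch of the msync() calls discussed above: MS_SYNC
// for durable writes on the server side and MS_INVALIDATE for the
// backup reader observing the tail of the mapped log.
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

// Writer (mariadbd): append a record to the mapped log and make it
// durable. msync() requires a page-aligned starting address, so round
// the dirty range down to a page boundary.
int log_write_durable(char *map, size_t ofs, const void *rec,
                      size_t len, size_t pagesize)
{
  memcpy(map + ofs, rec, len);
  char *page = map + (ofs & ~(pagesize - 1));
  size_t n = len + (ofs & (pagesize - 1));
  // MS_ASYNC would merely schedule the write-back (roughly an
  // asynchronous pwrite()); MS_SYNC returns only once the data is
  // on disk (roughly pwrite() followed by fdatasync()).
  return msync(page, n, MS_SYNC);
}

// Reader (mariadb-backup): before re-reading the tail of the mapped
// log at an LSN below the backup target LSN, invalidate possibly
// stale cached pages so the server's latest writes become visible.
int log_refresh_tail(char *map, size_t ofs, size_t len,
                     size_t pagesize)
{
  char *page = map + (ofs & ~(pagesize - 1));
  return msync(page, len + (ofs & (pagesize - 1)), MS_INVALIDATE);
}
```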

            marko Marko Mäkelä added a comment

            For reproducing the slowness, it suffices to start a backup during a larger sysbench prepare:

            sysbench /usr/share/sysbench/oltp_update_index.lua --mysql-socket="$MYSQL_SOCK" --mysql-user="$MYSQL_USER" --mysql-db=test \
            --tables=64 --table_size=100000 prepare &
            mariadb-backup --innodb-log-buffer-size=1g -u"$MYSQL_USER" --socket="$MYSQL_SOCK" --backup --target-dir=/dev/shm/backup
            

            If I omit the innodb_log_buffer_size, the Sysbench DDL workload (creating and populating tables) will be paused for an extremely long time, and the mariadb-backup log will only very slowly show some progress in copying the log:

            10.11 f01e6503f4151a8d3d3f0495dab22c47ec794e67

            2024-05-21  8:13:50 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
            [00] 2024-05-21 08:13:50 mariabackup: Generating a list of tablespaces
            [00] 2024-05-21 08:13:51 DDL tracking : create 8 "./test/sbtest1.ibd"
            [00] 2024-05-21 08:13:51 Connecting to MariaDB server host: localhost, user: root, password: not set, port: not set, socket: /dev/shm/sbtest/mariadb.sock
            [00] 2024-05-21 08:13:51 Connecting to MariaDB server host: localhost, user: root, password: not set, port: not set, socket: /dev/shm/sbtest/mariadb.sock
            [00] 2024-05-21 08:13:51 BACKUP STAGE START
            [00] 2024-05-21 08:13:51 Acquiring BACKUP LOCKS...
            …
            [00] 2024-05-21 08:13:51 BACKUP STAGE FLUSH
            …
            [00] 2024-05-21 08:13:51 Finished backing up non-InnoDB tables and files
            [00] 2024-05-21 08:13:51 Waiting for log copy thread to read lsn 751079805
            [00] 2024-05-21 08:14:03 DDL tracking : create 9 "./test/sbtest2.ibd"
            [00] 2024-05-21 08:14:16 DDL tracking : create 10 "./test/sbtest3.ibd"
            

            With the larger innodb_log_buffer_size, the entire backup completes in 7 seconds, even though there is much more log to copy:

            10.11 f01e6503f4151a8d3d3f0495dab22c47ec794e67

            2024-05-21  8:15:24 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
            [00] 2024-05-21 08:15:24 mariabackup: Generating a list of tablespaces
            [00] 2024-05-21 08:15:25 DDL tracking : create 8 "./test/sbtest1.ibd"
            [00] 2024-05-21 08:15:25 DDL tracking : create 9 "./test/sbtest2.ibd"
            …
            [00] 2024-05-21 08:15:31 Redo log (from LSN 47562 to 1752320328) was copied.
            

            Note: I am using a large enough innodb_log_file_size and innodb_buffer_pool_size on the server, so that no page flushing needs to take place. The checkpoint LSN 47562 is from the shutdown of the server bootstrap.

            marko Marko Mäkelä added a comment

            With mmap-based log parsing, it is quite a bit faster while using less memory:

            10.11 f01e6503f4151a8d3d3f0495dab22c47ec794e67 with patch

            [00] 2024-05-21 15:38:46 mariabackup: Generating a list of tablespaces
            [00] 2024-05-21 15:38:47 DDL tracking : create 8 "./test/sbtest1.ibd"
            [00] 2024-05-21 15:38:47 DDL tracking : create 9 "./test/sbtest2.ibd"
            …
            [00] 2024-05-21 15:38:51 Copying ./test/sbtest63.ibd to /dev/shm/bu/test/sbtest63.new
            …
            [00] 2024-05-21 15:38:52 Redo log (from LSN 47514 to 1780641452) was copied.
            

            This is not yet the final version. Even though this copied 28,321,172 bytes more log than the file I/O based experiment with innodb_log_buffer_size=1g, the operation completed 1 second faster.

            marko Marko Mäkelä added a comment

            It seems that at least on Linux 6.8.9, mmap will be updated automatically when an O_DIRECT written log file resides on ext4fs on an NVMe device. I do not know what exactly msync() with MS_INVALIDATE could be needed for, but apparently it is not really necessary, based on my simple testing.

            marko Marko Mäkelä added a comment

            https://github.com/MariaDB/server/pull/3282 implements a settable Boolean configuration parameter innodb_log_file_mmap that would be set ON by default in mariadb-backup and OFF by default in mariadbd. The use of memory-mapped log writes should allow direct access to the circular log file from the operating system kernel’s block cache, so mariadb-backup would only copy the log once, directly from the memory that backs the server’s ib_logfile0 to the backup copy of the log.

            marko Marko Mäkelä added a comment
            mleich Matthias Leich added a comment - - edited

            origin/10.11-MDEV-34062 617bac1d86f5c3d6351746d7000582a2ce5f9efb 2024-05-29T14:04:00+03:00
            behaved well in RQG testing. No new problems.
            Setup:

            • innodb_log_file_mmap was not set
            • filesystems used for storage
              tmpfs
              ext4 within some container located in tmpfs
            marko Marko Mäkelä added a comment - - edited

            I ran some performance tests on a SATA 3.0 HDD (Western Digital Blue WDC WD20EZRZ-00Z5HB0) with the following parameters:

            cat /sys/block/sdb/queue/hw_sector_size /sys/block/sdb/queue/physical_block_size

            512
            4096
            

            InnoDB is referring to the latter when it comes to pwrite.

            The best test setup that I came up with for this problem is to "store" the database in /dev/shm and only point innodb_log_group_home_dir to the HDD, so that the only real I/O will be for the ib_logfile0.

            With this setup, the fallocate(FALLOC_FL_ZERO_RANGE) calls turn out to actually reduce performance. Here are some results:

            innodb_log_file_buffering  innodb_log_file_mmap                       throughput/tps
            ON                         OFF                                        26758.98
            OFF                        OFF                                        29163.18
            ON                         ON, without MADV_DONTNEED                  26700.93
            ON                         ON, with MADV_DONTNEED, with fallocate     19817.50
            ON                         ON, with MADV_DONTNEED, without fallocate  27504.67
            OFF                        ON, without MADV_DONTNEED                  26891.79
            OFF                        ON, with MADV_DONTNEED, with fallocate     18609.23
            OFF                        ON, with MADV_DONTNEED, without fallocate  28082.70

            The test workload was 60-second Sysbench oltp_write_only, 5×1000 tables, large enough buffer pool, innodb_log_file_size=100m, end LSN>18 times the log file size.

            The results are rather consistent: the fallocate() causes more work for a subsequent fdatasync() (the "these blocks pretend to be all zero" metadata must be updated), and it does not seem to help with the read-before-write problem that we would expect to observe when running with innodb_log_file_buffering=OFF (O_DIRECT). By far, we get the fastest log writes when using pwrite and O_DIRECT. The MADV_DONTNEED, which a ‘well-behaving software citizen’ should invoke, seems to help a little in the innodb_log_file_buffering=ON case.
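            For reference, the two calls being compared could look roughly like this in isolation (my own sketch with made-up helper names, not the actual patch):

```cpp
// FALLOC_FL_ZERO_RANGE zeroes a file range through metadata, which the
// experiment above shows makes a subsequent fdatasync() more expensive;
// MADV_DONTNEED releases mapped log pages that a checkpoint has made
// irrelevant, at the cost of page faults on the next access.
#define _GNU_SOURCE 1
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cerrno>

int zero_log_range(int fd, off_t ofs, off_t len)
{
  // marks the range as zero without writing the zeroes themselves;
  // the deallocation metadata must be flushed by the next fdatasync()
  return fallocate(fd, FALLOC_FL_ZERO_RANGE, ofs, len);
}

int release_log_pages(void *map, size_t len)
{
  // give the cached pages back to the kernel; the next access to the
  // range will fault them in again (read-before-write)
  return madvise(map, len, MADV_DONTNEED);
}
```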

            The conclusion of this experiment is that implementing an mmap interface for log writes will unavoidably suffer from the MDEV-33894 phenomenon, but it does not seem too bad. With innodb_log_file_buffering=ON (without O_DIRECT) and with a rather small innodb_log_file_size this interface seems to be faster, because the read-before-write can be satisfied from the operating system kernel’s block cache. Also with innodb_log_file_buffering=OFF, the page-fault penalty compared to plain pwrite does not seem to be too bad in this particular test.

            Note: for reads (such as in mariadb-backup or in InnoDB recovery), I don’t think that there is any question whether enabling the mmap interface makes sense.


            I reran a variation of my backup benchmark after retaining the MADV_DONTNEED but removing the FALLOC_FL_ZERO_RANGE from the log checkpoint:

            sysbench /usr/share/sysbench/oltp_update_index.lua --mysql-socket="$MYSQL_SOCK" --mysql-user="$MYSQL_USER" --mysql-db=test \
            --tables=64 --table_size=100000 prepare &
            sleep 10
            mariadb-backup -u"$MYSQL_USER" --socket="$MYSQL_SOCK" --backup --target-dir=/dev/shm/backup
            

            On my system, the backup would complete while the 6th or 7th table is being loaded. This is with

            sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
            

            and https://github.com/akopytov/sysbench/pull/419/ applied.

            The server is set up so that only the ib_logfile0 resides on the same HDD as in my previous comment; the rest of the data directory is in /dev/shm. The idea is to emphasize the log file operations. This is using the default innodb_log_buffer_size, and the innodb_log_file_size as well as innodb_buffer_pool_size are large enough so that there should be no data page writes during the workload.

            server                server                     backup                backup   backup
            innodb_log_file_mmap  innodb_log_file_buffering  innodb_log_file_mmap  time/ms  LSN
            OFF                   OFF                        OFF                   4495     196,412,051
            OFF                   OFF                        ON                    4155     196,413,992
            OFF                   ON                         OFF                   2873     168,020,291
            OFF                   ON                         ON                    649      170,372,505
            ON                    OFF                        OFF                   1686     194,734,268
            ON                    OFF                        ON                    639      170,371,573
            ON                    ON                         OFF                   1686     170,372,963
            ON                    ON                         ON                    4035     196,141,694

            We can see that the biggest time savings are obtained when innodb_log_file_mmap is enabled both on the server and in the backup during the test. However, this does not yet demonstrate the original problem, where mariadb-backup --backup was practically hanging in a busy loop trying to parse gigabytes of log through the tiny innodb_log_buffer_size=2m peephole. Let me add a longer sleep after initiating the sysbench prepare, so that there will be more tables to copy and we can see the originally reported problem.

            A test with sleep 100 would interrupt the data load around table 60; with mmap and the file system cache enabled, the backup would complete in 5196 ms, proceeding to LSN 1,701,067,693.

            Let me use a shorter sleep 60 in order to make the originally reported scenario (without mmap) a little less excessive and to let the test complete a little faster. The data load is interrupted somewhere between the 35th and the 40th table:

            server                server                     backup                backup   backup
            innodb_log_file_mmap  innodb_log_file_buffering  innodb_log_file_mmap  time/ms  LSN
            OFF                   OFF                        OFF                   20212    1,135,622,987
            OFF                   OFF                        ON                    3587     990,945,004
            OFF                   ON                         OFF                   3444     1,019,715,163
            OFF                   ON                         ON                    3499     1,019,722,199
            ON                    OFF                        OFF                   18793    1,104,883,436
            ON                    OFF                        ON                    3567     990,947,981
            ON                    ON                         OFF                   3498     1,019,620,980
            ON                    ON                         ON                    3606     1,022,071,879

            We should keep in mind that because this is DDL workload, the backup locks will block the sysbench prepare while it is creating the next table or index. Nevertheless, there is some variation in the end LSN that was reached, based on the server’s log writing speed.

            Based on this, it seems to me that accessing a file via mmap in any process will cause the file to remain in the file system cache. The database was reinitialized between individual runs. Whether the server writes log via mmap does not seem to have much impact on this.

            By far, the slowest cases are when the server is using O_DIRECT writes to ib_logfile0 and no mmap is being used. We should also keep in mind that we observe the fastest server write performance when O_DIRECT writes to ib_logfile0 are enabled. So, enabling mmap reads in backup seems to make sense.

            This test was conducted on ext4fs and Linux 6.8.9.

            marko Marko Mäkelä added a comment

            RQG testing on origin/10.11-MDEV-34062 3d2ec55fd92078b4023ec39af50fcd30aaf8667c 2024-05-30T12:52:26+03:00

            Setup:
            -   innodb_log_file_mmap with on/off for DB server/mariadb-backup
            -   filesystems used for storage
                - ext4 located on HDD
                - ext4 within some container located in tmpfs
            Scenario used in all test runs:
            1. Start the source DB server
            2. Several connections run a DDL/DML mix concurrently
            3. While 2. is ongoing, run in a loop:
                 3.1 mariadb-backup --backup <to location A>
                 3.2 mariadb-backup --prepare <on location A>
                 3.3 Attempt to start a DB server on location A and run some integrity checks
            Results:
            - a significant fraction of the tests storing data on the HDD failed during bootstrap because they exceeded some RQG builtin timeout
               Reason: The HDD was overloaded, and the timeout is tuned rather for data stored on filesystems in RAM. ==> No error in the DB server or mariadb-backup.
            - pluto:/data/results/1717145612/MB-2134
               mariadb-backup --prepare aborts after [ERROR] [FATAL] InnoDB: Page [page id: space=0, page number=622] still fixed or dirty
            (rr) bt
            #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140727483848960) at ./nptl/pthread_kill.c:44
            #1  __pthread_kill_internal (signo=6, threadid=140727483848960) at ./nptl/pthread_kill.c:78
            #2  __GI___pthread_kill (threadid=140727483848960, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
            #3  0x00006a2639e19476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
            #4  0x00006a2639dff7f3 in __GI_abort () at ./stdlib/abort.c:79
            #5  0x00005654f22c7390 in ib::fatal::~fatal (this=this@entry=0x7ffdabae6b50, __in_chrg=<optimized out>) at /data/Server/10.11-MDEV-34062/storage/innobase/ut/ut0ut.cc:488
            #6  0x00005654f1f2141a in buf_pool_t::assert_all_freed (this=this@entry=0x5654f308c680 <buf_pool>) at /data/Server/10.11-MDEV-34062/storage/innobase/buf/buf0buf.cc:3986
            #7  0x00005654f1f21470 in buf_pool_invalidate () at /data/Server/10.11-MDEV-34062/storage/innobase/buf/buf0buf.cc:4009
            #8  0x00005654f20fd594 in recv_sys_t::apply (this=this@entry=0x5654f3098080 <recv_sys>, last_batch=last_batch@entry=false) at /data/Server/10.11-MDEV-34062/storage/innobase/log/log0recv.cc:3997
            #9  0x00005654f2118199 in recv_sys_t::parse<recv_buf, true> (this=this@entry=0x5654f3098080 <recv_sys>, l=..., if_exists=if_exists@entry=true) at /data/Server/10.11-MDEV-34062/storage/innobase/log/log0recv.cc:2944
            #10 0x00005654f211894e in recv_sys_t::parse_mtr<true> (if_exists=if_exists@entry=true) at /data/Server/10.11-MDEV-34062/storage/innobase/log/log0recv.cc:3112
            #11 0x00005654f211aea2 in recv_sys_t::parse_mmap<true> (if_exists=if_exists@entry=true) at /data/Server/10.11-MDEV-34062/storage/innobase/log/log0recv.cc:3123
            #12 0x00005654f21029cc in recv_scan_log (last_phase=last_phase@entry=true) at /data/Server/10.11-MDEV-34062/storage/innobase/log/log0recv.cc:4161
            #13 0x00005654f21037e0 in recv_recovery_from_checkpoint_start () at /data/Server/10.11-MDEV-34062/storage/innobase/log/log0recv.cc:4678
            #14 0x00005654f2270fb2 in srv_start (create_new_db=create_new_db@entry=false) at /data/Server/10.11-MDEV-34062/storage/innobase/srv/srv0start.cc:1458
            #15 0x00005654f16f9935 in innodb_init () at /data/Server/10.11-MDEV-34062/extra/mariabackup/xtrabackup.cc:2627
            #16 0x00005654f170b768 in xtrabackup_prepare_func (argv=argv@entry=0x5654f4a7ab98) at /data/Server/10.11-MDEV-34062/extra/mariabackup/xtrabackup.cc:6819
            #17 0x00005654f17135c6 in main_low (argv=0x5654f4a7ab98) at /data/Server/10.11-MDEV-34062/extra/mariabackup/xtrabackup.cc:7754
            #18 0x00005654f1713819 in main (argc=10, argv=0x7ffdabaf0688) at /data/Server/10.11-MDEV-34062/extra/mariabackup/xtrabackup.cc:7538
            (rr)
            _RR_TRACE_DIR=./1/rr rr replay --mark-stdio                  # Fate of the source DB server
            _RR_TRACE_DIR=./1/rr rr replay --mark-stdio mariadb-backup-0 # mariadb-backup --backup
                                                                         # My "rr replay" failed, maybe because of
                                                                         # - a problem with rr, or
                                                                         # - the RQG test campaign running concurrently consuming too many resources
            _RR_TRACE_DIR=./1_clone/rr rr replay --mark-stdio            # mariadb-backup --prepare
            

            mleich Matthias Leich added a comment

            mleich, thank you. As far as I understand, none of the observed problems are directly related to this change. I filed MDEV-34307 for the still fixed or dirty crash.

            marko Marko Mäkelä added a comment

            I have an idea how to check for the read-before-write anomaly:

            perf record -e block:block_rq_issue -e block:block_rq_complete -a sleep 120
            

            The perf report would show R for any reads. I just tested this, "storing" the data directory in /dev/shm and pointing innodb_log_group_home_dir to actual storage. There were only W events with the two different innodb_log_file_size settings that I tested. I assume that the larger log file did not have a chance to wrap around (start overwriting itself) and that the memory-mapped buffer for the smaller log file was never swapped out. It seems that I would need a longer test, or to repeat this test on a machine with less RAM.

            I also have an idea how to avoid both page faults and the related read-before-write: Issue a pwrite to zero out the next page in the underlying file. Yes, this will incur a system call overhead, but the kernel should allocate a page and update the page table, so that on the next access from the user space the page will already be valid, and no read from the storage will be necessary.

            marko Marko Mäkelä added a comment

            I updated https://github.com/MariaDB/server/pull/3282; there had been some conflicts due to MDEV-33894. I did not yet test the above pwrite idea. On Microsoft Windows, the memory-mapped interface would be disabled by default also in mariadb-backup.

            marko Marko Mäkelä added a comment

            I tried again to reproduce the read-before-write anomaly. If I invoke perf record -e major-faults for the experiment where only the ib_logfile0 resides in persistent storage, then 100% of the samples will be attributed to __memmove_avx_unaligned_erms_rtm or mtr_t::finish_writer inside mtr_t::commit. The faults inside mtr_t::finish_writer are for writing the sequence bit or the CRC-32C.

            If I reduce the size of the workload so that the end LSN will be significantly below innodb_log_file_size (the file will not wrap around), there will still be major-faults for the log writes. I tested further with innodb_log_file_buffering=ON (not enabling O_DIRECT access) and changing innodb_log_file_size at server startup. Also in this case, major-faults were recorded at exactly the same spots.

            To test whether some pwrite (of garbage from the buffer pool) would prevent the major faults, I applied the following quick and dirty patch, which is based on my earlier fallocate(FALLOC_FL_ZERO_RANGE) attempt that I had found to be detrimental to performance:

            diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
            index 6ff7a64409e..fac13ce5064 100644
            --- a/storage/innobase/buf/buf0flu.cc
            +++ b/storage/innobase/buf/buf0flu.cc
            @@ -1942,21 +1942,29 @@ inline void log_t::write_checkpoint(lsn_t end_lsn) noexcept
                 const size_t start_offset{calc_lsn_offset(checkpoint_lsn) & ~(ps - 1)};
                 if (offset == start_offset);
                 else if (offset < start_offset)
            +    {
                   IF_WIN(VirtualAlloc(buf + offset, start_offset - offset,
                                       MEM_RESET, PAGE_READWRITE),
                          madvise(reinterpret_cast<char*>(buf) + offset,
                                  start_offset - offset, MADV_DONTNEED));
            +      pwrite(log.m_file, buf_pool.chunks->mem, start_offset - offset, offset);
            +    }
                 else
                 {
                   IF_WIN(VirtualAlloc(buf + START_OFFSET, start_offset - START_OFFSET,
                                       MEM_RESET, PAGE_READWRITE),
                          madvise(reinterpret_cast<char*>(buf) + START_OFFSET,
                                  start_offset - START_OFFSET, MADV_DONTNEED));
            +      pwrite(log.m_file, buf_pool.chunks->mem, start_offset - START_OFFSET,
            +             START_OFFSET);
                   offset+= ps;
                   if (size_t end_size= file_size - offset)
            +      {
                     IF_WIN(VirtualAlloc(buf + offset, end_size, MEM_RESET, PAGE_READWRITE),
                            madvise(reinterpret_cast<char*>(buf) + offset, end_size,
                                    MADV_DONTNEED));
            +        pwrite(log.m_file, buf_pool.chunks->mem, end_size, offset);
            +      }
                 }
               }
             #endif
            

            I still got roughly the same number of major-faults events during the 2-minute run (about 40,000), all of them in mtr_t::commit. This was without any O_DIRECT and with log resizing at the start of the run:

            10.11-MDEV-34062 e8dfe1fcc695baa61ac6ed263b22cca88fe0ff92 and the above

            2024-08-01 16:49:58 0 [Note] InnoDB: Memory-mapped log (block size=512 bytes)
            2024-08-01 16:49:58 0 [Note] InnoDB: End of log at LSN=46830
            2024-08-01 16:49:58 0 [Note] InnoDB: Resizing redo log from 1.000GiB to 2.000GiB; LSN=46830
            2024-08-01 16:49:58 0 [Note] InnoDB: Memory-mapped log (block size=512 bytes)
            

            What I have been unable to conclusively prove or disprove so far is an actual read from the block-level storage when a new page of the log buffer is being written. That said, the least risky option could be to limit memory-mapped access to reading the log in mariadb-backup or during InnoDB startup. On 64-bit systems, InnoDB could be started with memory-mapped log recovery, but at the end of recovery it would switch to regular log file I/O.

            marko Marko Mäkelä added a comment

            Testing this is currently blocked by MDEV-34750.

            marko Marko Mäkelä added a comment

            Clarification: Initially, I was experimenting with allowing memory-mapped log writes. But, to fix this regression we would only need log reads to be memory-mapped. On 2024-08-05 I had revised the suggested fix accordingly. By default, we would use memory-mapped read in InnoDB recovery and in mariadb-backup on 64-bit Linux and FreeBSD. On other systems, it would have to be enabled explicitly by a new option innodb_log_file_mmap=ON. Log writes would not be memory-mapped except on 64-bit Linux if persistent memory is present or being emulated on /dev/shm.

            The regression occurs because, starting with MDEV-14425, the log is no longer divided into fixed-size (512-byte) blocks, but into mini-transaction-sized blocks. Therefore, upon reading a number of blocks from the ib_logfile0, we cannot quickly compute checksums of 512-byte blocks, but must parse the individual log records and verify the checksum of each individual mini-transaction. A small innodb_log_buffer_size would force frequent costly re-parsing. Before MDEV-14425, the buffer size for reading 512-byte blocks was hard-coded as 4*innodb_page_size (by default, 64 KiB). That was not a problem, because there was no need to partially re-read the same blocks over and over again.

            With a memory-mapped log file read interface, which has been covered on our CI thanks to the PMEM emulation on /dev/shm, the log parsing buffer size is basically infinite, and no costly re-parsing will be necessary.

            marko Marko Mäkelä added a comment
            marko Marko Mäkelä added a comment - edited

            There was a possibility that simplifying the log record parsing function in https://github.com/MariaDB/server/pull/3507 (originally filed as something simpler that would only fix a bogus debug assertion MDEV-34907 instead of removing an entire code branch) would make the parsing fast enough even with the minimal innodb_log_buffer_size=2m.

            It turns out that simplifying the log parsing function is not sufficient for addressing this. I tested a combined branch that contains both fixes and invoked the scenario that I had written down in my first 2024-05-21 comment in this ticket:

            sysbench /usr/share/sysbench/oltp_update_index.lua --mysql-socket="$MYSQL_SOCK" --mysql-user="$MYSQL_USER" --mysql-db=test \
            --tables=64 --table_size=100000 prepare &
            mariadb-backup --innodb-log-buffer-size=2m --innodb-log-file-mmap=off -u"$MYSQL_USER" --socket="$MYSQL_SOCK" --backup --target-dir=/dev/shm/backup
            

            In an invocation like this, it still takes an extremely long time to copy the log, long after the sysbench prepare completed:

            [00] 2024-09-25 16:18:53 Waiting for log copy thread to read lsn 1816982998
            [00] 2024-09-25 16:19:05 DDL tracking : create 6 "./test/sbtest2.ibd"
            [00] 2024-09-25 16:19:18 DDL tracking : create 7 "./test/sbtest3.ibd"
            …
            [00] 2024-09-25 16:25:26 DDL tracking : create 34 "./test/sbtest30.ibd"
            [00] 2024-09-25 16:25:40 DDL tracking : create 35 "./test/sbtest31.ibd"
            [00] 2024-09-25 16:25:54 DDL tracking : create 36 "./test/sbtest32.ibd"
            …
            [00] 2024-09-25 16:28:51 DDL tracking : create 49 "./test/sbtest45.ibd"
            [00] 2024-09-25 16:29:05 DDL tracking : create 50 "./test/sbtest46.ibd"
            [00] 2024-09-25 16:29:18 DDL tracking : create 51 "./test/sbtest47.ibd"
            …
            [00] 2024-09-25 16:31:07 DDL tracking : create 59 "./test/sbtest55.ibd"
            [00] 2024-09-25 16:31:21 DDL tracking : create 60 "./test/sbtest56.ibd"
            [00] 2024-09-25 16:31:35 DDL tracking : create 61 "./test/sbtest57.ibd"
            …
            

            If I specify an innodb_log_buffer_size of 1 GiB or do not disable the use of memory-mapped log reads (which are implemented in https://github.com/MariaDB/server/pull/3282), the backup will finish basically instantly.


            On a positive note, the backup was completed successfully:

            [00] 2024-09-25 16:33:25 Copying ./test/sbtest64.ibd to /dev/shm/backup/test/sbtest64.new
            [00] 2024-09-25 16:33:25         ...done
            [00] 2024-09-25 16:33:25 Start copying aria log file tail: /dev/shm/duuba//aria_log.00000001
            [00] 2024-09-25 16:33:25 Stop copying aria log file tail: /dev/shm/duuba//aria_log.00000001, copied 0 bytes
            [00] 2024-09-25 16:33:25 BACKUP STAGE BLOCK_COMMIT
            [00] 2024-09-25 16:33:25 Copied file ./mysql/slow_log.CSM for log table `mysql`.`slow_log`, 35 bytes
            [00] 2024-09-25 16:33:25 Copied file ./mysql/slow_log.CSV for log table `mysql`.`slow_log`, 0 bytes
            [00] 2024-09-25 16:33:25 Copied file ./mysql/general_log.CSM for log table `mysql`.`general_log`, 35 bytes
            [00] 2024-09-25 16:33:25 Copied file ./mysql/general_log.CSV for log table `mysql`.`general_log`, 0 bytes
            [00] 2024-09-25 16:33:25 mariabackup: The latest check point (for incremental): '47149'
            [00] 2024-09-25 16:33:25 Waiting for log copy thread to read lsn 18446744073709551615
            [00] 2024-09-25 16:33:30 Start copying statistics aria tables.
            [00] 2024-09-25 16:33:30 aria table file ./mysql/index_stats.MAI is copied successfully.
            [00] 2024-09-25 16:33:30 aria table file ./mysql/index_stats.MAD is copied successfully.
            [00] 2024-09-25 16:33:30 aria table file ./mysql/column_stats.MAI is copied successfully.
            [00] 2024-09-25 16:33:30 aria table file ./mysql/column_stats.MAD is copied successfully.
            [00] 2024-09-25 16:33:30 aria table file ./mysql/table_stats.MAI is copied successfully.
            [00] 2024-09-25 16:33:30 aria table file ./mysql/table_stats.MAD is copied successfully.
            [00] 2024-09-25 16:33:30 Stop copying statistics aria tables.
            [00] 2024-09-25 16:33:30 Start copying aria log file tail: /dev/shm/duuba//aria_log.00000001
            [00] 2024-09-25 16:33:30 Stop copying aria log file tail: /dev/shm/duuba//aria_log.00000001, copied 8192 bytes
            [00] 2024-09-25 16:33:30 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
            [00] 2024-09-25 16:33:30 BACKUP STAGE END
            [00] 2024-09-25 16:33:30 Executing BACKUP STAGE END
            [00] 2024-09-25 16:33:30 All tables unlocked
            [00] 2024-09-25 16:33:30 Backup created in directory '/dev/shm/backup/'
            [00] 2024-09-25 16:33:30 Writing backup-my.cnf
            [00] 2024-09-25 16:33:30         ...done
            [00] 2024-09-25 16:33:30 Writing xtrabackup_info
            [00] 2024-09-25 16:33:30         ...done
            [00] 2024-09-25 16:33:30 Redo log (from LSN 47149 to 1816982998) was copied.
            [00] 2024-09-25 16:33:30 completed OK!
            

            I think that it is a bit excessive that copying 3.6 GiB of files from RAM disk takes some 15 minutes, while the actual server workload (creating and populating a few tables) was completed in less than a minute.

            marko Marko Mäkelä added a comment

            People

              Assignee: marko Marko Mäkelä
              Reporter: marko Marko Mäkelä