MariaDB Server / MDEV-34062

mariadb-backup --backup is extremely slow at copying ib_logfile0

Details

    Description

      While preparing to test MDEV-33817 with a recovery-type workload (MDEV-29911), I noticed that in MariaDB Server 10.11 (but not 10.6), copying the InnoDB write-ahead log (ib_logfile0) takes an extremely long time, proceeding at no more than 3 MiB/s from a RAM disk. Here are some perf record samples from my local system:

      10.11 9dfef3fb4178e44b3f3da029995ef11df8360235 on Debian linux-image-6.7.12-amd64 (2024-04-24)

        44,19%  mariadb-backup  mariadb-backup     [.] recv_sys_t::parse_mtr_result recv_sys_t::parse<recv_buf, false>(recv_buf&, bool)
        12,35%  mariadb-backup  [kernel.kallsyms]  [k] rep_movs_alternative
         8,14%  mariadb-backup  mariadb-backup     [.] crc32c_3way
         7,12%  mariadb-backup  [kernel.kallsyms]  [k] clear_page_erms
         1,81%  mariadb-backup  [kernel.kallsyms]  [k] filemap_get_entry
         1,78%  mariadb-backup  [kernel.kallsyms]  [k] shmem_file_read_iter
         1,60%  mariadb-backup  [kernel.kallsyms]  [k] mem_cgroup_commit_charge
         1,10%  mariadb-backup  [kernel.kallsyms]  [k] get_mem_cgroup_from_mm
         1,10%  mariadb-backup  [kernel.kallsyms]  [k] shmem_get_folio_gfp
         0,96%  mariadb-backup  [kernel.kallsyms]  [k] xas_descend
         0,92%  mariadb-backup  [kernel.kallsyms]  [k] get_page_from_freelist
         0,92%  mariadb-backup  mariadb-backup     [.] recv_sys_t::parse_mtr_result recv_sys_t::parse_mtr<false>(bool)
         0,86%  mariadb-backup  [kernel.kallsyms]  [k] try_charge_memcg
         0,80%  mariadb-backup  mariadb-backup     [.] my_crc32c
         0,74%  mariadb-backup  [kernel.kallsyms]  [k] __list_del_entry_valid_or_report
         0,61%  mariadb-backup  [kernel.kallsyms]  [k] __rmqueue_pcplist
         0,55%  mariadb-backup  [kernel.kallsyms]  [k] lru_gen_add_folio
         0,55%  mariadb-backup  [kernel.kallsyms]  [k] _copy_to_iter
         0,50%  mariadb-backup  [kernel.kallsyms]  [k] __rcu_read_unlock
         0,49%  mariadb-backup  [kernel.kallsyms]  [k] alloc_pages_mpol
         0,49%  mariadb-backup  mariadb-backup     [.] xtrabackup_copy_logfile()
      

      The problem appears to be the default value of innodb_log_buffer_size (log_sys.buf_size), which is only 2 MiB. If I specify --innodb-log-buffer-size=1g, then xtrabackup_copy_logfile() will process much more data per system call, the backup will finish in reasonable time during my Sysbench workload, and the top of the perf report will look more reasonable as well:

      10.11 9dfef3fb4178e44b3f3da029995ef11df8360235

        31,06%  mariadb-backup  mariadb-backup        [.] recv_sys_t::parse_mtr_result recv_sys_t::parse<recv_buf, false>(recv_buf&, bool)
        20,14%  mariadb-backup  [kernel.kallsyms]     [k] rep_movs_alternative
        10,01%  mariadb-backup  [kernel.kallsyms]     [k] clear_page_erms
         5,72%  mariadb-backup  mariadb-backup        [.] crc32c_3way
         3,86%  mariadb-backup  libc.so.6             [.] __memmove_avx_unaligned_erms_rtm
         1,58%  mariadb-backup  [kernel.kallsyms]     [k] filemap_get_entry
         1,57%  mariadb-backup  libc.so.6             [.] __memcmp_avx2_movbe_rtm
         1,50%  mariadb-backup  [kernel.kallsyms]     [k] shmem_get_folio_gfp
         1,40%  mariadb-backup  [kernel.kallsyms]     [k] xas_descend
         1,04%  mariadb-backup  [kernel.kallsyms]     [k] get_mem_cgroup_from_mm
         1,02%  mariadb-backup  [kernel.kallsyms]     [k] __mem_cgroup_charge
         0,98%  mariadb-backup  [kernel.kallsyms]     [k] __rmqueue_pcplist
         0,96%  mariadb-backup  [kernel.kallsyms]     [k] shmem_file_read_iter
         0,93%  mariadb-backup  [kernel.kallsyms]     [k] fault_in_readable
         0,92%  mariadb-backup  [kernel.kallsyms]     [k] get_page_from_freelist
         0,78%  mariadb-backup  mariadb-backup        [.] recv_sys_t::parse_mtr_result recv_sys_t::parse_mtr<false>(bool)
         0,68%  mariadb-backup  [kernel.kallsyms]     [k] xas_load
         0,59%  mariadb-backup  [kernel.kallsyms]     [k] xas_start
         0,58%  mariadb-backup  [kernel.kallsyms]     [k] try_charge_memcg
         0,57%  mariadb-backup  [kernel.kallsyms]     [k] _copy_to_iter
         0,52%  mariadb-backup  [kernel.kallsyms]     [k] copy_page_from_iter_atomic
         0,52%  mariadb-backup  [kernel.kallsyms]     [k] lru_gen_add_folio
         0,50%  mariadb-backup  [kernel.kallsyms]     [k] __alloc_pages
         0,44%  mariadb-backup  [kernel.kallsyms]     [k] shmem_add_to_page_cache
         0,40%  mariadb-backup  [kernel.kallsyms]     [k] generic_perform_write
         0,39%  mariadb-backup  [kernel.kallsyms]     [k] __mod_memcg_lruvec_state
         0,38%  mariadb-backup  [kernel.kallsyms]     [k] copy_page_to_iter
         0,35%  mariadb-backup  [kernel.kallsyms]     [k] lru_add_fn
         0,34%  mariadb-backup  [kernel.kallsyms]     [k] __rcu_read_lock
         0,34%  mariadb-backup  mariadb-backup        [.] xtrabackup_copy_logfile()
      

      The large share of kernel functions should be due to file I/O. This was on /dev/shm. The function xtrabackup_copy_logfile() always invokes pread system calls to read ib_logfile0.

      Before MDEV-14425, the default value of innodb_log_buffer_size in mariadb-backup was 1 MiB, but as far as I understand, the parameter was being ignored and the log was apparently being read RECV_SCAN_SIZE bytes at a time, that is, 4*innodb_page_size (by default, 64 KiB). This must have been acceptable when the log file was organized in blocks of 512 bytes. With the variable-size logical block format that was introduced in MDEV-14425, we would apparently be better off reading larger chunks at a time.

      I retested this on the 72-thread test system where I originally hit the problem yesterday. At innodb_log_buffer_size=128m the backup would manage to copy the log, but the operation would take longer than 40 seconds, at which point the server had been shut down by the benchmark script.

      [00] FATAL ERROR: 2024-05-02 23:10:26 failed to execute query BACKUP STAGE BLOCK_COMMIT: Server has gone away
      

      At innodb_log_buffer_size=256m it is better:

      [00] 2024-05-02 23:12:55 Finished backing up non-InnoDB tables and files
      [00] 2024-05-02 23:12:55 Waiting for log copy thread to read lsn 2039243060
      …
      [00] 2024-05-02 23:13:03 Redo log (from LSN 46894 to 2645660131) was copied.
      [00] 2024-05-02 23:13:03 completed OK!
      

      On my local system (dual socket Xeon E5-2630 v4, 40 concurrent Sysbench connections to match the 2×2×10 hardware threads), the backup can barely keep up with innodb_log_buffer_size=32m, finishing in 42 seconds. With innodb_log_buffer_size=128m it finishes in 11 seconds. Note that this is an inaccurate test setup; I was manually starting the backup roughly 10 seconds into the write workload, and the InnoDB data files are growing during this workload.

      Possibly we should increase the default value of innodb_log_buffer_size, based on some more testing in various environments (NVMe, SSD, HDD of different speeds). My test setting is completely artificial; the /dev/shm/*/ib_logfile0 is written by the server via the memory-mapped log_sys.buf while the backup uses regular file system calls. The original motivation of this test was to measure the performance of a subsequent test (mariadb-backup --prepare).

      This observation is related to MDEV-30000 and MDEV-31410 too. It could be a good idea to SET GLOBAL innodb_log_file_buffering=ON in the server during the backup, and specify a larger innodb_log_buffer_size.

      Ultimately, we should implement the log copying in the server process (MDEV-14992).

      Attachments

        Issue Links

          Activity

            marko Marko Mäkelä added a comment - - edited

             This may also affect the execution time of mariadb-backup --prepare. In a test that I just conducted on my local system, specifying a log buffer of 1 GiB instead of the default 2 MiB would increase the time needed to process a 6.2 GiB ib_logfile0 from /dev/shm by 3 seconds. It is hard to tell if this was due to the log parsing or some other bottlenecks (see MDEV-34070).

            It occurred to me that maybe mariadb-backup should simply attempt to mmap with MAP_SHARED and PROT_READ to read the log file. In that way, the parameter innodb_log_buffer_size would only be applicable to 32-bit systems or cases where mmap fails with an error such as ENODEV or ENOMEM. A memory-mapped interface could allow mariadb-backup to access the log file contents straight from the operating system’s block cache, without invoking any system calls.
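            The mmap idea could be sketched roughly as follows; the helper names are my own, and the error handling is reduced to the fallback decision:

```cpp
// Sketch of the idea above (an illustration, not the actual patch):
// try to map ib_logfile0 read-only with MAP_SHARED, and fall back to
// pread() into a buffer of innodb_log_buffer_size bytes when mmap()
// fails, e.g. with ENODEV on filesystems that do not support mapping,
// or ENOMEM in a 32-bit address space.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

// Returns a read-only mapping of the whole file, or MAP_FAILED.
void *map_log(int fd, size_t *size)
{
  struct stat st;
  if (fstat(fd, &st) != 0)
    return MAP_FAILED;
  *size = size_t(st.st_size);
  return mmap(nullptr, *size, PROT_READ, MAP_SHARED, fd, 0);
}

void read_log(int fd)
{
  size_t size;
  void *log = map_log(fd, &size);
  if (log == MAP_FAILED)
  {
    // fall back to the pread() loop over a log_sys.buf_size buffer
    return;
  }
  // parse the log straight from the kernel's block cache,
  // without any further system calls
  munmap(log, size);
}
```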


            As far as I understand, each time when mariadb-backup --backup reaches the end of the memory-mapped log, at an LSN that is less than the backup target LSN, it must invoke msync() with MS_INVALIDATE in order to observe the latest writes to the ib_logfile0 by the mariadbd process.

            Related to this, I plan to experiment with an innodb_log_file_buffering=mmap option in the server. This new option value would make us use mmap for writing the log. Unlike the PMEM-based log, this would require explicit msync() calls, both with MS_ASYNC (roughly equivalent to an asynchronous pwrite(), kind of MDEV-26603) and with MS_SYNC (roughly equivalent to a combination of pwrite() and fdatasync()).
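            A rough sketch of the msync() protocol described above; the function names and the offset arithmetic are my own illustration under these assumptions, not the server code:

```cpp
// Illustrative sketch of the msync() calls discussed above: MS_SYNC
// for durable writes on the server side and MS_INVALIDATE for the
// backup reader observing the tail of the mapped log.
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

// Writer (mariadbd): append a record to the mapped log and make it
// durable. msync() requires a page-aligned starting address, so round
// the dirty range down to a page boundary.
int log_write_durable(char *map, size_t ofs, const void *rec,
                      size_t len, size_t pagesize)
{
  memcpy(map + ofs, rec, len);
  char *page = map + (ofs & ~(pagesize - 1));
  size_t n = len + (ofs & (pagesize - 1));
  // MS_ASYNC would merely schedule the write-back (roughly an
  // asynchronous pwrite()); MS_SYNC returns only once the data is
  // on disk (roughly pwrite() followed by fdatasync()).
  return msync(page, n, MS_SYNC);
}

// Reader (mariadb-backup): before re-reading the tail of the mapped
// log at an LSN below the backup target LSN, invalidate possibly
// stale cached pages so the server's latest writes become visible.
int log_refresh_tail(char *map, size_t ofs, size_t len,
                     size_t pagesize)
{
  char *page = map + (ofs & ~(pagesize - 1));
  return msync(page, len + (ofs & (pagesize - 1)), MS_INVALIDATE);
}
```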

            marko Marko Mäkelä added a comment

            For reproducing the slowness, it suffices to start a backup during a larger sysbench prepare:

            sysbench /usr/share/sysbench/oltp_update_index.lua --mysql-socket="$MYSQL_SOCK" --mysql-user="$MYSQL_USER" --mysql-db=test \
            --tables=64 --table_size=100000 prepare &
            mariadb-backup --innodb-log-buffer-size=1g -u"$MYSQL_USER" --socket="$MYSQL_SOCK" --backup --target-dir=/dev/shm/backup
            

            If I omit the innodb_log_buffer_size, the Sysbench DDL workload (creating and populating tables) will be paused for an extremely long time, and the mariadb-backup log will only very slowly show some progress in copying the log:

            10.11 f01e6503f4151a8d3d3f0495dab22c47ec794e67

            2024-05-21  8:13:50 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
            [00] 2024-05-21 08:13:50 mariabackup: Generating a list of tablespaces
            [00] 2024-05-21 08:13:51 DDL tracking : create 8 "./test/sbtest1.ibd"
            [00] 2024-05-21 08:13:51 Connecting to MariaDB server host: localhost, user: root, password: not set, port: not set, socket: /dev/shm/sbtest/mariadb.sock
            [00] 2024-05-21 08:13:51 Connecting to MariaDB server host: localhost, user: root, password: not set, port: not set, socket: /dev/shm/sbtest/mariadb.sock
            [00] 2024-05-21 08:13:51 BACKUP STAGE START
            [00] 2024-05-21 08:13:51 Acquiring BACKUP LOCKS...
            …
            [00] 2024-05-21 08:13:51 BACKUP STAGE FLUSH
            …
            [00] 2024-05-21 08:13:51 Finished backing up non-InnoDB tables and files
            [00] 2024-05-21 08:13:51 Waiting for log copy thread to read lsn 751079805
            [00] 2024-05-21 08:14:03 DDL tracking : create 9 "./test/sbtest2.ibd"
            [00] 2024-05-21 08:14:16 DDL tracking : create 10 "./test/sbtest3.ibd"
            

            With the larger innodb_log_buffer_size, the entire backup completes in 7 seconds, even though there is much more log to copy:

            10.11 f01e6503f4151a8d3d3f0495dab22c47ec794e67

            2024-05-21  8:15:24 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
            [00] 2024-05-21 08:15:24 mariabackup: Generating a list of tablespaces
            [00] 2024-05-21 08:15:25 DDL tracking : create 8 "./test/sbtest1.ibd"
            [00] 2024-05-21 08:15:25 DDL tracking : create 9 "./test/sbtest2.ibd"
            …
            [00] 2024-05-21 08:15:31 Redo log (from LSN 47562 to 1752320328) was copied.
            

            Note: I am using a large enough innodb_log_file_size and innodb_buffer_pool_size on the server, so that no page flushing needs to take place. The checkpoint LSN 47562 is from the shutdown of the server bootstrap.

            marko Marko Mäkelä added a comment

            With mmap-based log parsing, it is quite a bit faster while using less memory:

            10.11 f01e6503f4151a8d3d3f0495dab22c47ec794e67 with patch

            [00] 2024-05-21 15:38:46 mariabackup: Generating a list of tablespaces
            [00] 2024-05-21 15:38:47 DDL tracking : create 8 "./test/sbtest1.ibd"
            [00] 2024-05-21 15:38:47 DDL tracking : create 9 "./test/sbtest2.ibd"
            …
            [00] 2024-05-21 15:38:51 Copying ./test/sbtest63.ibd to /dev/shm/bu/test/sbtest63.new
            …
            [00] 2024-05-21 15:38:52 Redo log (from LSN 47514 to 1780641452) was copied.
            

            This is not yet the final version. Even though this copied 28,321,172 bytes more log than the file I/O based experiment with innodb_log_buffer_size=1g, the operation completed 1 second faster.

            marko Marko Mäkelä added a comment

            It seems that at least on Linux 6.8.9, mmap will be updated automatically when an O_DIRECT written log file resides on ext4fs on an NVMe device. I do not know what exactly msync() with MS_INVALIDATE could be needed for, but apparently it is not really necessary, based on my simple testing.

            marko Marko Mäkelä added a comment

            https://github.com/MariaDB/server/pull/3282 implements a settable Boolean configuration parameter innodb_log_file_mmap that would be set ON by default in mariadb-backup and OFF by default in mariadbd. The use of memory-mapped log writes should allow direct access to the circular log file from the operating system kernel’s block cache, so mariadb-backup would only copy the log once, directly from the memory that backs the server’s ib_logfile0 to the backup copy of the log.

            marko Marko Mäkelä added a comment
            mleich Matthias Leich added a comment - - edited

            origin/10.11-MDEV-34062 617bac1d86f5c3d6351746d7000582a2ce5f9efb 2024-05-29T14:04:00+03:00
            behaved well in RQG testing. No new problems.
            Setup:

            • innodb_log_file_mmap was not set
            • filesystems used for storage
              tmpfs
              ext4 within some container located in tmpfs
            marko Marko Mäkelä added a comment - - edited

            I ran some performance tests on a SATA 3.0 HDD (Western Digital Blue WDC WD20EZRZ-00Z5HB0) with the following parameters:

            cat /sys/block/sdb/queue/hw_sector_size /sys/block/sdb/queue/physical_block_size

            512
            4096
            

            InnoDB is referring to the latter when it comes to pwrite.

            The best test setup that I came up with for this problem is to "store" the database in /dev/shm and only point innodb_log_group_home_dir to the HDD, so that the only real I/O will be for the ib_logfile0.

            With this setup, the fallocate(FALLOC_FL_ZERO_RANGE) calls turn out to actually reduce performance. Here are some results:

            innodb_log_file_buffering  innodb_log_file_mmap                       throughput/tps
            ON                         OFF                                        26758.98
            OFF                        OFF                                        29163.18
            ON                         ON, without MADV_DONTNEED                  26700.93
            ON                         ON, with MADV_DONTNEED, with fallocate     19817.50
            ON                         ON, with MADV_DONTNEED, without fallocate  27504.67
            OFF                        ON, without MADV_DONTNEED                  26891.79
            OFF                        ON, with MADV_DONTNEED, with fallocate     18609.23
            OFF                        ON, with MADV_DONTNEED, without fallocate  28082.70

            The test workload was 60-second Sysbench oltp_write_only, 5×1000 tables, large enough buffer pool, innodb_log_file_size=100m, end LSN>18 times the log file size.

            The results are rather consistent: the fallocate() causes more work for a subsequent fdatasync() (the "these blocks pretend to be all zero" metadata must be updated), and it does not seem to help with the read-before-write problem that we would expect to observe when running with innodb_log_file_buffering=OFF (O_DIRECT). By far, we get the fastest log writes when using pwrite and O_DIRECT. The MADV_DONTNEED, which a ‘well-behaving software citizen’ should invoke, seems to help a little in the innodb_log_file_buffering=ON case.
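            For reference, the two calls being compared could look roughly like this in isolation (my own sketch with made-up helper names, not the actual patch):

```cpp
// FALLOC_FL_ZERO_RANGE zeroes a file range through metadata, which the
// experiment above shows makes a subsequent fdatasync() more expensive;
// MADV_DONTNEED releases mapped log pages that a checkpoint has made
// irrelevant, at the cost of page faults on the next access.
#define _GNU_SOURCE 1
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cerrno>

int zero_log_range(int fd, off_t ofs, off_t len)
{
  // marks the range as zero without writing the zeroes themselves;
  // the deallocation metadata must be flushed by the next fdatasync()
  return fallocate(fd, FALLOC_FL_ZERO_RANGE, ofs, len);
}

int release_log_pages(void *map, size_t len)
{
  // give the cached pages back to the kernel; the next access to the
  // range will fault them in again (read-before-write)
  return madvise(map, len, MADV_DONTNEED);
}
```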

            The conclusion of this experiment is that implementing an mmap interface for log writes will unavoidably suffer from the MDEV-33894 phenomenon, but it does not seem too bad. With innodb_log_file_buffering=ON (without O_DIRECT) and with a rather small innodb_log_file_size this interface seems to be faster, because the read-before-write can be satisfied from the operating system kernel’s block cache. Also with innodb_log_file_buffering=OFF, the page-fault penalty compared to plain pwrite does not seem to be too bad in this particular test.

            Note: for reads (such as in mariadb-backup or in InnoDB recovery), I don’t think that there is any question whether enabling the mmap interface makes sense.


            I reran a variation of my backup benchmark after retaining the MADV_DONTNEED but removing the FALLOC_FL_ZERO_RANGE from the log checkpoint:

            sysbench /usr/share/sysbench/oltp_update_index.lua --mysql-socket="$MYSQL_SOCK" --mysql-user="$MYSQL_USER" --mysql-db=test \
            --tables=64 --table_size=100000 prepare &
            sleep 10
            mariadb-backup -u"$MYSQL_USER" --socket="$MYSQL_SOCK" --backup --target-dir=/dev/shm/backup
            

            On my system, the backup would complete while the 6th or 7th table is being loaded. This is with

            sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
            

            and https://github.com/akopytov/sysbench/pull/419/ applied.

            The server is set up so that only the ib_logfile0 resides on the same HDD as in my previous comment; the rest of the data directory is in /dev/shm. The idea is to emphasize the log file operations. This is using the default innodb_log_buffer_size, and the innodb_log_file_size as well as innodb_buffer_pool_size are large enough so that there should be no data page writes during the workload.

            server                server                     backup                backup   backup
            innodb_log_file_mmap  innodb_log_file_buffering  innodb_log_file_mmap  time/ms  LSN
            OFF                   OFF                        OFF                   4495     196,412,051
            OFF                   OFF                        ON                    4155     196,413,992
            OFF                   ON                         OFF                   2873     168,020,291
            OFF                   ON                         ON                    649      170,372,505
            ON                    OFF                        OFF                   1686     194,734,268
            ON                    OFF                        ON                    639      170,371,573
            ON                    ON                         OFF                   1686     170,372,963
            ON                    ON                         ON                    4035     196,141,694

            We can see that the biggest time savings are obtained when innodb_log_file_mmap is enabled both on the server and in the backup during the test. However, this does not yet demonstrate the original problem, where mariadb-backup --backup was practically hanging in a busy loop trying to parse gigabytes of log through the tiny innodb_log_buffer_size=2m peephole. Let me add a longer sleep after initiating the sysbench prepare, so that there will be more tables to copy and we can see the originally reported problem.

            A test with sleep 100 would interrupt the data load around table 60; with mmap and the file system cache enabled, the backup would complete in 5196 ms, proceeding to LSN 1,701,067,693.

            Let me use a shorter sleep 60 in order to make the originally reported scenario (without mmap) a little less excessive and to let the test complete a little faster. The data load is interrupted somewhere between the 35th and the 40th table:

            server                server                     backup                backup   backup
            innodb_log_file_mmap  innodb_log_file_buffering  innodb_log_file_mmap  time/ms  LSN
            OFF                   OFF                        OFF                   20212    1,135,622,987
            OFF                   OFF                        ON                    3587     990,945,004
            OFF                   ON                         OFF                   3444     1,019,715,163
            OFF                   ON                         ON                    3499     1,019,722,199
            ON                    OFF                        OFF                   18793    1,104,883,436
            ON                    OFF                        ON                    3567     990,947,981
            ON                    ON                         OFF                   3498     1,019,620,980
            ON                    ON                         ON                    3606     1,022,071,879

            We should keep in mind that because this is DDL workload, the backup locks will block the sysbench prepare while it is creating the next table or index. Nevertheless, there is some variation in the end LSN that was reached, based on the server’s log writing speed.

            Based on this, it seems to me that accessing a file via mmap in any process will cause the file to remain in the file system cache. The database was reinitialized between individual runs. Whether the server writes log via mmap does not seem to have much impact on this.

            By far, the slowest cases are when the server is using O_DIRECT writes to ib_logfile0 and no mmap is being used. We should also keep in mind that we observe the fastest server write performance when O_DIRECT writes to ib_logfile0 are enabled. So, enabling mmap reads in backup seems to make sense.

            This test was conducted on ext4fs and Linux 6.8.9.

            marko Marko Mäkelä added a comment

            RQG testing on origin/10.11-MDEV-34062 3d2ec55fd92078b4023ec39af50fcd30aaf8667c 2024-05-30T12:52:26+03:00

            Setup:
            -   innodb_log_file_mmap with on/off for DB server/mariadb-backup
            -   filesystems used for storage
                - ext4 located on HDD
                - ext4 within some container located in tmpfs
            Scenario used in all test runs:
            1. Start the source DB server
            2. Several connections run a DDL/DML mix concurrently
            3. While 2. is ongoing, run in a loop:
                 3.1 mariadb-backup --backup <to location A>
                 3.2 mariadb-backup --prepare <on location A>
                 3.3 Attempt to start a DB server on location A and run some integrity checks
            Results:
            - a significant fraction of the tests storing data on the HDD failed during bootstrap because they exceeded some RQG builtin timeout
               Reason: The HDD was overloaded, and the timeout is tuned rather for data stored on filesystems in RAM. ==> No error in the DB server or mariadb-backup.
            - pluto:/data/results/1717145612/MB-2134
               mariadb-backup --prepare aborts after [ERROR] [FATAL] InnoDB: Page [page id: space=0, page number=622] still fixed or dirty
            (rr) bt
            #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140727483848960) at ./nptl/pthread_kill.c:44
            #1  __pthread_kill_internal (signo=6, threadid=140727483848960) at ./nptl/pthread_kill.c:78
            #2  __GI___pthread_kill (threadid=140727483848960, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
            #3  0x00006a2639e19476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
            #4  0x00006a2639dff7f3 in __GI_abort () at ./stdlib/abort.c:79
            #5  0x00005654f22c7390 in ib::fatal::~fatal (this=this@entry=0x7ffdabae6b50, __in_chrg=<optimized out>) at /data/Server/10.11-MDEV-34062/storage/innobase/ut/ut0ut.cc:488
            #6  0x00005654f1f2141a in buf_pool_t::assert_all_freed (this=this@entry=0x5654f308c680 <buf_pool>) at /data/Server/10.11-MDEV-34062/storage/innobase/buf/buf0buf.cc:3986
            #7  0x00005654f1f21470 in buf_pool_invalidate () at /data/Server/10.11-MDEV-34062/storage/innobase/buf/buf0buf.cc:4009
            #8  0x00005654f20fd594 in recv_sys_t::apply (this=this@entry=0x5654f3098080 <recv_sys>, last_batch=last_batch@entry=false) at /data/Server/10.11-MDEV-34062/storage/innobase/log/log0recv.cc:3997
            #9  0x00005654f2118199 in recv_sys_t::parse<recv_buf, true> (this=this@entry=0x5654f3098080 <recv_sys>, l=..., if_exists=if_exists@entry=true) at /data/Server/10.11-MDEV-34062/storage/innobase/log/log0recv.cc:2944
            #10 0x00005654f211894e in recv_sys_t::parse_mtr<true> (if_exists=if_exists@entry=true) at /data/Server/10.11-MDEV-34062/storage/innobase/log/log0recv.cc:3112
            #11 0x00005654f211aea2 in recv_sys_t::parse_mmap<true> (if_exists=if_exists@entry=true) at /data/Server/10.11-MDEV-34062/storage/innobase/log/log0recv.cc:3123
            #12 0x00005654f21029cc in recv_scan_log (last_phase=last_phase@entry=true) at /data/Server/10.11-MDEV-34062/storage/innobase/log/log0recv.cc:4161
            #13 0x00005654f21037e0 in recv_recovery_from_checkpoint_start () at /data/Server/10.11-MDEV-34062/storage/innobase/log/log0recv.cc:4678
            #14 0x00005654f2270fb2 in srv_start (create_new_db=create_new_db@entry=false) at /data/Server/10.11-MDEV-34062/storage/innobase/srv/srv0start.cc:1458
            #15 0x00005654f16f9935 in innodb_init () at /data/Server/10.11-MDEV-34062/extra/mariabackup/xtrabackup.cc:2627
            #16 0x00005654f170b768 in xtrabackup_prepare_func (argv=argv@entry=0x5654f4a7ab98) at /data/Server/10.11-MDEV-34062/extra/mariabackup/xtrabackup.cc:6819
            #17 0x00005654f17135c6 in main_low (argv=0x5654f4a7ab98) at /data/Server/10.11-MDEV-34062/extra/mariabackup/xtrabackup.cc:7754
            #18 0x00005654f1713819 in main (argc=10, argv=0x7ffdabaf0688) at /data/Server/10.11-MDEV-34062/extra/mariabackup/xtrabackup.cc:7538
            (rr)
            _RR_TRACE_DIR=./1/rr rr replay --mark-stdio                  # Fate of the source DB server
            _RR_TRACE_DIR=./1/rr rr replay --mark-stdio mariadb-backup-0 # mariadb-backup --backup
                                                                         # My "rr replay" failed, maybe because of
                                                                         # - a problem with rr, or
                                                                         # - the RQG test campaign running concurrently consuming too many resources
            _RR_TRACE_DIR=./1_clone/rr rr replay --mark-stdio            # mariadb-backup --prepare
            

            mleich Matthias Leich added a comment

            mleich, thank you. As far as I understand, none of the observed problems are directly related to this change. I filed MDEV-34307 for the still fixed or dirty crash.

            marko Marko Mäkelä added a comment

            I have an idea how to check for the read-before-write anomaly:

            perf record -e block:block_rq_issue -e block:block_rq_complete -a sleep 120
            

            The perf report would show R for any reads. I just tested this, "storing" the data directory in /dev/shm and pointing innodb_log_group_home_dir to actual storage. There were only W events with the two different innodb_log_file_size settings that I tested. I assume that the larger log file did not have a chance to wrap around (start overwriting itself) and that the memory-mapped buffer for the smaller log file was never swapped out. It seems that I would need a longer test, or to repeat this test on a machine with less RAM.

            I also have an idea how to avoid both page faults and the related read-before-write: Issue a pwrite to zero out the next page in the underlying file. Yes, this will incur a system call overhead, but the kernel should allocate a page and update the page table, so that on the next access from the user space the page will already be valid, and no read from the storage will be necessary.

            marko Marko Mäkelä added a comment

            I updated https://github.com/MariaDB/server/pull/3282; there had been some conflicts due to MDEV-33894. I did not yet test the above pwrite idea. On Microsoft Windows, the memory-mapped interface would be disabled by default also in mariadb-backup.

            marko Marko Mäkelä added a comment

            I tried again to reproduce the read-before-write anomaly. If I invoke perf record -e major-faults for the experiment where only the ib_logfile0 resides in persistent storage, then 100% of the samples will be attributed to __memmove_avx_unaligned_erms_rtm or mtr_t::finish_writer inside mtr_t::commit. The faults inside mtr_t::finish_writer are for writing the sequence bit or the CRC-32C.

            If I reduce the size of the workload so that the end LSN will be significantly below innodb_log_file_size (the file will not wrap around), there will still be major-faults for the log writes. I tested further with innodb_log_file_buffering=ON (not enabling O_DIRECT access) and changing innodb_log_file_size at server startup. Also in this case, major-faults were recorded at exactly the same spots.

            To test whether some pwrite (of garbage from the buffer pool) would prevent the major faults, I applied the following quick and dirty patch, which is based on my earlier fallocate(FALLOC_FL_ZERO_RANGE) attempt that I had found to be detrimental to performance:

            diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
            index 6ff7a64409e..fac13ce5064 100644
            --- a/storage/innobase/buf/buf0flu.cc
            +++ b/storage/innobase/buf/buf0flu.cc
            @@ -1942,21 +1942,29 @@ inline void log_t::write_checkpoint(lsn_t end_lsn) noexcept
                 const size_t start_offset{calc_lsn_offset(checkpoint_lsn) & ~(ps - 1)};
                 if (offset == start_offset);
                 else if (offset < start_offset)
            +    {
                   IF_WIN(VirtualAlloc(buf + offset, start_offset - offset,
                                       MEM_RESET, PAGE_READWRITE),
                          madvise(reinterpret_cast<char*>(buf) + offset,
                                  start_offset - offset, MADV_DONTNEED));
            +      pwrite(log.m_file, buf_pool.chunks->mem, start_offset - offset, offset);
            +    }
                 else
                 {
                   IF_WIN(VirtualAlloc(buf + START_OFFSET, start_offset - START_OFFSET,
                                       MEM_RESET, PAGE_READWRITE),
                          madvise(reinterpret_cast<char*>(buf) + START_OFFSET,
                                  start_offset - START_OFFSET, MADV_DONTNEED));
            +      pwrite(log.m_file, buf_pool.chunks->mem, start_offset - START_OFFSET,
            +             START_OFFSET);
                   offset+= ps;
                   if (size_t end_size= file_size - offset)
            +      {
                     IF_WIN(VirtualAlloc(buf + offset, end_size, MEM_RESET, PAGE_READWRITE),
                            madvise(reinterpret_cast<char*>(buf) + offset, end_size,
                                    MADV_DONTNEED));
            +        pwrite(log.m_file, buf_pool.chunks->mem, end_size, offset);
            +      }
                 }
               }
             #endif
            

            I still got roughly the same number of major-faults events during the 2-minute run (about 40,000), all of them in mtr_t::commit. This was without any O_DIRECT and with log resizing at the start of the run:

            10.11-MDEV-34062 e8dfe1fcc695baa61ac6ed263b22cca88fe0ff92 and the above

            2024-08-01 16:49:58 0 [Note] InnoDB: Memory-mapped log (block size=512 bytes)
            2024-08-01 16:49:58 0 [Note] InnoDB: End of log at LSN=46830
            2024-08-01 16:49:58 0 [Note] InnoDB: Resizing redo log from 1.000GiB to 2.000GiB; LSN=46830
            2024-08-01 16:49:58 0 [Note] InnoDB: Memory-mapped log (block size=512 bytes)
            

            What I have been unable to conclusively prove or disprove so far is an actual read from the block-level storage when a new page of the log buffer is being written. That said, the least risky option could be to limit memory-mapped access to reading the log in mariadb-backup or during InnoDB startup. On 64-bit systems, InnoDB could be started with memory-mapped log recovery, but at the end of recovery it would switch to regular log file I/O.

            marko Marko Mäkelä added a comment

            Testing this is currently blocked by MDEV-34750.

            marko Marko Mäkelä added a comment

            Clarification: Initially, I was experimenting with allowing memory-mapped log writes. But, to fix this regression we would only need log reads to be memory-mapped. On 2024-08-05 I had revised the suggested fix accordingly. By default, we would use memory-mapped read in InnoDB recovery and in mariadb-backup on 64-bit Linux and FreeBSD. On other systems, it would have to be enabled explicitly by a new option innodb_log_file_mmap=ON. Log writes would not be memory-mapped except on 64-bit Linux if persistent memory is present or being emulated on /dev/shm.

            The regression occurs because, starting with MDEV-14425, the log is no longer divided into fixed-size (512-byte) blocks, but into mini-transaction-sized blocks. Therefore, upon reading a number of blocks from the ib_logfile0, we cannot quickly compute checksums of 512-byte blocks, but must parse the individual log records and verify the checksum of each individual mini-transaction. A small innodb_log_buffer_size would force frequent costly re-parsing. Before MDEV-14425, the buffer size for reading 512-byte blocks was hard-coded as 4*innodb_page_size (by default, 64 KiB). That was not a problem, because there was no need to partially re-read the same blocks over and over again.

            With a memory-mapped log file read interface, which has been covered on our CI thanks to the PMEM emulation on /dev/shm, the log parsing buffer size is basically infinite, and no costly re-parsing will be necessary.

            marko Marko Mäkelä added a comment
            marko Marko Mäkelä added a comment - edited

            There was a possibility that simplifying the log record parsing function in https://github.com/MariaDB/server/pull/3507 (originally filed as something simpler that would only fix a bogus debug assertion MDEV-34907 instead of removing an entire code branch) would make the parsing fast enough even with the minimal innodb_log_buffer_size=2m.

            It turns out that simplifying the log parsing function is not sufficient for addressing this. I tested a combined branch that contains both fixes and invoked the scenario that I had written down in my first 2024-05-21 comment in this ticket:

            sysbench /usr/share/sysbench/oltp_update_index.lua --mysql-socket="$MYSQL_SOCK" --mysql-user="$MYSQL_USER" --mysql-db=test \
            --tables=64 --table_size=100000 prepare &
            mariadb-backup --innodb-log-buffer-size=2m --innodb-log-file-mmap=off -u"$MYSQL_USER" --socket="$MYSQL_SOCK" --backup --target-dir=/dev/shm/backup
            

            In an invocation like this, it still takes an extremely long time to copy the log, long after the sysbench prepare completed:

            [00] 2024-09-25 16:18:53 Waiting for log copy thread to read lsn 1816982998
            [00] 2024-09-25 16:19:05 DDL tracking : create 6 "./test/sbtest2.ibd"
            [00] 2024-09-25 16:19:18 DDL tracking : create 7 "./test/sbtest3.ibd"
            …
            [00] 2024-09-25 16:25:26 DDL tracking : create 34 "./test/sbtest30.ibd"
            [00] 2024-09-25 16:25:40 DDL tracking : create 35 "./test/sbtest31.ibd"
            [00] 2024-09-25 16:25:54 DDL tracking : create 36 "./test/sbtest32.ibd"
            …
            [00] 2024-09-25 16:28:51 DDL tracking : create 49 "./test/sbtest45.ibd"
            [00] 2024-09-25 16:29:05 DDL tracking : create 50 "./test/sbtest46.ibd"
            [00] 2024-09-25 16:29:18 DDL tracking : create 51 "./test/sbtest47.ibd"
            …
            [00] 2024-09-25 16:31:07 DDL tracking : create 59 "./test/sbtest55.ibd"
            [00] 2024-09-25 16:31:21 DDL tracking : create 60 "./test/sbtest56.ibd"
            [00] 2024-09-25 16:31:35 DDL tracking : create 61 "./test/sbtest57.ibd"
            …
            

            If I specify an innodb_log_buffer_size of 1 GiB or do not disable the use of memory-mapped log reads (which are implemented in https://github.com/MariaDB/server/pull/3282), the backup will finish basically instantly.


            On a positive note, the backup was completed successfully:

            [00] 2024-09-25 16:33:25 Copying ./test/sbtest64.ibd to /dev/shm/backup/test/sbtest64.new
            [00] 2024-09-25 16:33:25         ...done
            [00] 2024-09-25 16:33:25 Start copying aria log file tail: /dev/shm/duuba//aria_log.00000001
            [00] 2024-09-25 16:33:25 Stop copying aria log file tail: /dev/shm/duuba//aria_log.00000001, copied 0 bytes
            [00] 2024-09-25 16:33:25 BACKUP STAGE BLOCK_COMMIT
            [00] 2024-09-25 16:33:25 Copied file ./mysql/slow_log.CSM for log table `mysql`.`slow_log`, 35 bytes
            [00] 2024-09-25 16:33:25 Copied file ./mysql/slow_log.CSV for log table `mysql`.`slow_log`, 0 bytes
            [00] 2024-09-25 16:33:25 Copied file ./mysql/general_log.CSM for log table `mysql`.`general_log`, 35 bytes
            [00] 2024-09-25 16:33:25 Copied file ./mysql/general_log.CSV for log table `mysql`.`general_log`, 0 bytes
            [00] 2024-09-25 16:33:25 mariabackup: The latest check point (for incremental): '47149'
            [00] 2024-09-25 16:33:25 Waiting for log copy thread to read lsn 18446744073709551615
            [00] 2024-09-25 16:33:30 Start copying statistics aria tables.
            [00] 2024-09-25 16:33:30 aria table file ./mysql/index_stats.MAI is copied successfully.
            [00] 2024-09-25 16:33:30 aria table file ./mysql/index_stats.MAD is copied successfully.
            [00] 2024-09-25 16:33:30 aria table file ./mysql/column_stats.MAI is copied successfully.
            [00] 2024-09-25 16:33:30 aria table file ./mysql/column_stats.MAD is copied successfully.
            [00] 2024-09-25 16:33:30 aria table file ./mysql/table_stats.MAI is copied successfully.
            [00] 2024-09-25 16:33:30 aria table file ./mysql/table_stats.MAD is copied successfully.
            [00] 2024-09-25 16:33:30 Stop copying statistics aria tables.
            [00] 2024-09-25 16:33:30 Start copying aria log file tail: /dev/shm/duuba//aria_log.00000001
            [00] 2024-09-25 16:33:30 Stop copying aria log file tail: /dev/shm/duuba//aria_log.00000001, copied 8192 bytes
            [00] 2024-09-25 16:33:30 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
            [00] 2024-09-25 16:33:30 BACKUP STAGE END
            [00] 2024-09-25 16:33:30 Executing BACKUP STAGE END
            [00] 2024-09-25 16:33:30 All tables unlocked
            [00] 2024-09-25 16:33:30 Backup created in directory '/dev/shm/backup/'
            [00] 2024-09-25 16:33:30 Writing backup-my.cnf
            [00] 2024-09-25 16:33:30         ...done
            [00] 2024-09-25 16:33:30 Writing xtrabackup_info
            [00] 2024-09-25 16:33:30         ...done
            [00] 2024-09-25 16:33:30 Redo log (from LSN 47149 to 1816982998) was copied.
            [00] 2024-09-25 16:33:30 completed OK!
            

            I think that it is a bit excessive that copying 3.6 GiB of files from RAM disk takes some 15 minutes, while the actual server workload (creating and populating a few tables) was completed in less than a minute.

            marko Marko Mäkelä added a comment

            People

              Assignee: marko Marko Mäkelä
              Reporter: marko Marko Mäkelä