Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Fixed
-
10.11
Description
While preparing to test MDEV-33817 with a recovery-type workload (MDEV-29911), I noticed that in MariaDB Server 10.11 (but not 10.6), copying the InnoDB write-ahead log (innodb_logfile0) is taking an extremely long time, copying at most 3 MiB/s from a RAM disk. Here are some perf record samples from my local system:
10.11 9dfef3fb4178e44b3f3da029995ef11df8360235 on Debian linux-image-6.7.12-amd64 (2024-04-24) |
44,19% mariadb-backup mariadb-backup [.] recv_sys_t::parse_mtr_result recv_sys_t::parse<recv_buf, false>(recv_buf&, bool)
|
12,35% mariadb-backup [kernel.kallsyms] [k] rep_movs_alternative
|
8,14% mariadb-backup mariadb-backup [.] crc32c_3way
|
7,12% mariadb-backup [kernel.kallsyms] [k] clear_page_erms
|
1,81% mariadb-backup [kernel.kallsyms] [k] filemap_get_entry
|
1,78% mariadb-backup [kernel.kallsyms] [k] shmem_file_read_iter
|
1,60% mariadb-backup [kernel.kallsyms] [k] mem_cgroup_commit_charge
|
1,10% mariadb-backup [kernel.kallsyms] [k] get_mem_cgroup_from_mm
|
1,10% mariadb-backup [kernel.kallsyms] [k] shmem_get_folio_gfp
|
0,96% mariadb-backup [kernel.kallsyms] [k] xas_descend
|
0,92% mariadb-backup [kernel.kallsyms] [k] get_page_from_freelist
|
0,92% mariadb-backup mariadb-backup [.] recv_sys_t::parse_mtr_result recv_sys_t::parse_mtr<false>(bool)
|
0,86% mariadb-backup [kernel.kallsyms] [k] try_charge_memcg
|
0,80% mariadb-backup mariadb-backup [.] my_crc32c
|
0,74% mariadb-backup [kernel.kallsyms] [k] __list_del_entry_valid_or_report
|
0,61% mariadb-backup [kernel.kallsyms] [k] __rmqueue_pcplist
|
0,55% mariadb-backup [kernel.kallsyms] [k] lru_gen_add_folio
|
0,55% mariadb-backup [kernel.kallsyms] [k] _copy_to_iter
|
0,50% mariadb-backup [kernel.kallsyms] [k] __rcu_read_unlock
|
0,49% mariadb-backup [kernel.kallsyms] [k] alloc_pages_mpol
|
0,49% mariadb-backup mariadb-backup [.] xtrabackup_copy_logfile()
|
The problem appears to be the default value of innodb_log_buffer_size (log_sys.buf_size), which is only 2 MiB. If I specify --innodb-log-buffer-size=1g, then xtrabackup_copy_logfile() will process much more data per a system call, the backup would finish in reasonable time during my Sysbench workload, and the top of perf report would look more reasonable as well:
10.11 9dfef3fb4178e44b3f3da029995ef11df8360235 |
31,06% mariadb-backup mariadb-backup [.] recv_sys_t::parse_mtr_result recv_sys_t::parse<recv_buf, false>(recv_buf&, bool)
|
20,14% mariadb-backup [kernel.kallsyms] [k] rep_movs_alternative
|
10,01% mariadb-backup [kernel.kallsyms] [k] clear_page_erms
|
5,72% mariadb-backup mariadb-backup [.] crc32c_3way
|
3,86% mariadb-backup libc.so.6 [.] __memmove_avx_unaligned_erms_rtm
|
1,58% mariadb-backup [kernel.kallsyms] [k] filemap_get_entry
|
1,57% mariadb-backup libc.so.6 [.] __memcmp_avx2_movbe_rtm
|
1,50% mariadb-backup [kernel.kallsyms] [k] shmem_get_folio_gfp
|
1,40% mariadb-backup [kernel.kallsyms] [k] xas_descend
|
1,04% mariadb-backup [kernel.kallsyms] [k] get_mem_cgroup_from_mm
|
1,02% mariadb-backup [kernel.kallsyms] [k] __mem_cgroup_charge
|
0,98% mariadb-backup [kernel.kallsyms] [k] __rmqueue_pcplist
|
0,96% mariadb-backup [kernel.kallsyms] [k] shmem_file_read_iter
|
0,93% mariadb-backup [kernel.kallsyms] [k] fault_in_readable
|
0,92% mariadb-backup [kernel.kallsyms] [k] get_page_from_freelist
|
0,78% mariadb-backup mariadb-backup [.] recv_sys_t::parse_mtr_result recv_sys_t::parse_mtr<false>(bool)
|
0,68% mariadb-backup [kernel.kallsyms] [k] xas_load
|
0,59% mariadb-backup [kernel.kallsyms] [k] xas_start
|
0,58% mariadb-backup [kernel.kallsyms] [k] try_charge_memcg
|
0,57% mariadb-backup [kernel.kallsyms] [k] _copy_to_iter
|
0,52% mariadb-backup [kernel.kallsyms] [k] copy_page_from_iter_atomic
|
0,52% mariadb-backup [kernel.kallsyms] [k] lru_gen_add_folio
|
0,50% mariadb-backup [kernel.kallsyms] [k] __alloc_pages
|
0,44% mariadb-backup [kernel.kallsyms] [k] shmem_add_to_page_cache
|
0,40% mariadb-backup [kernel.kallsyms] [k] generic_perform_write
|
0,39% mariadb-backup [kernel.kallsyms] [k] __mod_memcg_lruvec_state
|
0,38% mariadb-backup [kernel.kallsyms] [k] copy_page_to_iter
|
0,35% mariadb-backup [kernel.kallsyms] [k] lru_add_fn
|
0,34% mariadb-backup [kernel.kallsyms] [k] __rcu_read_lock
|
0,34% mariadb-backup mariadb-backup [.] xtrabackup_copy_logfile()
|
The large share of kernel functions should be due to file I/O. This was on /dev/shm. The function xtrabackup_copy_logfile() is always invoking pread system calls to read the ib_logfile0.
Before MDEV-14425, the default value of innodb_log_buffer_size in mariadb-backup was 1 MiB, but as far as I understand, the parameter was being ignored and the log was apparently being read RECV_SCAN_SIZE bytes at a time, that is, 4*innodb_page_size (by default, 64 KiB). This must have been acceptable when the log file was organized in blocks of 512 bytes. With the variable-size logical block format that was introduced in MDEV-14425, we’d apparently better read larger chunks at a time.
I retested this on the 72-thread test system where I originally hit the problem yesterday. At innodb_log_buffer_size=128m the backup would manage to copy the log, but the operation would take longer than 40 seconds, at which point the server had been shut down by the benchmark script.
[00] FATAL ERROR: 2024-05-02 23:10:26 failed to execute query BACKUP STAGE BLOCK_COMMIT: Server has gone away
|
At innodb_log_buffer_size=256m it is better:
[00] 2024-05-02 23:12:55 Finished backing up non-InnoDB tables and files
|
[00] 2024-05-02 23:12:55 Waiting for log copy thread to read lsn 2039243060
|
…
|
[00] 2024-05-02 23:13:03 Redo log (from LSN 46894 to 2645660131) was copied.
|
[00] 2024-05-02 23:13:03 completed OK!
|
On my local system (dual socket Xeon E5-2630 v4, 40 concurrent Sysbench connections to match the 2×2×10 hardware threads), the backup can barely keep up with innodb_log_buffer_size=32m; the backup would finish in 42 seconds. With innodb_log_buffer_size=128m it would finish in 11 seconds. Note that this is an inaccurate test setup; I was manually starting the backup roughly 10 seconds into the write workload, and the InnoDB data files are growing during this workload.
Possibly we should increase the default value of innodb_log_file_size, based on some more testing in various environments (NVMe, SSD, HDD of different speeds). My test setting is completely artificial; the /dev/shm/*/ib_logfile0 will be written by the server via the memory-mapped log_sys.buf while the backup is using regular file system calls. The original motivation of this test was to measure the performance of a subsequent test (mariadb-backup --prepare).
This observation is related to MDEV-30000 and MDEV-31410 too. It could be a good idea to SET GLOBAL innodb_log_file_buffering=ON in the server during the backup, and specify a larger innodb_log_buffer_size.
Ultimately, we should implement the log copying in the server process (MDEV-14992).
Attachments
Issue Links
- blocks
-
MDEV-34850 Unable to create backup with mariabackup: try increasing innodb_log_file_size
-
- Closed
-
- causes
-
MDEV-35608 Fake PMEM on /dev/shm no longer works
-
- Closed
-
-
MDEV-35785 innodb_log_file_mmap is not defined on 32-bit systems
-
- Closed
-
- is blocked by
-
MDEV-34750 SET GLOBAL innodb_log_file_size is not crash safe
-
- Closed
-
- relates to
-
MDEV-31410 mariadb-backup prepare crash with InnoDB: Missing FILE_CREATE, FILE_DELETE or FILE_MODIFY before FILE_CHECKPOINT
-
- Closed
-
-
MDEV-33894 MariaDB does unexpected storage read IO for the redo log
-
- Closed
-
-
MDEV-34070 mariadb-backup --prepare with larger --use-memory may be slower
-
- Confirmed
-
-
MDEV-34307 Recovery or mariadb-backup --prepare debug failure [FATAL] InnoDB: Page ... still fixed or dirty
-
- Closed
-
-
MDEV-14425 Change the InnoDB redo log format to reduce write amplification
-
- Closed
-
-
MDEV-14992 BACKUP: in-server backup
-
- Open
-
-
MDEV-19492 Mariabackup hangs if table populated with INSERT... SELECT while it runs
-
- Stalled
-
-
MDEV-29911 InnoDB recovery and mariadb-backup --prepare fail to report detailed progress
-
- Closed
-
-
MDEV-30000 make mariadb-backup to force an innodb checkpoint
-
- Closed
-
-
MDEV-31410 mariadb-backup prepare crash with InnoDB: Missing FILE_CREATE, FILE_DELETE or FILE_MODIFY before FILE_CHECKPOINT
-
- Closed
-
-
MDEV-33817 Implement AVX512BW and VPCLMULQDQ based CRC-32 algorithms
-
- Closed
-
-
MDEV-36159 mariabackup failed after upgrade 10.11.10
-
- In Progress
-
I tried again to reproduce the read-before-write anomaly. If I invoke perf record -e major-faults for the experiment where only the ib_logfile0 resides in persistent storage, then 100% of the samples will be attributed to __memmove_avx_unaligned_erms_rtm or mtr_t::finish_writer inside mtr_t::commit. The faults inside mtr_t::finish_writer are for writing the sequence bit or the CRC-32C.
If I reduce the size of the workload so that the end LSN will be significantly below innodb_log_file_size (the file will not wrap around), there will still be major-faults for the log writes. I tested further with innodb_log_file_buffering=ON (not enabling O_DIRECT access) and changing innodb_log_file_size on the server startup. Also in this case, major-faults were recorded at exactly the same spots.
To test whether some pwrite (of garbage from the buffer pool) would prevent the major faults, I applied the following quick and dirty patch, which is based on my earlier fallocate(FALLOC_FL_ZERO_RANGE attempt, which I found to be detrimental to performance:
diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
index 6ff7a64409e..fac13ce5064 100644
--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -1942,21 +1942,29 @@ inline void log_t::write_checkpoint(lsn_t end_lsn) noexcept
const size_t start_offset{calc_lsn_offset(checkpoint_lsn) & ~(ps - 1)};
if (offset == start_offset);
else if (offset < start_offset)
+ {
IF_WIN(VirtualAlloc(buf + offset, start_offset - offset,
MEM_RESET, PAGE_READWRITE),
madvise(reinterpret_cast<char*>(buf) + offset,
start_offset - offset, MADV_DONTNEED));
+ pwrite(log.m_file, buf_pool.chunks->mem, start_offset - offset, offset);
+ }
else
{
IF_WIN(VirtualAlloc(buf + START_OFFSET, start_offset - START_OFFSET,
MEM_RESET, PAGE_READWRITE),
madvise(reinterpret_cast<char*>(buf) + START_OFFSET,
start_offset - START_OFFSET, MADV_DONTNEED));
+ pwrite(log.m_file, buf_pool.chunks->mem, start_offset - START_OFFSET,
+ START_OFFSET);
offset+= ps;
if (size_t end_size= file_size - offset)
+ {
IF_WIN(VirtualAlloc(buf + offset, end_size, MEM_RESET, PAGE_READWRITE),
madvise(reinterpret_cast<char*>(buf) + offset, end_size,
MADV_DONTNEED));
+ pwrite(log.m_file, buf_pool.chunks->mem, end_size, offset);
+ }
}
}
I still got roughly the same amount of major-faults events during the 2-minute run (about 40,000), all of them in mtr_t::commit. This was without any O_DIRECT and with log resizing at the start of the run:
10.11-MDEV-34062 e8dfe1fcc695baa61ac6ed263b22cca88fe0ff92 and the above
2024-08-01 16:49:58 0 [Note] InnoDB: Memory-mapped log (block size=512 bytes)
2024-08-01 16:49:58 0 [Note] InnoDB: End of log at LSN=46830
2024-08-01 16:49:58 0 [Note] InnoDB: Resizing redo log from 1.000GiB to 2.000GiB; LSN=46830
2024-08-01 16:49:58 0 [Note] InnoDB: Memory-mapped log (block size=512 bytes)
What I have been unable to conclusively prove or disprove so far is an actual read from the block level storage when a new page of the log buffer is being written. That said, it could be least risky to limit the memory-mapped access to reading the log in mariadb-backup or in the InnoDB startup. On 64-bit systems, InnoDB could be started with memory-mapped log recovery, but at the end of recovery it would switch to regular log file I/O.