Details

    Description

      MySQL WL#10310 optimizes the redo log (lock-free, concurrent writes to the log buffer). Does MariaDB plan to optimize the redo log similarly?

      Reference material - https://dev.mysql.com/blog-archive/mysql-8-0-new-lock-free-scalable-wal-design/

          Activity

            marko Marko Mäkelä added a comment -

            I am not convinced that a lock-free algorithm is always better than one that uses mutexes. It could lead to lots of busy work (wasted CPU cycles in polling loops).

            In MDEV-14425, we plan to modify the InnoDB redo log file format in a way that minimizes the work done while holding a mutex (encrypting data and computing checksums). The new file format would also be compatible with any physical block size, from the smallest write size of persistent memory (64 bytes?) up to the optimal write size on an SSD (supposedly at least 4096 bytes).

            MDEV-14462 mentions another idea to try: on mtr_t::commit(), do not write the log, but pass the work to a dedicated log writer task. We would have to validate this idea by prototyping; I cannot guarantee that it would help much, especially after MDEV-14425 has been implemented.

            MDEV-12353 and MDEV-21724 redefined the redo log record format in MariaDB 10.5.2. Because of the mutex contention that we have until MDEV-14425 has been implemented, even a small change to the redo log volume makes a large difference.

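            A hedged C++ sketch of the "less work under the mutex" idea: claim a slot in a short critical section, then do the expensive copying and checksumming outside of it. Every name here (append_log, log_block, cheap_checksum) is invented for illustration, a toy hash stands in for CRC-32C, and this is not the MDEV-14425 implementation.

            #include <cstdint>
            #include <cstring>
            #include <mutex>

            struct log_block { uint8_t payload[64]; size_t len; uint32_t checksum; };

            static std::mutex log_mutex;
            static log_block log_buf[1024];   // bounds checks omitted for brevity
            static size_t n_blocks;           // protected by log_mutex

            static uint32_t cheap_checksum(const uint8_t *p, size_t n)
            {
              uint32_t h = 2166136261u;       // FNV-1a, a stand-in for CRC-32C
              while (n--) h = (h ^ *p++) * 16777619u;
              return h;
            }

            void append_log(const uint8_t *rec, size_t len)
            {
              size_t slot;
              {
                std::lock_guard<std::mutex> g(log_mutex);
                slot = n_blocks++;            // the only work done under the mutex
              }
              log_block &b = log_buf[slot];   // this thread now owns the slot
              std::memcpy(b.payload, rec, len); // assumes len <= sizeof b.payload
              b.len = len;
              b.checksum = cheap_checksum(b.payload, len); // outside the mutex
            }
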
            yaojiapeng peng added a comment -

            How much of a performance improvement is expected from MDEV-14425?


            wlad Vladislav Vaintroub added a comment -

            I think reducing pressure on log_sys.mutex would be fine, and a parallel copy to the redo log buffer is not a bad thing. I do not think this is particularly hard, but you need to wait for all copies to complete prior to writing to the disk. My first guess is that it could be done with just a single atomic counter.

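            A minimal sketch of the parallel copy, under stated assumptions: wlad suggests a single atomic counter, but for clarity this toy version uses two (bytes reserved and bytes copied) and omits wrap-around and bounds handling. All names (reserved, completed, wait_for_copies) are invented.

            #include <atomic>
            #include <cstdint>
            #include <cstring>

            static uint8_t log_buf[1 << 20];
            static std::atomic<size_t> reserved{0};   // bytes handed out to writers
            static std::atomic<size_t> completed{0};  // bytes whose memcpy finished

            void append(const uint8_t *rec, size_t len)
            {
              size_t pos = reserved.fetch_add(len, std::memory_order_relaxed);
              std::memcpy(log_buf + pos, rec, len);   // copies run in parallel
              completed.fetch_add(len, std::memory_order_release);
            }

            // Called by the single writer thread, with new reservations blocked
            // (e.g. under an exclusive latch): once completed catches up with
            // reserved, the buffer prefix is gap-free and can be written to disk.
            size_t wait_for_copies()
            {
              const size_t target = reserved.load(std::memory_order_relaxed);
              while (completed.load(std::memory_order_acquire) != target)
                /* real code would pause or yield here */;
              return target;
            }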

            marko Marko Mäkelä added a comment -

            The durability of a mini-transaction would only be guaranteed when there are no ‘gaps’ in the stream. Say, if the log for mini-transactions 101, 102, 103, 104, 105 is ‘in flight’, but a part of the log segment that was reserved for mini-transaction 102 was not written to durable storage before the server was killed, then we could only recover everything up to the end of mini-transaction 101.

            Implementing the MDEV-14425 format does not prevent allowing multiple concurrent writers. We might even allow them for PMEM, a.k.a. NVDIMM, a.k.a. DCPMM. In that case, log_sys.buf would point to the memory-mapped circular log file on a mount -o dax file system, and the ‘block size’ would likely be 64 bytes, corresponding to the cache line width. For PMEM, durability should be achieved by executing instructions that flush the CPU cache line(s) corresponding to the byte range.

            For higher-latency storage, such as hard disks or SSDs, supporting multiple concurrent writers could be beneficial even when using a more flexible file format. There are some noteworthy ideas in the MySQL 8.0 design, but I would prefer fewer ‘coordinator’ or ‘maintenance’ threads and generally something event-based instead of polling or timeouts.

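            The gap rule can be made concrete with a toy C++ tracker (all names invented; purely illustrative, not recovery code): maintain the contiguous durable prefix and a set of completed ranges above it.

            #include <cstdint>
            #include <map>

            using lsn_t = uint64_t;

            struct gap_tracker
            {
              lsn_t durable = 0;                // everything below this is contiguous
              std::map<lsn_t, lsn_t> pending;   // completed ranges above a gap

              // report that the log range [start, end) reached durable storage
              void complete(lsn_t start, lsn_t end)
              {
                pending.emplace(start, end);
                // absorb any ranges that now touch the contiguous prefix
                for (auto i = pending.begin();
                     i != pending.end() && i->first <= durable;
                     i = pending.begin())
                {
                  if (i->second > durable) durable = i->second;
                  pending.erase(i);
                }
              }
            };

            // With mini-transactions 101..105 in flight, if the range of 102 never
            // completes, durable stops at the end of 101, exactly as stated above.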

            wlad Vladislav Vaintroub added a comment -

            Well, finding gaps is something that both recovery and log_write_up_to() should take care of.
            If we tweak the current design, without multiple writers but with parallel memcpy to the redo log buffer, only log_write_up_to() (for both the "followers" that wait and the "leader" who writes and flushes) would have to take care of the gaps.


            wlad Vladislav Vaintroub added a comment -

            As for the MySQL 8.0 design: yes, multiplying threads that only do one thing, threads that only wake up other threads, and so on, is much, much too involved.

            I think that for parallel memcpy to the redo log buffer, no extra threads are necessary.

            I think one can do fine with at most one background task if we want lower wait latency (e.g. if log_write_up_to() would start an async task to flush the redo buffer, and the current foreground user thread would check whether its LSN has completed later, just prior to writing the network packet). This can even be improved upon for the threadpool, where the foreground thread would not have to wait at all, but would instead take over work from other users.

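            A rough sketch of the "at most one background task" idea, with invented names (request_flush, flushed_lsn): commit hands the flush to a background task, and the foreground thread only verifies, right before writing the network packet, that its own LSN has become durable.

            #include <atomic>
            #include <cstdint>
            #include <thread>

            using lsn_t = uint64_t;

            static std::atomic<lsn_t> flush_requested{0};
            static std::atomic<lsn_t> flushed_lsn{0};

            // Cheap and non-blocking: just raise the target for the (single)
            // background flush task, which advances flushed_lsn after fsync.
            void request_flush(lsn_t up_to)
            {
              lsn_t cur = flush_requested.load(std::memory_order_relaxed);
              while (cur < up_to &&
                     !flush_requested.compare_exchange_weak(
                         cur, up_to, std::memory_order_relaxed))
              {}
            }

            void commit_and_reply(lsn_t my_lsn)
            {
              request_flush(my_lsn);
              /* ... useful foreground work: build the OK packet, etc. ... */
              while (flushed_lsn.load(std::memory_order_acquire) < my_lsn)
                std::this_thread::yield();  // a threadpool could steal work instead
              /* ... now write the network packet ... */
            }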

            marko Marko Mäkelä added a comment -

            Wasn’t this adequately addressed by MDEV-27774 and MDEV-33515 in MariaDB Server 10.11? We do need a short-duration mutex or spinlock for allocating an LSN range for writing the current mini-transaction. Multiple memcpy() from mtr_t::m_log to log_sys.buf can run concurrently while the threads are holding a shared log_sys.latch. An exclusive log_sys.latch will be held during some DDL operations as well as during a log checkpoint, to ensure that everything has been written contiguously to log_sys.buf.


            marko Marko Mäkelä added a comment -

            Steve Shaw has pointed out that a severe bottleneck remains in this area. MDEV-33515 would basically just be a work-around for some CPU implementations.

            The MariaDB 10.11 function log_t::append_prepare() is close to the MySQL 8.0 function log_buffer_reserve(). In MariaDB, log_sys.lock_lsn() covers several data fields. In MySQL 8.0, there is a fast path that basically just advances log_sys.lsn by the requested length with a std::atomic::fetch_add() operation. I think that it should be possible to do something similar, by refactoring mtr_t::finish_write(size_t) and its callers in such a way that when the fast path (covered by a shared log_sys.latch) cannot be used due to running out of buffer capacity, we will acquire an exclusive log_sys.latch and repeat the procedure.

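            A sketch of that refactoring, with std::shared_mutex standing in for log_sys.latch and all other names illustrative. To sidestep the over-allocation problem (which the back-off flag discussed later in this thread addresses), this sketch reserves the LSN range with a compare-exchange loop instead of a blind fetch_add().

            #include <atomic>
            #include <cstddef>
            #include <cstdint>
            #include <shared_mutex>

            using lsn_t = uint64_t;

            struct log_sys_t
            {
              std::shared_mutex latch;            // stand-in for log_sys.latch
              std::atomic<lsn_t> lsn{0};
              lsn_t base_lsn = 0;                 // LSN of buf[0]; only changed
                                                  // while latch is held exclusively
              static constexpr size_t capacity = 1 << 20;
              unsigned char buf[capacity];
            };

            static log_sys_t log_sys;

            // Returns the start LSN; the shared latch is still held, so the caller
            // can memcpy() into buf[start - base_lsn] and then release it.
            lsn_t append_prepare(size_t size)
            {
              for (;;)
              {
                log_sys.latch.lock_shared();
                lsn_t start = log_sys.lsn.load(std::memory_order_relaxed);
                while (start + size - log_sys.base_lsn <= log_sys_t::capacity)
                  if (log_sys.lsn.compare_exchange_weak(start, start + size,
                                                        std::memory_order_relaxed))
                    return start;                 // fast path: shared latch, one CAS
                // slow path: out of buffer; write it out under the exclusive latch
                log_sys.latch.unlock_shared();
                log_sys.latch.lock();
                /* ... write buf to the log file and advance base_lsn, unless
                   another thread already did so ... */
                log_sys.latch.unlock();
              }
            }
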
            wlad Vladislav Vaintroub added a comment - edited

            As always , "severe" needs flamegraphs , and they are best obtained from multiple machines. Some with numa, some without. We have had severe regressions back in the day, which would be regressions just on a mainstream hardware or different OS, which would be improvements on Steve's hardware or OS.

            Steve Shaw Steve Shaw added a comment -

            The hardware recently tested was an Intel Xeon 6787P (2 × 86 cores), in both single- and dual-socket configurations, with Ubuntu 22.
            For a single socket, the top functions when running a HammerDB TPROC-C workload were:

            54.10% [kernel] [k] native_queued_spin_lock_slowpath.part.0
            6.96% mariadbd [.] ssux_lock_impl<true>::rd_wait

            Adding the parameter innodb_log_spin_wait_delay=50 improved performance by 12% and changed the top functions as follows:

            34.87% mariadbd [.] lsn_delay
            17.94% [kernel] [k] native_queued_spin_lock_slowpath.part.0

            This indicates where the workload is spending a lot of its time. Historically, OLTP testing of MariaDB on recent generations of Intel systems has shown native_queued_spin_lock_slowpath.part.0 as the top function, and previous investigations have suggested that it is related to the LSN mutex.

            wlad Vladislav Vaintroub added a comment - edited

            54% is not great, and 34% is not great either. Adding spins is usually bad on mainstream hardware.
            How much time do those functions take on non-NUMA hardware?

            Also, a flamegraph usually gives a better overview, with the full call stacks of the bottleneck functions, and gives a better idea of how to improve the locking, ideally without adding spins for every possible problem.

            I think Intel's VTune was pretty good at showing bottlenecks, but standard perf-based flamegraphs are already good enough.


            marko Marko Mäkelä added a comment -

            I think that it should be feasible to remove log_sys.buf_free (which we currently update in the same atomic critical section as log_sys.lsn) and introduce a new field log_sys.buf_start_lsn, which would reflect the log sequence number corresponding to the start of log_sys.buf.

            In this way, instead of having a "memory transaction" consisting of at least 4 atomic operations, we would have only one log_sys.lsn.fetch_add(size) in the "fast path". This should benefit all systems. I’d like to point out to wlad that it is increasingly common to have multiple DRAM buses in modern CPUs. For years there have been CPUs that feature multiple "chiplets" per package and up to 8 NUMA nodes per socket. I know of such implementations of x86-64 and ARMv8, and I would expect them to exist for other ISAs as well.

            The new field log_sys.buf_start_lsn that I am proposing would only be updated when log_sys.buf is being "shifted" or replaced during a write to a file. Such operations are covered by an exclusive log_sys.latch. In this way, we should be able to allocate an LSN and log buffer space for each mtr_t::commit() thread by invoking a rather simple log_sys.lsn.fetch_add(size) (80486 lock xadd) while holding log_sys.latch in shared or exclusive mode.

            If the write position that we derive from log_sys.lsn and log_sys.buf_start_lsn would reside outside the bounds of log_sys.buf, then some back-off logic would release log_sys.latch, trigger a log write or checkpoint, reacquire the latch, and finally use the already allocated LSN and the "shifted" buffer for the write. We may need one more field to ensure that log_sys.write_lsn will be advanced exactly once while any threads are inside such a back-off wait. That field would only be accessed under an exclusive log_sys.latch or inside the back-off code path; it would not be part of the "fast path".

            wlad Vladislav Vaintroub added a comment - edited

            Now, the "Description" field is a bit out of place, especially since the title changed. To answer yaojiapeng , yes, there are multiple improvements to writing and flushing the log, back in 10.5 already . Once locking around file write and file flush (especially Innodb group commit) were fixed in MDEV-21534, it shifted to other bottlenecks arouse - first to copy-to-redo-log-bufer in parallel, and then when that was fixed, to reserving space in redo log, aka LSN allocation, which is a tiny function.
            On big machines, with NUMA and many-many cores. In a benchmark, which emphasizes TPS numbers over durability ( innodb_flush_log_at_trx_commit=0, no doublewrite, and all other tricks to avoid fsync) this function is claimed to be a bottleneck, although flamegraphs that would prove it are missing still.

            marko's current work it to get rid of the locks in this tiny function, in common fast-path case.

            wlad Vladislav Vaintroub added a comment - edited

            marko, yes, I do not blame NUMA specifically. I recall soft NUMA on AMDs (I think ever since the Opterons), where memory accesses on a foreign node were cheap. I blame Intel NUMA, but especially Linux/libc: it is only because Linux cannot come up with a mutex/futex/anything that works well on those machines that software engineers elsewhere are forced to create their NIH spinning mutexes, and that does not really work well either.


            marko Marko Mäkelä added a comment -

            I think I may have figured out a solution. The fast path of log_t::append_prepare() would:

            1. Atomically increment the performance counter log_sys.write_to_buf.
              • Atomically, because normally it is only protected by a shared log_sys.latch.
              • I wish we did not have so many counters.
            2. If lsn.fetch_add(size, std::memory_order_relaxed) would cause a buffer overflow or indicate that a back-off is in progress, then we would keep retrying after invoking back-off logic that does the following:
              1. Acquire a new log_sys.wrap_mutex.
              2. Increment log_sys.waits (another performance counter, which could actually be useful).
              3. Prepare to set the back-off flag in log_sys.lsn.
                • MySQL 8.0 seems to reserve 1 bit for something, seemingly limiting the LSN from 64 to 63 bits, which could break compatibility.
                • We could use some clever logic that inverts the most significant bit as part of the fetch_sub() below, so that all 64 bits will remain available for payload; this will work as long as innodb_log_file_size fits in 63 bits.
                • We could read the current value of the flag with log_sys.lsn.load(std::memory_order_relaxed) and declare that its changes are protected by log_sys.wrap_mutex.
              4. log_sys.lsn.fetch_sub(size + flag /* see above */, std::memory_order_relaxed)
              5. Release the log_sys.wrap_mutex.
              6. Poll log_sys.lsn until the overflow condition no longer holds (wait for other concurrent threads to complete their back-off).
              7. Temporarily release log_sys.latch and invoke log_write_up_to(), which would clear the flag while holding an exclusive log_sys.latch.

            The back-off flag will prevent the successful execution of the fast path while a back-off is in progress. I believe that such an execution could otherwise result in acquiring an invalid LSN. Invalid LSNs would not be visible to other subsystems, because the back-off would always be completed before releasing log_sys.latch. Only the back-off flag would remain visible to some subsystems until it is reset. A sketch of this encoding follows below.

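            The list above can be condensed into a strictly illustrative C++ sketch. All names are invented stand-ins (wrap_mutex, base_lsn, BACKOFF), the bounds and counters are simplified, and this is not the eventual MariaDB code; it only demonstrates how the most significant bit can serve as the back-off flag while fetch_add() remains the only fast-path operation.

            #include <atomic>
            #include <cstddef>
            #include <cstdint>
            #include <mutex>

            using lsn_t = uint64_t;
            constexpr lsn_t BACKOFF = lsn_t{1} << 63; // MSB doubles as the flag

            std::atomic<lsn_t> lsn{0};
            std::atomic<uint64_t> write_to_buf{0};    // performance counter
            std::mutex wrap_mutex;                    // stand-in for log_sys.wrap_mutex
            lsn_t base_lsn = 0;                       // LSN of buf[0]; changed only
                                                      // under the exclusive latch
            constexpr size_t capacity = 1 << 20;

            static bool fits(lsn_t end)
            { return !(end & BACKOFF) && end - base_lsn <= capacity; }

            lsn_t reserve(size_t size)                // called under a shared latch
            {
              write_to_buf.fetch_add(1, std::memory_order_relaxed);   // step 1
              for (;;)
              {
                lsn_t end = lsn.fetch_add(size, std::memory_order_relaxed) + size;
                if (fits(end))
                  return end - size;                  // fast path: one fetch_add
                {
                  // steps 2.1 and 2.3-2.5; the waits counter (2.2) is omitted
                  std::lock_guard<std::mutex> g(wrap_mutex);
                  // the flag's changes are protected by wrap_mutex, so a relaxed
                  // load suffices; set it only if it is not set yet
                  lsn_t flag = (lsn.load(std::memory_order_relaxed) & BACKOFF)
                               ? 0 : BACKOFF;
                  // undo our increment and (possibly) invert the MSB in one
                  // atomic step: mod 2^64, subtracting 1<<63 flips the MSB as
                  // long as the payload bits do not borrow across it
                  lsn.fetch_sub(size + flag, std::memory_order_relaxed);
                }
                // step 2.6: wait for concurrent threads to finish compensating
                while (((lsn.load(std::memory_order_relaxed) & ~BACKOFF)
                        - base_lsn) > capacity)
                {}
                /* step 2.7: release log_sys.latch, call log_write_up_to() (which
                   clears BACKOFF under the exclusive latch), reacquire, retry */
              }
            }
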
            marko Marko Mäkelä added a comment -

            I believe that I got the basic idea to work for the memory-mapped log writes (mount -o dax or /dev/shm). I should have it working for the regular pwrite(2) based log writes soon, too. It turns out that we do not need any new field log_sys.buf_start_lsn; the removed field log_sys.buf_free was basically redundant. For memory-mapped writes, we can determine the offset relative to log_sys.first_lsn and log_sys.capacity(). For regular log file writes, we should be able to refer to log_sys.write_lsn.

            Once I have sorted this out, I will start implementing the back-off logic. That logic would not be sufficiently exercised by our regression test suite; as always, some additional stress testing will be needed.

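            The offset arithmetic just described might look roughly like this hedged sketch (invented helper names; among other things, the real code also has to account for the log file header, which is ignored here):

            #include <cstddef>
            #include <cstdint>

            using lsn_t = uint64_t;

            // memory-mapped circular log file (mount -o dax or /dev/shm):
            // the position follows from first_lsn and the file's capacity
            size_t mmap_offset(lsn_t lsn, lsn_t first_lsn, size_t capacity)
            {
              return size_t((lsn - first_lsn) % capacity);
            }

            // regular pwrite(2) based writes: everything at and above write_lsn
            // is still in the log buffer, whose first byte corresponds to it
            size_t buf_offset(lsn_t lsn, lsn_t write_lsn)
            {
              return size_t(lsn - write_lsn);
            }
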
            marko Marko Mäkelä added a comment -

            The occasionally broken crash recovery when using pwrite(2)-based log writes may be related to the not-yet-implemented back-off logic. Most of the easily failing tests would not fail under rr, but some do.

            marko Marko Mäkelä added a comment -

            I think I figured out how to implement the back-off logic. That would fix a few test failures. Unrelated to that, we have a few regression tests failing when using the pwrite(2)-based log writes.


            People

              marko Marko Mäkelä
              yaojiapeng peng
              Votes: 2
              Watchers: 7

