Details
- Type: Bug
- Status: In Review
- Priority: Critical
- Resolution: Unresolved
- Affects Version/s: 10.11, 11.4, 11.8
- Fix Version/s: Server 12.1 dev sprint
Description
MySQL WL#10310 optimizes the redo log (removing the lock so that writes to the log buffer can proceed concurrently). Does MariaDB plan to optimize the redo log in a similar way?
Reference material - https://dev.mysql.com/blog-archive/mysql-8-0-new-lock-free-scalable-wal-design/
Issue Links
- relates to
  - MDEV-14425 Change the InnoDB redo log format to reduce write amplification (Closed)
  - MDEV-14462 Confusing error message: ib_logfiles are too small for innodb_thread_concurrency=0 (Closed)
  - MDEV-27774 Reduce scalability bottlenecks in mtr_t::commit() (Closed)
  - MDEV-33515 log_sys.lsn_lock causes excessive context switching (Closed)
Activity
I think reducing pressure on log_sys.mutex would be fine, and parallel copy to the redo log buffer is not a bad thing. I do not think this is particularly hard, but you need to wait for all copies to complete prior to writing to the disk. My first guess is that it could be done with just a single atomic counter.
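A minimal self-contained sketch of that guess (not MariaDB code; the names are made up for illustration, and default seq_cst ordering is used to keep the reasoning simple):
{code:cpp}
#include <atomic>
#include <cstdint>
#include <cstring>

static unsigned char log_buf[1 << 20];          // illustration only
static std::atomic<uint64_t> reserved{0};       // next free byte in log_buf
static std::atomic<int> copies_in_progress{0};  // the single atomic counter

// Any number of threads may append concurrently.
void append(const void *rec, size_t len)
{
  copies_in_progress.fetch_add(1);          // announce before reserving
  uint64_t start = reserved.fetch_add(len); // reserve a byte range
  std::memcpy(log_buf + start, rec, len);   // copy in parallel with others
  copies_in_progress.fetch_sub(1);
}

// The writer snapshots the end of the reserved area, then waits until no
// copy is in flight. Every copy into [0, end) incremented the counter
// before reserving its range, so once the counter reads 0 after the
// snapshot, the prefix [0, end) is gap-free and can be written out.
uint64_t wait_for_copies()
{
  uint64_t end = reserved.load();
  while (copies_in_progress.load())
    ; // real code would wait on an event instead of spinning
  return end;
}
{code}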
The durability of a mini-transaction would only be guaranteed when there are no ‘gaps’ in the stream. Say, if the log for mini-transactions 101, 102, 103, 104, 105 is ‘in flight’, but a part of the log segment that was reserved for mini-transaction 102 was not written to durable storage before the server was killed, then we could only recover everything up to the end of mini-transaction 101.
Implementing the MDEV-14425 format does not prevent allowing multiple concurrent writers. We might even allow it for PMEM a.k.a. NVDIMM a.k.a. DCPMM. In that case, log_sys.buf would point to the memory-mapped circular log file on a mount -o dax file system, and the ‘block size’ would likely be 64 bytes, corresponding to the cache line width. For PMEM, durability should be achieved by executing instructions that flush the CPU cache line(s) corresponding to the byte range.
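For illustration, flushing a byte range on x86-64 could look like the following sketch (assuming CLWB support and compiling with -mclwb; this shows one possible implementation, not MariaDB's actual code):
{code:cpp}
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Flush the CPU cache lines covering [addr, addr+len) to persistence.
void persist(const void *addr, size_t len)
{
  const uintptr_t line = 64;  // cache line width = PMEM "block size"
  uintptr_t p = uintptr_t(addr) & ~(line - 1);
  const uintptr_t end = uintptr_t(addr) + len;
  for (; p < end; p += line)
    _mm_clwb(reinterpret_cast<void*>(p)); // write back without evicting
  _mm_sfence();                           // order flushes before later stores
}
{code}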
For higher-latency storage, such as hard disks or SSD, supporting multiple concurrent writers could be beneficial even when using a more flexible file format. There are some noteworthy ideas in the MySQL 8.0 design, but I would prefer fewer ‘coordinator’ or ‘maintenance’ threads and generally something event-based instead of polling or timeouts.
Well, finding gaps is something that both recovery and log_write_up_to() should take care of.
If we tweak the current design (without multiple writers, but with parallel memcpy to the redo log buffer), only log_write_up_to() (for both the "followers" that wait and the "leader" who writes and flushes) would have to take care of the gaps.
As for the MySQL 8.0 design, yes, multiplying threads that only do one thing, threads that only wake up other threads, and so on is much, much too involved.
I think that for parallel memcpy to the redo log buffer, no extra threads are necessary.
I think one can do fine with at most one background task if we want less wait latency (e.g., log_write_up_to() would start an asynchronous task to flush the redo log buffer, and the current foreground user thread would check whether its LSN has completed later, just prior to writing the network packet). This can even be improved upon for the threadpool, where the foreground thread would not have to wait at all, but could instead take over work from other users.
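A minimal sketch of that idea (hypothetical names; the stub stands in for the real log_write_up_to(), and std::async stands in for a threadpool task):
{code:cpp}
#include <future>
#include <cstdint>

using lsn_t = uint64_t;

// Stand-in for the real function: flushes the redo log up to the given LSN.
void log_write_up_to(lsn_t lsn, bool durable) { (void)lsn; (void)durable; }

// On commit: start the flush asynchronously instead of waiting inline.
std::future<void> commit_async(lsn_t commit_lsn)
{
  return std::async(std::launch::async,
                    [commit_lsn] { log_write_up_to(commit_lsn, true); });
}

void example()
{
  std::future<void> flushed = commit_async(42);
  /* ... do other useful work in the meantime ... */
  flushed.wait(); // only block just before sending the OK packet
}
{code}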
Wasn’t this adequately addressed by MDEV-27774 and MDEV-33515 in MariaDB Server 10.11? We do need a short-duration mutex or spinlock for allocating an LSN range for writing the current mini-transaction. Multiple memcpy() from mtr_t::m_log to log_sys.buf can run concurrently while the threads are holding a shared log_sys.latch. An exclusive log_sys.latch will be held during some DDL operations as well as a log checkpoint, to ensure that everything has been written contiguously to log_sys.buf.
Steve Shaw has pointed out that a severe bottleneck remains in this area. MDEV-33515 would basically just be a work-around for some CPU implementations.
The MariaDB 10.11 function log_t::append_prepare() is close to the MySQL 8.0 function log_buffer_reserve(). In MariaDB, log_sys.lock_lsn() covers several data fields. In MySQL 8.0, there is a fast path that basically just advances log_sys.lsn by the requested length with a std::atomic::fetch_add() operation. I think that it should be possible to do something similar, by refactoring mtr_t::finish_write(size_t) and its callers in such a way that when the fast path (covered by a shared log_sys.latch) cannot be used due to running out of buffer capacity, we will acquire an exclusive log_sys.latch and repeat the procedure.
As always, "severe" needs flamegraphs, and they are best obtained from multiple machines: some with NUMA, some without. We have had severe regressions back in the day that were regressions only on mainstream hardware or a different OS, while being improvements on Steve's hardware or OS.
The hardware recently tested was Intel Xeon 6787P (2 × 86 cores), both single and dual socket, with Ubuntu 22.
For a single socket, the top functions when running a HammerDB TPROC-C workload were:
{noformat}
54.10% [kernel]   [k] native_queued_spin_lock_slowpath.part.0
 6.96% mariadbd   [.] ssux_lock_impl<true>::rd_wait
{noformat}
Adding the parameter innodb_log_spin_wait_delay=50 improved performance by 12% and changed the top functions as follows:
{noformat}
34.87% mariadbd   [.] lsn_delay
17.94% [kernel]   [k] native_queued_spin_lock_slowpath.part.0
{noformat}
This would indicate where the workload is spending a lot of its time. Historically, OLTP testing on recent generations of Intel systems with MariaDB has shown native_queued_spin_lock_slowpath.part.0 as the top function, and previous investigations have suggested that it is related to the LSN mutex.
54% is not great, and 34% is not great either. Adding spins is usually bad on mainstream hardware.
How much do those functions take on non-NUMA?
Also, a flamegraph usually gives a better overview, with the full call stack of the bottleneck functions, and gives a better idea of how to improve locking, ideally without adding spins for every possible problem.
I think Intel's VTune was pretty good at showing bottlenecks, but standard perf-based flamegraphs are already good enough.
I think that it should be feasible to remove log_sys.buf_free (which we currently update in the same atomic critical section as log_sys.lsn) and introduce a new field log_sys.buf_start_lsn, which would reflect the log sequence number corresponding to the start of log_sys.buf.
In this way, instead of having a "memory transaction" consisting of at least 4 atomic operations, we would have only one log_sys.lsn.fetch_add(size) in the "fast path". This should benefit all systems. I’d like to point out to wlad that it is increasingly common for modern CPUs to have multiple DRAM buses. For years there have been CPUs that feature multiple "chiplets" per package and up to 8 NUMA nodes per socket. I know of such implementations of x86-64 and ARMv8, and I would expect them to exist for other ISAs as well.
The new field log_sys.buf_start_lsn that I am proposing would only be updated when the log_sys.buf is being "shifted" or replaced during a write to a file. Such operations are covered by an exclusive log_sys.latch. In this way, we should be able to allocate LSN and log buffer for each mtr_t::commit() thread by invoking a rather simple log_sys.lsn.fetch_add(size) (80486 lock xadd) while holding log_sys.latch in shared or exclusive mode.
If the write position that we derive from log_sys.lsn and log_sys.buf_start_lsn would reside outside the bounds of log_sys.buf, then some back-off logic would release log_sys.latch, trigger a log write or checkpoint, reacquire the latch, and finally use the already allocated LSN and "shifted" buffer for the write. We may need one more field to ensure that log_sys.write_lsn will be advanced exactly once while any threads are inside such a back-off wait. That field would only be accessed under exclusive log_sys.latch or inside the back-off code path; it would not be part of the "fast path".
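A self-contained sketch of that fast path (field names as proposed above; this illustrates the idea, not the actual patch):
{code:cpp}
#include <atomic>
#include <cstddef>
#include <cstdint>

using lsn_t = uint64_t;
using byte = unsigned char;

struct log_sketch
{
  std::atomic<lsn_t> lsn;  // current log sequence number
  lsn_t buf_start_lsn;     // LSN of buf[0]; only changes under exclusive latch
  byte *buf;               // log buffer
  size_t buf_size;

  // Fast path, called while holding shared (or exclusive) log_sys.latch:
  // a single fetch_add allocates both the LSN range and the buffer range.
  byte *append_prepare(size_t size, lsn_t &start)
  {
    start = lsn.fetch_add(size, std::memory_order_relaxed);
    size_t pos = size_t(start - buf_start_lsn);
    if (pos + size > buf_size)
      return nullptr; // back off: release latch, write or checkpoint, retry
    return buf + pos;
  }
};
{code}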
Now, the "Description" field is a bit out of place, especially since the title changed. To answer yaojiapeng: yes, there were multiple improvements to writing and flushing the log back in 10.5 already. Once the locking around the file write and file flush (especially InnoDB group commit) was fixed in MDEV-21534, other bottlenecks arose: first copying to the redo log buffer in parallel, and then, when that was fixed, reserving space in the redo log, a.k.a. LSN allocation, which is a tiny function.
This happens on big machines with NUMA and many, many cores, in a benchmark that emphasizes TPS numbers over durability (innodb_flush_log_at_trx_commit=0, no doublewrite, and all the other tricks to avoid fsync). This function is claimed to be a bottleneck, although the flamegraphs that would prove it are still missing.
marko's current work is to get rid of the locks in this tiny function in the common fast-path case.
marko, yes, I do not blame NUMA specifically. I do recall soft NUMA on AMD (I think ever since the Opterons), where memory accesses on a foreign node were cheap. I blame Intel NUMA, but especially Linux/libc: only because Linux cannot come up with a mutex/futex/anything that works well on those machines, software engineers elsewhere are forced to create their NIH spinning mutexes, and it does not really work well.
I think I may have figured out a solution. The fast path of log_t::append_prepare() would:
- atomically increment the performance counter log_sys.write_to_buf
  - atomically, because normally it is only protected by a shared log_sys.latch
  - I wish we did not have so many counters
- If lsn.fetch_add(size, std::memory_order_relaxed) would cause a buffer overflow or indicate that a back-off is in progress, then we would keep retrying after invoking back-off logic (sketched below) that would do the following:
  - acquire a new log_sys.wrap_mutex
  - increment log_sys.waits (another performance counter, which could actually be useful)
  - prepare to set the back-off flag in log_sys.lsn
    - MySQL 8.0 seems to reserve 1 bit for something, seemingly limiting the LSN from 64 to 63 bits, which could break compatibility.
    - We could use some clever logic that inverts the most significant bit as part of the fetch_sub() below, so that all 64 bits will remain available for payload; this will work as long as innodb_log_file_size fits in 63 bits.
    - We could read the current value of the flag with log_sys.lsn.load(std::memory_order_relaxed) and declare that its changes are protected by log_sys.wrap_mutex.
  - log_sys.lsn.fetch_sub(size + flag /* see above */, std::memory_order_relaxed)
  - release the log_sys.wrap_mutex
  - poll log_sys.lsn until the overflow condition no longer holds (wait for other concurrent threads to complete their back-off)
  - temporarily release log_sys.latch and invoke log_write_up_to(), which would clear the flag while holding exclusive log_sys.latch
The back-off flag will prevent the successful execution of the fast path while a back-off is in progress. I believe that such an execution could otherwise result in acquiring an invalid LSN. Invalid LSNs would not be visible to other subsystems, because the back-off would always be completed before releasing log_sys.latch. Only the back-off flag would remain visible to some subsystems until it is reset.
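A hedged sketch of the back-off step from the list above (the names wrap_mutex and waits follow the list; the bit manipulation illustrates the MSB trick and is not the final code):
{code:cpp}
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <mutex>

using lsn_t = uint64_t;
constexpr lsn_t BACKOFF = 1ULL << 63; // adding or subtracting 2^63 (mod 2^64)
                                      // flips exactly the most significant bit

std::atomic<lsn_t> lsn;   // stand-in for log_sys.lsn
std::mutex wrap_mutex;    // stand-in for log_sys.wrap_mutex
uint64_t waits;           // stand-in for log_sys.waits

// Called after lsn.fetch_add(size) detected overflow or the back-off flag.
void append_prepare_backoff(size_t size)
{
  wrap_mutex.lock();
  ++waits;
  // Undo our reservation, and set the flag only if it is not already set.
  // The flag's changes are protected by wrap_mutex, so a relaxed load is OK.
  lsn_t flag = (lsn.load(std::memory_order_relaxed) & BACKOFF) ? 0 : BACKOFF;
  lsn.fetch_sub(size + flag, std::memory_order_relaxed);
  wrap_mutex.unlock();
  // Next: poll lsn until the overflow condition clears, then temporarily
  // release log_sys.latch and call log_write_up_to(), which clears BACKOFF
  // while holding exclusive log_sys.latch.
}
{code}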
I believe that I got the basic idea to work for the memory-mapped log writes (mount -o dax or /dev/shm). I should have it working for the regular pwrite(2) based log writes soon too. It turns out that we do not need any new field log_sys.buf_start_lsn; the removed field log_sys.buf_free was basically redundant. For memory-mapped writes we can determine the offset relative to log_sys.first_lsn and log_sys.capacity(). For regular log file writes, we should be able to refer to log_sys.write_lsn.
Once I have sorted this out, I will start implementing the back-off logic. That logic would not be sufficiently exercised by our regression test suite; as always, some additional stress testing will be needed.
The occasionally broken crash recovery when using pwrite(2) based log writes may be related to the not yet implemented back-off logic. Most of the easily failing tests would not fail under rr, but some do.
I think I figured out how to implement the back-off logic. That would fix a few test failures. Unrelated to that, we have a few regression tests failing when using the pwrite(2) based log writes.
It turns out that we really must have log_sys.buf_start_lsn that is updated in sync with log_sys.buf and log_sys.write_lsn around pwrite(2) calls. The test innodb.undo_truncate still occasionally fails. That code path is using exclusive log_sys.latch when writing log records. Also, some code will need to be adjusted for the fact that the most significant bit of log_sys.lsn may be flipped. A possible fix might be to replace the atomic log_sys.lsn with a 32-bit atomic field consisting of a flag and a 31-bit displacement relative to log_sys.buf_start_lsn. Reserving the most significant bit for a flag would work, because log_sys.buf_start_lsn may lag behind the current LSN by at most log_sys.buf_size_max (0x7ffff000) bytes.
I realized that we actually need an extra bit to avoid any wrap-around or overflow when we are close to the maximum log_sys.buf_size. It turns out that we can maintain three related things in a single std::atomic<uint64_t> field:
- a 33-bit counter relative to log_sys.buf_start_lsn (up to 0x7ffff000, plus 1 bit safety against overflow)
- a back-off indicator flag, which will be cleared whenever log_sys.buf_start_lsn is updated
- 30 bits of buffered updates to log_sys.write_to_buf, which will be applied and cleared whenever log_sys.buf_start_lsn is updated
In this way, the fast path of log_t::append_prepare() will comprise one std::atomic::fetch_add() and some addition and subtraction. This also allows a straightforward implementation of log_sys.get_lsn(), which must from now on be protected by an exclusive log_sys.latch. That function can mask the back-off indicator.
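A self-contained sketch of that packed field (the exact bit positions are my assumption for illustration; per the description above, the flag is cleared and the buffered counter is applied whenever buf_start_lsn is updated):
{code:cpp}
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed layout: bit 63 = back-off flag, bits 62..33 = buffered increments
// of write_to_buf, bits 32..0 = displacement relative to buf_start_lsn.
constexpr uint64_t LSN_BITS   = 33;
constexpr uint64_t LSN_MASK   = (1ULL << LSN_BITS) - 1;
constexpr uint64_t COUNT_ONE  = 1ULL << LSN_BITS;
constexpr uint64_t COUNT_MASK = ((1ULL << 30) - 1) << LSN_BITS;
constexpr uint64_t BACKOFF    = 1ULL << 63;

std::atomic<uint64_t> packed; // stand-in for the combined log_sys field
uint64_t buf_start_lsn;       // only updated under exclusive log_sys.latch
uint64_t write_to_buf;        // performance counter, updated likewise

// Fast path: one fetch_add advances the displacement and buffers one
// increment of write_to_buf at the same time.
uint64_t append_prepare_fast(size_t size)
{
  return packed.fetch_add(size + COUNT_ONE, std::memory_order_relaxed);
  // The caller checks BACKOFF and the buffer-overflow condition in the result.
}

// When buf_start_lsn advances (exclusive latch): apply the buffered counter
// updates, rebase the displacement, and clear the back-off flag.
void shift(uint64_t new_start_lsn)
{
  uint64_t old = packed.load(std::memory_order_relaxed);
  write_to_buf += (old & COUNT_MASK) >> LSN_BITS;
  packed.store(buf_start_lsn + (old & LSN_MASK) - new_start_lsn,
               std::memory_order_relaxed);
  buf_start_lsn = new_start_lsn;
}
{code}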
This revision allows all tests in the regression suite to pass, with both values of cmake -DWITH_INNODB_PMEM.
I ended up renaming log_sys.buf_start_lsn to log_sys.base_lsn. Furthermore, I figured out that at the start of the back-off code path log_t::append_prepare_wait() we can avoid the use of an additional mutex and simply use atomic read-modify-write instructions to set the back-off flag and subtract the overshoot. These are loop-free instructions on x86: 80386 lock bts and 80486 lock xadd. Any reasonably recent compiler (which includes GCC 7) will translate a single-bit std::atomic::fetch_or(1<<x) & 1<<x into the former. On many other ISAs, atomic read-modify-write operations will translate into load-acquire/store-conditional loops. Updating one atomic field should still be much more efficient than updating 3 or 4 in sync.
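For example, the idiom in question (illustrative only):
{code:cpp}
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> word;

// When the fetch_or mask is a single bit and only that bit of the result is
// tested, GCC 7+ and Clang emit "lock bts" on x86 instead of a CAS loop.
bool test_and_set_backoff()
{
  constexpr uint64_t FLAG = 1ULL << 63;
  return word.fetch_or(FLAG, std::memory_order_relaxed) & FLAG;
}
{code}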
Some calculations in buf_flush_page_cleaner() really need to know the current log sequence number. I added log_sys.get_lsn_approx() for this purpose. It relies on Release-Acquire ordering to guarantee that the returned value is never less than what log_sys.get_lsn() would have returned. The approximation may be ahead of log_sys.get_lsn() if a back-off is in progress. For determining how many pages should be written out, this approximation should be fine.
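A sketch of how such an approximation could work with the packed field sketched earlier (my illustration of the ordering argument, not the exact implementation):
{code:cpp}
#include <atomic>
#include <cstdint>

constexpr uint64_t LSN_MASK = (1ULL << 33) - 1; // displacement bits, as above

std::atomic<uint64_t> base_lsn; // increased before the displacement shrinks
std::atomic<uint64_t> packed;   // carries the displacement from base_lsn

// Writer side (buffer shift, under exclusive latch), in this order:
//   base_lsn.store(base_lsn.load() + delta, std::memory_order_relaxed);
//   packed.fetch_sub(delta, std::memory_order_release); // publishes new base
uint64_t get_lsn_approx()
{
  // The acquire load pairs with the release fetch_sub above: if we observe
  // the reduced displacement, we also observe the increased base_lsn, so the
  // sum never undershoots the exact LSN. Observing a stale (larger)
  // displacement can only make the result overshoot, which is acceptable
  // for deciding how many pages to write out.
  uint64_t disp = packed.load(std::memory_order_acquire) & LSN_MASK;
  return base_lsn.load(std::memory_order_relaxed) + disp;
}
{code}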
I am not convinced that a lock-free algorithm is always better than one that uses mutexes. It could lead to lots of busy work (wasted CPU cycles in polling loops).
In MDEV-14425, we plan to modify the InnoDB redo log file format in a way that minimizes the work done while holding a mutex (encrypting data and computing checksums). The new file format would also be compatible with any physical block size, anything from the smallest write size of persistent memory (64 bytes?) up to the optimal write size on an SSD (supposedly at least 4096 bytes).
MDEV-14462 mentions another idea to try: on mtr_t::commit(), do not write log, but pass the work to a dedicated log writer task. We would have to validate this idea by prototyping; I cannot guarantee that it would help much, especially after MDEV-14425 has been implemented.
MDEV-12353 and MDEV-21724 redefined the redo log record format in MariaDB 10.5.2. Because of the mutex contention that we have before MDEV-14425 has been implemented, even a small change to the redo log volume makes a large difference.