I think that it should be feasible to remove log_sys.buf_free (which we currently update in the same atomic critical section with log_sys.lsn) and introduce a new field log_sys.buf_start_lsn, which would reflect the log sequence number corresponding to the start of log_sys.buf.
In this way, instead of having a "memory transaction" consisting of at least 4 atomic operations, we would have only one log_sys.lsn.fetch_add(size) in the "fast path". This should benefit all systems. I’d like to point out to wlad that it is increasingly common to have multiple DRAM buses in modern CPUs. For years there have been CPUs that feature multiple "chiplets" per package and up to 8 NUMA nodes per socket. I know of such implementations of x86-64 and ARMv8, and I would expect them to exist for other ISAs as well.
The new field log_sys.buf_start_lsn that I am proposing would only be updated when the log_sys.buf is being "shifted" or replaced during a write to a file. Such operations are covered by an exclusive log_sys.latch. In this way, we should be able to allocate LSN and log buffer for each mtr_t::commit() thread by invoking a rather simple log_sys.lsn.fetch_add(size) (80486 lock xadd) while holding log_sys.latch in shared or exclusive mode.
If the write position that we derive from log_sys.lsn and log_sys.buf_start_lsn would reside outside the bounds of log_sys.buf, then some back-off logic would release log_sys.latch, trigger a log write or checkpoint, reacquire the latch, and finally use the already allocated LSN and "shifted" buffer for the write. We may need one more field to ensure that log_sys.write_lsn will be advanced exactly once while any threads are inside such a back-off wait. That field would only be accessed under exclusive log_sys.latch or inside the back-off code path; it would not be part of the "fast path".
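To illustrate the idea, here is a minimal sketch of the proposed fast path. It is not the actual log_sys code: the struct log_sketch and its members reserve(), copy() and shift() are hypothetical names, the latch is only assumed in comments, and shift() merely stands in for the buffer "shift" that a real log write would perform.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical, simplified stand-in for log_sys (not the real InnoDB structure).
struct log_sketch
{
  std::atomic<std::uint64_t> lsn;   // next LSN to be allocated
  std::uint64_t buf_start_lsn;      // LSN that corresponds to buf[0]
  std::vector<unsigned char> buf;   // in-memory log buffer

  log_sketch(std::uint64_t start_lsn, std::size_t buf_size)
    : lsn(start_lsn), buf_start_lsn(start_lsn), buf(buf_size) {}

  // Fast path: a single atomic fetch_add reserves both the LSN range and
  // the buffer range. The caller is assumed to hold the latch in shared mode,
  // so buf_start_lsn cannot change underneath us.
  // Returns the start LSN; *fits tells whether the range lies within buf.
  std::uint64_t reserve(std::size_t size, bool *fits)
  {
    const std::uint64_t start = lsn.fetch_add(size, std::memory_order_relaxed);
    const std::uint64_t offset = start - buf_start_lsn;
    *fits = offset + size <= buf.size();
    return start;
  }

  // Copy a record into the range that was reserved at LSN 'start'.
  void copy(std::uint64_t start, const void *data, std::size_t size)
  {
    const std::uint64_t offset = start - buf_start_lsn;
    assert(offset + size <= buf.size());
    std::memcpy(&buf[offset], data, size);
  }

  // Stand-in for the buffer "shift" during a log write. The caller is
  // assumed to hold the latch in exclusive mode, and everything before
  // new_start_lsn is assumed to have been written to the file already.
  void shift(std::uint64_t new_start_lsn)
  {
    assert(new_start_lsn >= buf_start_lsn);
    buf_start_lsn = new_start_lsn;
  }
};
```

In this sketch, a caller that gets *fits == false would take the back-off path: release the latch, wait for a log write to call something like shift(), reacquire the latch, and then call copy() with the LSN that it already reserved. The extra bookkeeping that ensures log_sys.write_lsn is advanced exactly once is deliberately left out.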
I am not convinced that a lock-free algorithm is always better than one that uses mutexes. It could lead to lots of busy work (wasted CPU cycles in polling loops).
In MDEV-14425, we plan to modify the InnoDB redo log file format in a way that minimizes the work done while holding a mutex (encrypting data and computing checksums). The new file format would also be compatible with any physical block size, from the smallest write size of persistent memory (64 bytes?) to the optimal write size on an SSD (supposedly at least up to 4096 bytes).
MDEV-14462 mentions another idea to try: on mtr_t::commit(), do not write log, but pass the work to a dedicated log writer task. We would have to validate this idea by prototyping; I cannot guarantee that it would help much, especially after MDEV-14425 has been implemented.
MDEV-12353 and MDEV-21724 redefined the redo log record format in MariaDB 10.5.2. Because of the mutex contention that we have until MDEV-14425 has been implemented, even a small change to the redo log volume makes a large difference.