[MDEV-21923] LSN allocation is a bottleneck - Jira

Details

Type: Bug
Status: In Progress (View Workflow)
Priority: Critical
Resolution: Unresolved
Affects Version/s: 10.11, 11.4, 11.8
Fix Version/s: 10.11, 11.4, 11.8
Component/s: Storage Engine - InnoDB
Labels:
- Sprint
- performance

Description

MySQL WL#10310 optimizes the redo log (lock-free, concurrent writes to the log buffer). Does MariaDB plan to optimize its redo log?

Reference material - https://dev.mysql.com/blog-archive/mysql-8-0-new-lock-free-scalable-wal-design/

Attachments

Issue Links

relates to

MDEV-14425 Change the InnoDB redo log format to reduce write amplification

Closed

MDEV-14462 Confusing error message: ib_logfiles are too small for innodb_thread_concurrency=0

Closed

MDEV-27774 Reduce scalability bottlenecks in mtr_t::commit()

Closed

MDEV-33515 log_sys.lsn_lock causes excessive context switching

Closed

links to

MySQL 8.0: New Lock free, scalable WAL design

WL#10310: Redo log optimization: dedicated threads and concurrent log buffer


Activity


Vladislav Vaintroub added a comment - 2025-03-12 13:43 - edited

marko, yes, I do not blame NUMA specifically. I do recall soft-NUMA on AMDs (I think ever since Opterons), where memory accesses on a foreign node were cheap. I blame Intel NUMA, but especially Linux/libc: only because Linux cannot come up with a mutex/futex/anything that works well on those machines are software engineers elsewhere forced to create their own NIH spinning mutexes, and that does not work very well.

Marko Mäkelä added a comment - 2025-03-13 13:33

I think I may have figured out a solution. The fast path of log_t::append_prepare() would:

- atomically increment the performance counter log_sys.write_to_buf
  - atomically, because normally it is only protected by a shared log_sys.latch
  - I wish we did not have so many counters
- If lsn.fetch_add(size, std::memory_order_relaxed) would cause a buffer overflow or indicate that a back-off is in progress, then we would keep retrying after invoking back-off logic that would do the following:
1. acquire a new log_sys.wrap_mutex
2. increment log_sys.waits (another performance counter, which could actually be useful)
3. prepare to set the back-off flag in log_sys.lsn
  - MySQL 8.0 seems to reserve 1 bit for something, seemingly limiting LSN from 64 to 63 bits, which could break compatibility.
  - We could use some clever logic that inverts the most significant bit as part of the fetch_sub() below, so that all 64 bits will remain available for payload; this will work as long as innodb_log_file_size fits in 63 bits.
  - We could read the current value of the flag with log_sys.lsn.load(std::memory_order_relaxed) and declare that its changes are protected by log_sys.wrap_mutex.
4. log_sys.lsn.fetch_sub(size + flag /* see above */, std::memory_order_relaxed)
5. release the log_sys.wrap_mutex
6. poll log_sys.lsn until the overflow condition no longer holds (wait for other concurrent threads to complete their back-off)
7. temporarily release log_sys.latch and invoke log_write_up_to(), which would clear the flag while holding exclusive log_sys.latch

The back-off flag will prevent the successful execution of the fast path while back-off is in progress. I believe that such an execution could otherwise result in acquiring an invalid LSN. Invalid LSNs would not be visible to other subsystems, because the back-off would always be completed before releasing log_sys.latch. Only the back-off flag would remain visible to some subsystems until it is reset.
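
The steps above can be modelled in a few lines of standalone C++. This is only a sketch of the LSN arithmetic, not the actual InnoDB code: ToyLog, BACKOFF, buf_start_lsn, buf_size, overflow() and flush() are hypothetical names, the flag is assumed to live in the most significant bit of the LSN word, flush() stands in for log_write_up_to(), and the shared/exclusive log_sys.latch is only indicated in comments.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical constants for this toy model.
constexpr uint64_t BACKOFF = 1ULL << 63;  // assumed position of the flag
constexpr std::memory_order relaxed = std::memory_order_relaxed;

struct ToyLog {
  std::atomic<uint64_t> lsn{0};           // payload LSN plus BACKOFF flag
  std::atomic<uint64_t> write_to_buf{0};  // performance counter
  std::atomic<uint64_t> waits{0};         // performance counter
  std::mutex wrap_mutex;                  // protects changes of the flag
  uint64_t buf_start_lsn = 0;             // LSN of the first buffered byte
  const uint64_t buf_size;

  explicit ToyLog(uint64_t size) : buf_size(size) {}

  // True if a range ending at end_lsn would not fit in the buffer.
  bool overflow(uint64_t end_lsn) const {
    return (end_lsn & ~BACKOFF) - buf_start_lsn > buf_size;
  }

  // Stand-in for log_write_up_to(). In the proposal this would run under
  // exclusive log_sys.latch, so no allocation can be in flight.
  void flush() {
    const uint64_t l = lsn.load(relaxed) & ~BACKOFF;
    buf_start_lsn = l;      // pretend the buffer contents were written out
    lsn.store(l, relaxed);  // clear the back-off flag
  }

  // Reserves size bytes and returns the start LSN of the reserved range.
  // The caller is assumed to hold the shared latch (not modelled here).
  uint64_t append_prepare(uint64_t size) {
    assert(size <= buf_size);
    write_to_buf.fetch_add(1, relaxed);
    for (;;) {
      const uint64_t l = lsn.fetch_add(size, relaxed);
      if (!(l & BACKOFF) && !overflow(l + size))
        return l;  // fast path: the range [l, l + size) is ours
      {
        std::lock_guard<std::mutex> guard(wrap_mutex);   // step 1
        waits.fetch_add(1, relaxed);                     // step 2
        // Step 3: the flag only changes under wrap_mutex, so a relaxed
        // load tells us whether we still have to set it.
        const uint64_t flag = (lsn.load(relaxed) & BACKOFF) ? 0 : BACKOFF;
        // Step 4: undo our reservation. Subtracting 2^63 modulo 2^64
        // inverts the most significant bit, which sets the flag.
        lsn.fetch_sub(size + flag, relaxed);
      }                                                  // step 5
      while (overflow(lsn.load(relaxed))) {}             // step 6
      // Step 7: the real code would release the shared latch here and
      // take the exclusive latch inside log_write_up_to().
      flush();
    }
  }
};
```

With a 100-byte buffer, reserving 60, 30 and then 20 bytes takes the back-off path once, on the third call: the third reservation would end at LSN 110, so it is undone, the buffer is flushed at LSN 90, and the retry returns 90. The sketch has only been reasoned about single-threaded; as the comment notes, the real logic would need stress testing.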

Marko Mäkelä added a comment - 4 days ago

I believe that I got the basic idea to work for the memory-mapped log writes (mount -o dax or /dev/shm). I should have it working for the regular pwrite(2) based log writes soon too. It turns out that we do not need any new field log_sys.buf_start_lsn; the removed field log_sys.buf_free was basically redundant. For memory-mapped writes we can determine the offset relative to log_sys.first_lsn and log_sys.capacity(). For regular log file writes, we should be able to refer to log_sys.write_lsn.

Once I have sorted this out, I will start implementing the back-off logic. That logic is unlikely to be sufficiently exercised by our regression test suite; as always, some additional stress testing will be needed.
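
For the memory-mapped case, the offset calculation mentioned in this comment could look roughly as follows. This is a sketch under the assumption that the log is a circular file with a fixed-size header in front of the payload area; LogGeometry, file_size, header_size and offset_of() are hypothetical names, and only first_lsn and capacity() come from the comment.

```cpp
#include <cstdint>

// Hypothetical description of a circular log file.
struct LogGeometry {
  uint64_t first_lsn;    // LSN that maps to the first payload byte
  uint64_t file_size;    // total size of the log file in bytes
  uint64_t header_size;  // assumed fixed header preceding the payload

  // Number of payload bytes that fit in the file.
  uint64_t capacity() const { return file_size - header_size; }

  // File offset of the byte that carries the given LSN
  // (assumes lsn >= first_lsn).
  uint64_t offset_of(uint64_t lsn) const {
    return header_size + (lsn - first_lsn) % capacity();
  }
};
```

Because the offset follows from the LSN alone, no separate "free position" field has to be kept in sync with the LSN, which matches the observation that log_sys.buf_free was redundant.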

Marko Mäkelä added a comment - 3 days ago

The occasionally broken crash recovery when using pwrite(2) based log writes may be related to the not yet implemented back-off logic. Most of the easily failing tests would not fail under rr, but some do.

Marko Mäkelä added a comment - 2 days ago

I think I figured out how to implement the back-off logic. That would fix a few test failures. Unrelated to that, we have a few regression tests failing when using the pwrite(2) based log writes.
