Details

    Description

      MySQL WL#10310 optimizes the redo log (lock-free, concurrent writes to the log buffer). Does MariaDB plan to similarly optimize the redo log?

      Reference material - https://dev.mysql.com/blog-archive/mysql-8-0-new-lock-free-scalable-wal-design/


          Activity

            I think that it should be feasible to remove log_sys.buf_free (which we currently update in the same atomic critical section with log_sys.lsn) and introduce a new field log_sys.buf_start_lsn, which would reflect the log sequence number corresponding to the start of log_sys.buf.

            In this way, instead of having a "memory transaction" consisting of at least 4 atomic operations, we would have only one log_sys.lsn.fetch_add(size) in the "fast path". This should benefit all systems. I’d like to point out to wlad that it is increasingly common to have multiple DRAM buses in modern CPUs. For years there have been CPUs that feature multiple "chiplets" per package and up to 8 NUMA nodes per socket. I know of such implementations of x86-64 and ARMv8, and I would expect them to exist for other ISAs as well.

            The new field log_sys.buf_start_lsn that I am proposing would only be updated when the log_sys.buf is being "shifted" or replaced during a write to a file. Such operations are covered by an exclusive log_sys.latch. In this way, we should be able to allocate LSN and log buffer for each mtr_t::commit() thread by invoking a rather simple log_sys.lsn.fetch_add(size) (80486 lock xadd) while holding log_sys.latch in shared or exclusive mode.

            If the write position that we derive from log_sys.lsn and log_sys.buf_start_lsn would reside outside the bounds of log_sys.buf, then some back-off logic would release log_sys.latch, trigger a log write or checkpoint, reacquire the latch, and finally use the already allocated LSN and "shifted" buffer for the write. We may need one more field to ensure that log_sys.write_lsn will be advanced exactly once while any threads are inside such a back-off wait. That field would only be accessed under exclusive log_sys.latch or inside the back-off code path; it would not be part of the "fast path".
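            The idea above can be sketched as follows (a simplified model with hypothetical names; the actual log_sys fields and sizes differ):

```cpp
#include <atomic>
#include <cstdint>
#include <cstddef>

// Simplified model of the proposed fast path: one fetch_add on the LSN
// both allocates log space and, combined with buf_start_lsn, yields the
// write position inside the buffer. buf_start_lsn only changes while the
// buffer is shifted under an exclusive latch (not modeled here).
struct log_model
{
  std::atomic<uint64_t> lsn{0};   // next LSN to be allocated
  uint64_t buf_start_lsn{0};      // LSN corresponding to &buf[0]
  static constexpr size_t BUF_SIZE= 4096;
  unsigned char buf[BUF_SIZE];

  // Return the buffer offset for `size` bytes, or SIZE_MAX when the
  // caller must enter the back-off path (buffer would overflow).
  size_t append_prepare(size_t size)
  {
    const uint64_t start= lsn.fetch_add(size, std::memory_order_relaxed);
    const size_t offset= size_t(start - buf_start_lsn);
    if (offset + size > BUF_SIZE)
      return SIZE_MAX; // back off: write out the log, shift the buffer
    return offset;
  }
};
```

            For example, successive callers receive disjoint regions of buf: the first append_prepare(100) returns offset 0, the next append_prepare(200) returns offset 100. Note that on overflow the LSN has still been incremented; undoing that is exactly what the back-off logic has to handle.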

            marko Marko Mäkelä added a comment -
            wlad Vladislav Vaintroub added a comment - - edited

            Now, the "Description" field is a bit out of place, especially since the title changed. To answer yaojiapeng: yes, there have been multiple improvements to writing and flushing the log, back in 10.5 already. Once the locking around the file write and file flush (especially the InnoDB group commit) was fixed in MDEV-21534, the bottleneck shifted elsewhere: first to copying into the redo log buffer in parallel, and then, when that was fixed, to reserving space in the redo log, a.k.a. LSN allocation, which is a tiny function.

            This matters on big machines, with NUMA and very many cores. In a benchmark that emphasizes TPS numbers over durability (innodb_flush_log_at_trx_commit=0, no doublewrite, and all other tricks to avoid fsync), this function is claimed to be a bottleneck, although the flamegraphs that would prove it are still missing.

            marko's current work is to get rid of the locks in this tiny function, in the common fast-path case.

            wlad Vladislav Vaintroub added a comment - - edited

            marko, yes, I do not blame NUMA specifically. I do recall soft NUMA on AMDs (I think ever since the Opterons), where memory accesses on a foreign node were cheap. I blame Intel NUMA, but especially Linux/libc: it is only because Linux can't come up with a mutex/futex/anything that works well on those machines that software engineers elsewhere are forced to create their NIH spinning mutexes, and those do not really work well.


            I think I may have figured out a solution. The fast path of log_t::append_prepare() would:

            1. atomically increment the performance counter log_sys.write_to_buf
              • atomically, because normally it is only protected by a shared log_sys.latch
              • I wish we did not have so many counters
            2. If lsn.fetch_add(size, std::memory_order_relaxed) would cause a buffer overflow or indicate that a back-off is in progress, then we would keep retrying after invoking back-off logic that would do the following:
              1. acquire a new log_sys.wrap_mutex
              2. increment log_sys.waits (another performance counter, which could actually be useful)
              3. prepare to set the back-off flag in log_sys.lsn
                • MySQL 8.0 seems to reserve 1 bit for something, seemingly limiting LSN from 64 to 63 bits, which could break compatibility.
                • We could use some clever logic that inverts the most significant bit as part of the fetch_sub() below, so that all 64 bits will remain available for payload; this will work as long as innodb_log_file_size fits in 63 bits.
                • We could read the current value of the flag with log_sys.lsn.load(std::memory_order_relaxed) and declare that its changes are protected by log_sys.wrap_mutex.
              4. log_sys.lsn.fetch_sub(size + flag /* see above */, std::memory_order_relaxed)
              5. release the log_sys.wrap_mutex
              6. poll log_sys.lsn until the overflow condition no longer holds (wait for other concurrent threads to complete their back-off)
              7. temporarily release log_sys.latch and invoke log_write_up_to(), which would clear the flag while holding exclusive log_sys.latch

            The back-off flag will prevent the successful execution of the fast path while back-off is in progress. I believe that such an execution could otherwise result in acquiring an invalid LSN. Invalid LSNs would not be visible to other subsystems , because the back-off would always be completed before releasing log_sys.latch. Only the back-off flag would remain visible to some subsystems until it is reset.
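            The bit-inversion trick mentioned in step 3 can be shown in isolation (a sketch, not the actual InnoDB code): since unsigned 64-bit arithmetic wraps modulo 2^64, subtracting 2^63 flips the most significant bit regardless of its current value, so a single fetch_sub can undo the failed reservation and toggle the back-off flag at the same time.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical back-off flag in the most significant bit of the LSN word.
constexpr uint64_t BACKOFF_FLAG= uint64_t{1} << 63;

// Undo a failed lsn.fetch_add(size) and toggle the back-off flag in one
// atomic step. Subtracting 2^63 modulo 2^64 inverts the MSB, so all
// 64 bits of the LSN remain usable as payload as long as real LSNs
// stay below 2^63. Returns the previous value.
uint64_t undo_and_toggle(std::atomic<uint64_t> &lsn, uint64_t size)
{
  return lsn.fetch_sub(size + BACKOFF_FLAG, std::memory_order_relaxed);
}
```

            For instance, if lsn is 1000 and a thread's fetch_add(50) overflows the buffer, undo_and_toggle(lsn, 50) leaves the value at 1000 | BACKOFF_FLAG; toggling again with size 0 (as the flag-clearing step under exclusive latch would) restores plain 1000.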

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä added a comment -

            I believe that I got the basic idea to work for the memory-mapped log writes (mount -o dax or /dev/shm). I should have it working for the regular pwrite(2) based log writes soon too. It turns out that we do not need any new field log_sys.buf_start_lsn; the removed field log_sys.buf_free was basically redundant. For memory-mapped writes we can determine the offset relative to log_sys.first_lsn and log_sys.capacity(). For regular log file writes, we should be able to refer to log_sys.write_lsn.
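            For the memory-mapped case, the offset derivation might look roughly like this (a sketch; log_sys.first_lsn and log_sys.capacity() are named above, but the exact arithmetic here is my assumption and ignores details such as the file header):

```cpp
#include <cstdint>

// Hypothetical sketch: with a memory-mapped circular log, the write
// offset can be derived from the LSN alone, given the LSN at which the
// file began (first_lsn) and the usable capacity. No separate buf_free
// field is needed.
uint64_t mmap_offset(uint64_t lsn, uint64_t first_lsn, uint64_t capacity)
{
  return (lsn - first_lsn) % capacity;
}
```

            For example, with first_lsn = 100 and capacity = 1000, LSN 150 maps to offset 50, and so does LSN 1150 after one wrap-around.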

            Once I have sorted this out, I will start implementing the back-off logic. That logic will likely not be sufficiently exercised by our regression test suite; as always, some additional stress testing will be needed.


            People

              marko Marko Mäkelä
              yaojiapeng peng