Details

    Description

      MySQL WL#10310 optimizes the redo log (lock-free, concurrent writes). Does MariaDB plan to optimize its redo log in a similar way?

      Reference material - https://dev.mysql.com/blog-archive/mysql-8-0-new-lock-free-scalable-wal-design/

          Activity

            wlad Vladislav Vaintroub added a comment - edited

            marko, yes, I do not blame NUMA specifically. I do recall soft-NUMA on AMDs (I think ever since Opterons), where memory accesses on a foreign node were cheap. I blame Intel NUMA, but especially Linux/libc: it is only because Linux cannot come up with a mutex/futex/anything that works well on those machines that software engineers elsewhere are forced to create their own NIH spinning mutexes, and those do not really work well either.

            marko Marko Mäkelä added a comment -

            I think I may have figured out a solution. The fast path of log_t::append_prepare() would:

            1. atomically increment the performance counter log_sys.write_to_buf
              • atomically, because normally it is only protected by a shared log_sys.latch
              • I wish we did not have so many counters
            2. If lsn.fetch_add(size, std::memory_order_relaxed) would cause a buffer overflow or indicate that a back-off is in progress, then we would keep retrying after invoking a back-off logic that would do the following:
              1. acquire a new log_sys.wrap_mutex
              2. increment log_sys.waits (another performance counter, which could actually be useful)
              3. prepare to set the back-off flag in log_sys.lsn
                • MySQL 8.0 seems to reserve 1 bit for something, seemingly limiting LSN from 64 to 63 bits, which could break compatibility.
                • We could use some clever logic that inverts the most significant bit as part of the fetch_sub() below, so that all 64 bits will remain available for payload; this will work as long as innodb_log_file_size fits in 63 bits.
                • We could read the current value of the flag with log_sys.lsn.load(std::memory_order_relaxed) and declare that its changes are protected by log_sys.wrap_mutex.
              4. log_sys.lsn.fetch_sub(size + flag /* see above */, std::memory_order_relaxed)
              5. release the log_sys.wrap_mutex
              6. poll log_sys.lsn until the overflow condition no longer holds (wait for other concurrent threads to complete their back-off)
              7. temporarily release log_sys.latch and invoke log_write_up_to(), which would clear the flag while holding exclusive log_sys.latch

            The back-off flag will prevent the successful execution of the fast path while a back-off is in progress. I believe that such an execution could otherwise result in acquiring an invalid LSN. Invalid LSNs would not be visible to other subsystems, because the back-off would always be completed before releasing log_sys.latch. Only the back-off flag would remain visible to some subsystems until it is reset.
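The reserve and undo arithmetic described above can be sketched as follows. This is a minimal illustration and not MariaDB code: `LogSketch`, `BACKOFF_FLAG`, `buf_size`, `back_off()` and `write_out()` are hypothetical names, `write_out()` merely stands in for log_write_up_to(), and log_sys.latch is not modeled at all. Because that latch is what makes the write-out wait for every concurrent back-off to finish, the sketch is only meant to be stepped through on a single thread.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

// Hypothetical flag bit: the most significant bit of the LSN word.
constexpr uint64_t BACKOFF_FLAG = uint64_t{1} << 63;

// Hypothetical stand-in for a few log_sys members; not MariaDB code.
struct LogSketch {
  std::atomic<uint64_t> lsn{0};           // next LSN to hand out, plus flag
  std::atomic<uint64_t> write_lsn{0};     // LSN of the first byte in the buffer
  std::atomic<uint64_t> write_to_buf{0};  // performance counter
  uint64_t waits = 0;                     // protected by wrap_mutex
  uint64_t buf_size;                      // assumed buffer capacity in bytes
  std::mutex wrap_mutex;

  explicit LogSketch(uint64_t size) : buf_size(size) {}

  // Undo our own reservation; the first thread to arrive also sets the flag.
  void back_off(uint64_t size) {
    std::lock_guard<std::mutex> guard(wrap_mutex);
    ++waits;
    // The flag is only changed under wrap_mutex, so a relaxed load suffices.
    const bool flagged =
        (lsn.load(std::memory_order_relaxed) & BACKOFF_FLAG) != 0;
    // Subtracting 2^63 modulo 2^64 inverts the top bit, so the undo and
    // the setting of the flag happen in one atomic step.
    lsn.fetch_sub(size + (flagged ? 0 : BACKOFF_FLAG),
                  std::memory_order_relaxed);
  }

  // Stand-in for log_write_up_to(): pretend that the whole buffer was
  // written out, then clear the flag. The real code would do this while
  // holding an exclusive log_sys.latch, which is omitted here.
  void write_out() {
    std::lock_guard<std::mutex> guard(wrap_mutex);
    const uint64_t end = lsn.load(std::memory_order_relaxed) & ~BACKOFF_FLAG;
    write_lsn.store(end, std::memory_order_relaxed);
    lsn.fetch_and(~BACKOFF_FLAG, std::memory_order_relaxed);
  }

  // Returns the start LSN of the reserved range. Requires size <= buf_size,
  // otherwise the loop could never succeed.
  uint64_t append_prepare(uint64_t size) {
    write_to_buf.fetch_add(1, std::memory_order_relaxed);
    for (;;) {
      const uint64_t start = lsn.fetch_add(size, std::memory_order_relaxed);
      const uint64_t used =
          start + size - write_lsn.load(std::memory_order_relaxed);
      if (!(start & BACKOFF_FLAG) && used <= buf_size)
        return start;  // fast path: the range fits and no back-off is pending
      back_off(size);
      write_out();
    }
  }
};
```

With a 100-byte buffer, two 60-byte reservations illustrate the slow path: the first returns LSN 0, the second overflows, backs off once, triggers the stand-in write-out and then succeeds at LSN 60. Folding the flag into the same fetch_sub() as the undo is what keeps a concurrent fast-path caller from observing a value that looks valid while other reservations are still being rolled back.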

            marko Marko Mäkelä added a comment -

            I believe that I got the basic idea to work for the memory-mapped log writes (mount -o dax or /dev/shm). I should have it working for the regular pwrite(2) based log writes soon too. It turns out that we do not need any new field log_sys.buf_start_lsn; the removed field log_sys.buf_free was basically redundant. For memory-mapped writes we can determine the offset relative to log_sys.first_lsn and log_sys.capacity(). For regular log file writes, we should be able to refer to log_sys.write_lsn.
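As a rough illustration of why a separate buffer-start field may be unnecessary, the two offset computations could look like the sketch below. The struct and function names are hypothetical, the memory-mapped log is assumed to be a plain circular area of `capacity` bytes whose first byte corresponds to `first_lsn`, and any file header offset or block alignment is ignored.

```cpp
#include <cstdint>

// Hypothetical parameters; not the actual log_sys members.
struct LogGeometry {
  uint64_t first_lsn;  // LSN that maps to offset 0 of the circular area
  uint64_t capacity;   // size of the circular area in bytes
};

// Memory-mapped case: position of an LSN inside the circular area.
inline uint64_t mmap_offset(const LogGeometry &g, uint64_t lsn) {
  return (lsn - g.first_lsn) % g.capacity;
}

// pwrite(2) case: position inside the in-memory buffer, assuming that the
// buffer starts at the LSN up to which the log has already been written.
inline uint64_t buf_offset(uint64_t write_lsn, uint64_t lsn) {
  return lsn - write_lsn;
}
```

In both cases the position is derived from the LSN itself plus state that already exists, which is consistent with log_sys.buf_free being redundant.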

            Once I have sorted this out, I will start implementing the back-off logic. That logic is unlikely to be sufficiently exercised by our regression test suite; as always, some additional stress testing will be needed.

            marko Marko Mäkelä added a comment -

            The occasionally broken crash recovery when using pwrite(2) based log writes may be related to the not yet implemented back-off logic. Most of the easily failing tests would not fail under rr, but some do.

            marko Marko Mäkelä added a comment -

            I think I figured out how to implement the back-off logic. That would fix a few test failures. Unrelated to that, we have a few regression tests failing when using the pwrite(2) based log writes.


            People

              marko Marko Mäkelä
              yaojiapeng peng
              Votes: 2
              Watchers: 7

