I am not convinced that a lock-free algorithm is always better than one that uses mutexes. It could lead to lots of busy work (wasted CPU cycles in polling loops).
In MDEV-14425, we plan to modify the InnoDB redo log file format in a way that minimizes the work done while holding a mutex (encrypting data and computing checksums). The new file format would also be compatible with any physical block size, from the smallest write size of persistent memory (64 bytes?) to the optimal write size on an SSD (supposedly at least up to 4096 bytes).
MDEV-14462 mentions another idea to try: on mtr_t::commit(), do not write the log, but pass the work to a dedicated log writer task. We would have to validate this idea by prototyping; I cannot guarantee that it would help much, especially after MDEV-14425 has been implemented.
MDEV-12353 and MDEV-21724 redefined the redo log record format in MariaDB 10.5.2. Because of the mutex contention that we have before MDEV-14425 has been implemented, even a small change to the redo log volume makes a large difference.
Marko Mäkelä
I think reducing pressure on log_sys.mutex would be fine, and parallel copy to the redo log buffer is not a bad thing. I do not think this is particularly hard, but you need to wait for all copies to complete prior to writing to the disk. My first guess is that it could be done with just a single atomic counter.
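A minimal sketch of the single-counter idea, not MariaDB code: `LogBuffer`, `append()`, `write_out()` and `demo_parallel_copy()` are hypothetical names, and a `std::string` stands in for the log file. Space is reserved under a short mutex, the memcpy runs outside of it, and the writer waits for the counter of copies in flight to reach zero. The wait is a simple yield loop for brevity; an event-based wait would be preferable in practice.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the redo log buffer; not the real log_sys.
struct LogBuffer {
  std::vector<char> buf = std::vector<char>(1 << 16);
  std::mutex mutex;                           // short-duration: reserves a range
  std::size_t reserved = 0;                   // protected by mutex
  std::atomic<unsigned> copies_in_flight{0};  // the single atomic counter
  std::string durable;                        // stands in for the log file

  // Reserve space under the mutex, then copy outside of it.
  void append(const char* data, std::size_t len) {
    std::size_t offset;
    {
      std::lock_guard<std::mutex> g(mutex);
      assert(reserved + len <= buf.size());
      offset = reserved;
      reserved += len;
      copies_in_flight.fetch_add(1);
    }
    std::memcpy(&buf[offset], data, len);  // may run concurrently with others
    copies_in_flight.fetch_sub(1);
  }

  // Block new reservations, wait for pending copies, then "write" the buffer.
  void write_out() {
    std::lock_guard<std::mutex> g(mutex);
    while (copies_in_flight.load() != 0)
      std::this_thread::yield();
    durable.assign(buf.data(), reserved);
  }
};

// Four threads append 1000 records of 8 bytes each.
std::size_t demo_parallel_copy() {
  LogBuffer log;
  std::vector<std::thread> threads;
  for (int t = 0; t < 4; t++)
    threads.emplace_back([&log] {
      for (int i = 0; i < 1000; i++) log.append("mtr_data", 8);
    });
  log.write_out();  // may run while copies are still in progress
  for (auto& th : threads) th.join();
  log.write_out();  // final write after all appends
  return log.durable.size();
}
```

Because the writer holds the mutex while it waits, no new range can be reserved during the write, so everything below `reserved` has been copied once the counter drops to zero.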
Vladislav Vaintroub
The durability of a mini-transaction would only be guaranteed when there are no ‘gaps’ in the stream. Say, if the log for mini-transactions 101,102,103,104,105 is ‘in flight’, but a part of the log segment that was reserved for mini-transaction 102 was not written to durable storage before the server was killed, then we could only recover everything up to the end of mini-transaction 101.
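The rule above can be sketched as a scan for the first gap. This is only an illustration: `Segment`, `last_recoverable()` and `demo_gap()` are hypothetical names, and the mini-transaction numbers are the ones from the example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record of one mini-transaction's reserved log segment.
struct Segment {
  int mtr_id;    // illustrative identifier, e.g. 101
  bool durable;  // whether the whole segment reached durable storage
};

// Returns the id of the last recoverable mini-transaction, or -1 if
// not even the first segment is durable.
// The segments must be ordered by their position in the log.
int last_recoverable(const std::vector<Segment>& segments) {
  int last = -1;
  for (const Segment& s : segments) {
    if (!s.durable) break;  // the first gap ends the recoverable prefix
    last = s.mtr_id;
  }
  return last;
}

// The scenario from the text: 102 was not fully written before the kill.
int demo_gap() {
  std::vector<Segment> in_flight = {
      {101, true}, {102, false}, {103, true}, {104, true}, {105, true}};
  return last_recoverable(in_flight);
}
```

Here `demo_gap()` returns 101: the segments for 103 to 105 reached storage, but they sit behind the gap left by 102 and cannot be applied.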
Implementing the MDEV-14425 format does not prevent allowing multiple concurrent writers. We might even allow it for PMEM a.k.a. NVDIMM a.k.a. DCPMM. In that case, log_sys.buf would point to the memory-mapped circular log file on a mount -o dax file system, and the ‘block size’ would likely be 64 bytes, corresponding to the cache line width. For PMEM, durability should be achieved by executing instructions that flush the CPU cache line(s) corresponding to the byte range.
For higher-latency storage, such as hard disks or SSD, supporting multiple concurrent writers could be beneficial even when using a more flexible file format. There are some noteworthy ideas in the MySQL 8.0 design, but I would prefer fewer ‘coordinator’ or ‘maintenance’ threads and generally something event-based instead of polling or timeouts.
Marko Mäkelä
Well, finding gaps is something that both recovery and log_write_up_to() should take care of.
If we tweak the current design, without multiple writers but with parallel memcpy to the redo log buffer, only log_write_up_to() (for both the "followers" that wait and the "leader" that writes and flushes) would have to take care of the gaps.
Vladislav Vaintroub
As for the MySQL 8.0 design, yes: multiplying threads that only do one thing, threads that only wake up other threads, and so on, is much, much too involved.
I think that for parallel memcpy to the redo log buffer, no extra threads are necessary.
I think one can do fine with (at most one) background task if we want less wait latency (e.g. if log_write_up_to() would start an async task to flush the redo buffer, and the current foreground user thread would check whether its LSN has completed later, just prior to writing the network packet). This can even be improved upon for the threadpool, where the foreground thread would not have to wait at all, but would instead take over work from other users.
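A rough sketch of the "start the flush asynchronously, check the LSN later" idea, under stated assumptions: `AsyncLog`, `flush_up_to()`, `wait_flushed()` and `demo_async_commit()` are hypothetical names, `std::async` stands in for the single background task, and no real I/O is performed. The wait is event-based (a condition variable) rather than polling.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <future>
#include <mutex>

// Hypothetical log state; only tracks how far the log has been flushed.
struct AsyncLog {
  std::mutex m;
  std::condition_variable cv;
  std::uint64_t flushed_lsn = 0;  // protected by m

  // Background task: pretend to write and flush the log up to lsn.
  void flush_up_to(std::uint64_t lsn) {
    // (real I/O would happen here, outside of the mutex)
    {
      std::lock_guard<std::mutex> g(m);
      if (lsn > flushed_lsn) flushed_lsn = lsn;
    }
    cv.notify_all();
  }

  // Foreground: blocks only if the flush has not completed yet.
  void wait_flushed(std::uint64_t lsn) {
    std::unique_lock<std::mutex> g(m);
    cv.wait(g, [&] { return flushed_lsn >= lsn; });
  }
};

std::uint64_t demo_async_commit() {
  AsyncLog log;
  const std::uint64_t commit_lsn = 4096;  // illustrative value
  // Start the flush in the background instead of waiting right away.
  auto task = std::async(std::launch::async,
                         [&] { log.flush_up_to(commit_lsn); });
  // ... the foreground thread would prepare the result set here ...
  log.wait_flushed(commit_lsn);  // just before writing the network packet
  task.get();
  std::lock_guard<std::mutex> g(log.m);
  return log.flushed_lsn;
}
```

If the flush finishes while the foreground thread is still busy, `wait_flushed()` returns immediately, which is where the latency saving would come from.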
Vladislav Vaintroub
Wasn’t this adequately addressed by MDEV-27774 and MDEV-33515 in MariaDB Server 10.11? We do need a short-duration mutex or spinlock for allocating an LSN range for writing the current mini-transaction. Multiple memcpy() from mtr_t::m_log to log_sys.buf can run concurrently while the threads are holding a shared log_sys.latch. An exclusive log_sys.latch will be held during some DDL operations as well as a log checkpoint, to ensure that everything has contiguously been written to the log_sys.buf.
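The locking scheme described above can be approximated with standard C++17 primitives. This is a simplified illustration, not the MariaDB implementation: `LogSys`, `lsn_mutex`, `append()`, `checkpoint_lsn()` and `demo_shared_latch()` are hypothetical names, `std::shared_mutex` stands in for log_sys.latch, and the buffer is not circular.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

struct LogSys {
  std::shared_mutex latch;  // stands in for log_sys.latch
  std::mutex lsn_mutex;     // short-duration: allocates an LSN range
  std::size_t lsn = 0;      // modified only under latch (shared) + lsn_mutex
  std::vector<char> buf = std::vector<char>(1 << 16);

  // Copy one mini-transaction's log into the buffer.
  void append(const char* data, std::size_t len) {
    std::shared_lock<std::shared_mutex> s(latch);  // many copiers at once
    std::size_t start;
    {
      std::lock_guard<std::mutex> g(lsn_mutex);
      start = lsn;
      lsn += len;
    }
    assert(start + len <= buf.size());
    std::memcpy(&buf[start], data, len);  // concurrent with other copies
  }

  // Checkpoint-style access: the exclusive latch guarantees that no copy
  // is in progress, so buf[0, lsn) has been written contiguously.
  std::size_t checkpoint_lsn() {
    std::unique_lock<std::shared_mutex> x(latch);
    return lsn;
  }
};

std::size_t demo_shared_latch() {
  LogSys log;
  std::vector<std::thread> threads;
  for (int t = 0; t < 4; t++)
    threads.emplace_back([&log] {
      for (int i = 0; i < 1000; i++) log.append("mtr_data", 8);
    });
  for (auto& th : threads) th.join();
  const std::size_t end = log.checkpoint_lsn();
  for (std::size_t i = 0; i < end; i += 8)
    assert(std::memcmp(&log.buf[i], "mtr_data", 8) == 0);  // no gaps
  return end;
}
```

The exclusive acquisition waits for every shared holder to finish its copy, which plays the same role as waiting for a counter of pending copies to reach zero.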
Marko Mäkelä