steve.shaw@intel.com reports that write-intensive workloads on a NUMA system end up spending a lot of time in the Linux kernel function native_queued_spin_lock_slowpath.part.0. He has provided a patch that adds a user-space spinlock around the calls to mtr_t::do_write(), which significantly improves throughput at larger numbers of concurrent connections in his test environment.
As far as I can tell, that patch would allow only one mtr_t::do_write() call to proceed at a time, and thus make waits on log_sys.latch extremely unlikely. But that would also seem to undo part of what MDEV-27774 achieved.
If I understood it correctly, the idea would be better implemented at a slightly lower level, to allow maximum concurrency:
  if (UNIV_UNLIKELY(m_user_space && !m_user_space->max_lsn &&
                    !is_predefined_tablespace(m_user_space->id)))
The to-be-written member function rd_lock_spin() would avoid invoking futex_wait(), and instead keep invoking MY_RELAX_CPU() in the spin loop.
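A minimal C++ sketch of what such a spinning shared-lock acquisition could look like (the class and member names here are hypothetical, not the actual log_sys.latch implementation):

```cpp
#include <atomic>
#include <thread>

// Hypothetical sketch of an rd_lock_spin() member function: acquire a
// shared latch by spinning in user space instead of calling futex_wait().
class shared_spin_latch
{
  // >= 0: number of shared (read) holders; -1: exclusively locked
  std::atomic<int> word{0};
public:
  void rd_lock_spin()
  {
    for (;;)
    {
      int r = word.load(std::memory_order_relaxed);
      if (r >= 0 &&
          word.compare_exchange_weak(r, r + 1, std::memory_order_acquire,
                                     std::memory_order_relaxed))
        return;
      // Stand-in for MY_RELAX_CPU(): a pause hint inside the spin loop;
      // crucially, no system call such as futex_wait() is made here.
      std::this_thread::yield();
    }
  }
  void rd_unlock() { word.fetch_sub(1, std::memory_order_release); }
  bool wr_lock_try()
  {
    int expected = 0;
    return word.compare_exchange_strong(expected, -1,
                                        std::memory_order_acquire);
  }
  void wr_unlock() { word.store(0, std::memory_order_release); }
};
```

Exclusive acquisition is shown only as a try-lock here; a real latch would also need a writer wait path.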
An exclusive log_sys.latch will be acquired rarely and held for a rather short time: during DDL operations, undo tablespace truncation, and around log checkpoints.
Some experimentation will be needed to find something that scales well across the board (from embedded systems to high-end servers).
Marko Mäkelä added a comment:

I created two patches. On my Haswell-microarchitecture dual Intel Xeon E5-2630 v4, both result in significantly worse throughput with 256-thread Sysbench oltp_update_index (I actually intended to test oltp_update_non_index) than the 10.11 baseline. With the baseline, the Linux kernel function native_queued_spin_lock_slowpath is the busiest one; with either fix, it ends up in second place, behind the new function lsn_delay().
The first variant more closely resembles what steve.shaw@intel.com did. It uses std::atomic<bool> for the lock word; the lock acquisition would be an xchg instruction.
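A minimal sketch of such a test-and-set spinlock on a std::atomic<bool> lock word (illustrative only, not the actual patch; on x86-64, exchange() on this type compiles to an xchg instruction):

```cpp
#include <atomic>

// Sketch of the first variant: a plain test-and-set spinlock on a
// std::atomic<bool> lock word.
class lsn_spinlock
{
  std::atomic<bool> locked{false};
public:
  void lock()
  {
    // exchange() returns the previous value; we own the lock once it
    // returns false. On x86-64 this is the xchg instruction.
    while (locked.exchange(true, std::memory_order_acquire))
      /* spin; real code would issue a pause hint here */;
  }
  bool try_lock()
  { return !locked.exchange(true, std::memory_order_acquire); }
  void unlock() { locked.store(false, std::memory_order_release); }
};
```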
The second variant merges log_sys.lsn_lock into the most significant bit of log_sys.buf_free. Lock acquisition is a loop around lock cmpxchg. It yields better throughput on my system than the first variant. One further thing that could be tried is a combination of lock bts and a separate mov to load the log_sys.buf_free value. On ARMv8, POWER, RISC-V, or another modern ISA, we could probably use a simple std::atomic::fetch_or().
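A hedged sketch of the second variant's idea: one atomic word whose most significant bit serves as the lock, with the remaining bits holding the buf_free value (class name and layout are illustrative):

```cpp
#include <atomic>
#include <cstdint>

// Sketch of the second variant: the most significant bit of a single word
// doubles as a spinlock; the low bits hold the buf_free value.
class buf_free_with_lock
{
  static constexpr std::uint64_t LOCK_BIT = std::uint64_t{1} << 63;
  std::atomic<std::uint64_t> word{0};
public:
  // Acquire the lock bit and return the buf_free value that was current at
  // acquisition time. The compare_exchange loop corresponds to a loop
  // around lock cmpxchg on x86; on ARMv8, POWER or RISC-V, a retry loop
  // around fetch_or(LOCK_BIT) could be used instead.
  std::uint64_t lock()
  {
    for (;;)
    {
      std::uint64_t v = word.load(std::memory_order_relaxed);
      if (!(v & LOCK_BIT) &&
          word.compare_exchange_weak(v, v | LOCK_BIT,
                                     std::memory_order_acquire,
                                     std::memory_order_relaxed))
        return v;
      // spin (pause hint omitted in this sketch)
    }
  }
  // Release the lock and publish an updated buf_free in the same store.
  void unlock(std::uint64_t buf_free)
  { word.store(buf_free & ~LOCK_BIT, std::memory_order_release); }
};
```

One attraction of this layout is that the unlock also publishes the new buf_free value in a single store.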
I think that some further testing on newer Intel microarchitectures is needed to determine whether I am on the right track with these.
Vladislav Vaintroub added a comment (edited):

I tried the patch on Alder Lake (server running on P-cores, sysbench on E-cores): update_index_256threads_x_10tables_x_1mio_rows.svg

In short, the spinlocks do make performance worse. TPS for a 1-minute update_index run with 256 threads, 10 tables, and 1 million rows (in memory):

baseline: 235146.70
spinlock: 151019.13
spinflag: 147738.52

I also attached the spinflag.svg and baseline.svg flamegraphs, so one can see what is going on. Apparently append_prepare/lsn_delay takes about half of the time (48%) in the "spin" variant on Alder Lake, while in the baseline, append_prepare is barely noticeable at 0.5%. So it is not just Haswell that performs badly, and this roughly matches what we saw two years ago.
Marko Mäkelä added a comment:

https://github.com/MariaDB/server/pull/3148 introduces SET GLOBAL innodb_log_spin_wait_delay, which can be used to enable or disable the spin lock while the server is running. The value 50 should roughly correspond to what the previous spinflag patch did. The default value innodb_log_spin_wait_delay=0 means that log_sys.lsn_lock will be used. I think that we must rely on steve.shaw@intel.com to test this on Emerald Rapids.
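The gating between the two code paths could look roughly like the following sketch (illustrative only; the actual lsn_delay() logic in the pull request may differ in detail):

```cpp
#include <atomic>
#include <mutex>

// Illustrative sketch: how a runtime-tunable spin_wait_delay setting could
// choose between a user-space spin loop and a blocking lock.
struct lsn_lock_t
{
  std::atomic<unsigned> spin_wait_delay{0};  // 0 = use the blocking lock
  std::atomic<bool> locked{false};
  std::mutex fallback;
  bool used_spin = false;  // written only by the current lock holder

  void acquire()
  {
    unsigned delay = spin_wait_delay.load(std::memory_order_relaxed);
    if (delay == 0)
    {
      fallback.lock();  // the log_sys.lsn_lock path
      used_spin = false;
      return;
    }
    while (locked.exchange(true, std::memory_order_acquire))
      for (unsigned i = 0; i < delay; i++)
        ;  // placeholder for MY_RELAX_CPU()-style pause hints
    used_spin = true;
  }
  void release()
  {
    if (used_spin)
      locked.store(false, std::memory_order_release);
    else
      fallback.unlock();
  }
};
```

Remembering which path was taken at acquisition time keeps the lock consistent even if SET GLOBAL changes the setting while the lock is held.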
Steve Shaw added a comment:

I have attached results from a test Emerald Rapids system with 56-core CPUs, covering both 1-socket and 2-socket tests. This improves performance by 9% (1 socket) and 12% (2 sockets) respectively, and gives the highest MariaDB throughput we have measured from any release so far, with very stable performance throughout each of these tests.
Marko Mäkelä added a comment:

I think that it would be interesting to know the limits on the number of concurrent threads on more recent microarchitectures. Common sense suggests that spinning works better when the number of concurrent threads is limited to a fraction of the number of hardware threads. I did not test such low concurrency on my Haswell system.

In any case, this is easy to tune if it does not work, simply by SET GLOBAL innodb_log_spin_wait_delay.