Hi,
A week ago, our production database cluster (1 master, 4 replicas, MaxScale as proxy) started to deadlock on the master approximately every 12 hours. We are still looking for the trigger, but without any success. No obvious problematic query in PROCESSLIST, nothing 
Finally, today we were able to get a decent core dump, using the quay.io/mariadb-foundation/mariadb-debug:10.6 image. The exact version is 10.6.16-MariaDB-1:10.6.16+maria~ubu2004-log source revision: 07494006dd0887ebfb31564a8fd4c59cf1b299e9, and the exact image version is docker.io/library/mariadb@sha256:fcbe381e5fef20c7a2932b52a070f58987b770c651aedf705332e54d1dfd465f
SELECTs seem to be running OK, but DML queries are blocked: some in "opening table", some in "sending data".
I'm attaching both the server log and a full backtrace. I also have the core dump, but it is 700MB bzipped, so I am not attaching it; it is available.
- is caused by MDEV-30753 Possible corruption due to trx_purge_free_segment() (Closed)
Thank you for an excellent report, with a useful mariadbd_full_bt_all_threads.txt to start with. Many threads are waiting for dict_sys.latch, or the caller of syscall() was invoked with this=dict_sys+72. If I ignore those, I should hopefully find the thread that is blocked while holding an exclusive dict_sys.latch. The first interesting thread is this one:
10.6 07494006dd0887ebfb31564a8fd4c59cf1b299e9
Thread 236 (Thread 0x7fea195a2700 (LWP 216099)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
…
#5 srw_lock_impl<true>::wr_lock (line=1435, file=0x561122d9c240 "/home/buildbot/amd64-ubuntu-2004-deb-autobake/build/storage/innobase/trx/trx0undo.cc", this=0x561123344bc8 <trx_sys+16648>) at ./storage/innobase/include/srw_lock.h:528
No locals.
#6 trx_undo_assign_low<false> (trx=trx@entry=0x7feb52822580, rseg=rseg@entry=0x561123344bc0 <trx_sys+16640>, undo=undo@entry=0x7feb52822ed0, mtr=mtr@entry=0x7fea1959e790, err=err@entry=0x7fea1959e754) at ./storage/innobase/trx/trx0undo.cc:1435
block = <optimized out>
This is executing an INSERT, so it can’t be holding an exclusive dict_sys.latch. It is waiting for an exclusive rseg->latch so that it can assign the first undo log page for the transaction. We also have Thread 234 and Thread 230 waiting for the same rseg->latch == trx_sys+16648 at the same spot. This should still be normal; the interesting threads would be ones that are waiting for something else while holding dict_sys.latch or the rseg->latch.
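As a rough aid for spotting several threads queued on the same latch (such as the ones waiting on trx_sys+16648), the backtrace file can be grouped by the first this= pointer that appears in each thread's stack. This is only a sketch: the function name group_by_lock is hypothetical, it assumes the "Thread N (...)" and "this=0x... <symbol+offset>" layout seen in the excerpts here, and the first this= in a stack is a heuristic that will not always be a lock.

```python
import re
from collections import defaultdict

# Thread header as printed by gdb, e.g. "Thread 236 (Thread 0x7fea195a2700 (LWP 216099)):"
THREAD_RE = re.compile(r"^Thread (\d+) \(")
# Matches "this=0x561123344bc8 <trx_sys+16648>" as well as a bare "this=0x7feb2c090398"
THIS_RE = re.compile(r"\bthis=(0x[0-9a-f]+)(?: <([^>]+)>)?")


def group_by_lock(backtrace_text):
    """Group thread numbers by the first this= pointer found in each stack.

    The key is the symbolic name (e.g. "trx_sys+16648") when gdb printed one,
    otherwise the raw address. Heuristic only: the innermost frame that has a
    this= argument is assumed to be the lock being waited for.
    """
    groups = defaultdict(list)
    thread_id = None
    found = False
    for line in backtrace_text.splitlines():
        header = THREAD_RE.match(line)
        if header:
            thread_id = int(header.group(1))
            found = False
            continue
        if thread_id is None or found:
            continue
        hit = THIS_RE.search(line)
        if hit:
            groups[hit.group(2) or hit.group(1)].append(thread_id)
            found = True
    return dict(groups)
```

Calling group_by_lock(open("mariadbd_full_bt_all_threads.txt").read()) and sorting the result by group size would list the most contended addresses first; a large group is not necessarily the culprit, only a candidate to rule in or out.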
In the end, I filtered out these less interesting waits from the output of
grep -B1 -A3 -w syscall mariadbd_full_bt_all_threads.txt
using replace-regexp in GNU Emacs and got the following:
10.6 07494006dd0887ebfb31564a8fd4c59cf1b299e9
Thread 196 (Thread 0x7fea185a0700 (LWP 215877)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
No locals.
#1 0x00005611229ac0d8 in srw_mutex_impl<true>::wait (lk=<optimized out>, this=0x7feb2c090398) at ./storage/innobase/sync/srw_lock.cc:238
No locals.
--
Thread 159 (Thread 0x7fea402d9700 (LWP 1624)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
No locals.
#1 0x00005611229ac028 in srw_mutex_impl<false>::wait (lk=<optimized out>, this=0x7fea00030470) at ./storage/innobase/sync/srw_lock.cc:238
No locals.
--
Thread 149 (Thread 0x7feb52817700 (LWP 57)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
No locals.
#1 0x0000561122aa3b6d in my_getevents (min_nr=1, nr=256, ev=0x7feb52814d60, ctx=<optimized out>) at ./tpool/aio_linux.cc:105
saved_errno = <optimized out>
--
Thread 44 (Thread 0x7fea19188700 (LWP 216113)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
No locals.
#1 0x00005611229ac0d8 in srw_mutex_impl<true>::wait (lk=<optimized out>, this=0x7fea9c096658) at ./storage/innobase/sync/srw_lock.cc:238
No locals.
--
We can ignore Thread 149, which is invoking the io_getevents system call; this image does not use the newer io_uring (MDEV-24883). The other three threads are waiting inside btr_cur_t::search_leaf(). This could be a case that was missed in MDEV-29835. The ticket MDEV-31815 had been filed, but because the test case involved ROW_FORMAT=COMPRESSED tables, it hit other bugs too easily. I will check the mtr_t::m_memo in all threads more deeply, to check if this is related to MDEV-31815.
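For anyone who wants to repeat the filtering without Emacs, the grep plus replace-regexp step can be approximated with a short script. This is a minimal sketch, not the procedure actually used above: the function names and the ignore patterns are assumptions, and it expects the gdb layout seen in the excerpts (a "Thread N (...)" header followed directly by frame #0).

```python
import re

THREAD_HEADER = re.compile(r"^Thread \d+ \(")
# Frame #0 may or may not carry a program counter ("0x... in") before the name.
SYSCALL_FRAME = re.compile(r"^#0\s+(?:0x[0-9a-f]+ in )?syscall \(\)")


def split_threads(text):
    """Split gdb 'thread apply all bt' output into one string per thread."""
    threads, current = [], []
    for line in text.splitlines():
        is_header = bool(THREAD_HEADER.match(line))
        if is_header and current:
            threads.append("\n".join(current))
            current = []
        if is_header or current:
            current.append(line)
    if current:
        threads.append("\n".join(current))
    return threads


def interesting_threads(text, ignore_patterns):
    """Return headers of threads blocked in syscall() that match no ignore pattern."""
    ignored = [re.compile(p) for p in ignore_patterns]
    kept = []
    for thread in split_threads(text):
        lines = thread.splitlines()
        in_syscall = len(lines) > 1 and SYSCALL_FRAME.match(lines[1])
        if in_syscall and not any(p.search(thread) for p in ignored):
            kept.append(lines[0])
    return kept
```

With the waits discussed in this comment, a call could look like interesting_threads(text, [r"dict_sys\+72", r"trx_sys\+16648"]); further patterns can be added as more waits are judged uninteresting.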