For some reason, I did not manage to get a core dump for this problem, even after I removed calls to sigaction(2). Finally, I was able to figure out the cause by applying the following patch:
I got 4 hung mariabackup processes near the end of the run. I attached gdb to one of them. Sure enough, one of the threads was executing the kill(). We had write_slots.m_cache.m_pos=1, corresponding to the following thread:
#4 0x000055d34d10fb1a in __sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__msan::AP64> >::Refill(__sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__msan::AP64> >::PerClass*, __sanitizer::SizeClassAllocator64<__msan::AP64>*, unsigned long) ()
#5 0x000055d34d10f7f8 in __sanitizer::CombinedAllocator<__sanitizer::SizeClassAllocator64<__msan::AP64>, __sanitizer::LargeMmapAllocatorPtrArrayDynamic>::Allocate(__sanitizer::SizeClassAllocator64LocalCache<__s--Type <RET> for more, q to quit, c to continue without paging--c
#6 0x000055d34d10ebf9 in __msan::MsanAllocate(__sanitizer::StackTrace*, unsigned long, unsigned long, bool) ()
#7 0x000055d34d10eb5a in __msan::msan_malloc(unsigned long, __sanitizer::StackTrace*) ()
#8 0x000055d34d119023 in malloc ()
#9 0x000055d3505522b7 in DbugMalloc (size=568) at /mariadb/10.11/dbug/dbug.c:2074
#10 PushState (cs=0x704000008000) at /mariadb/10.11/dbug/dbug.c:1599
#11 0x000055d350555989 in _db_push_ (control=0x55d34c788b7d "") at /mariadb/10.11/dbug/dbug.c:867
#12 0x000055d3504e02ef in safe_mutex_lock (mp=0x55d3533a8fd8 <fil_system+40>, my_flags=0, file=0x55d34c784541 "/mariadb/10.11/storage/innobase/fil/fil0fil.cc", line=2682) at /mariadb/10.11/mysys/thr_mutex.c:236
#13 0x000055d34f76acc6 in inline_mysql_mutex_lock (src_line=2682, that=<optimized out>, src_file=<optimized out>) at /mariadb/10.11/include/mysql/psi/mysql_thread.h:750
#14 fil_node_t::complete_write (this=0x707000000328) at /mariadb/10.11/storage/innobase/fil/fil0fil.cc:2682
#15 0x000055d34f75917f in IORequest::write_complete (this=<optimized out>, io_error=<optimized out>) at /mariadb/10.11/storage/innobase/fil/fil0fil.cc:2819
#16 0x000055d34fcfb791 in write_io_callback (c=0x7fc44d450000) at /mariadb/10.11/storage/innobase/os/os0file.cc:3465
#17 0x000055d350267d46 in tpool::task_group::execute (this=0x7130000002e0, t=0x7fc44d450048) at /mariadb/10.11/tpool/task_group.cc:70
#18 0x000055d3502580ef in tpool::thread_pool_generic::worker_main (this=0x718000000000, thread_var=0x730000010080) at /mariadb/10.11/tpool/tpool_generic.cc:583
This thread had already released the page latch, which allowed the buf_flush_page_cleaner() to finish. It had not yet invoked the following:
node->complete_write();
node->space->release();
This could mean that during a normal operation, a log checkpoint would fail to invoke fdatasync() or fsync() to make durable the last writes on a file. The two calls should probably be made part of buf_page_write_complete() right before the page latch release.
The assertion expression would fail until write_slots->release(cb) was called in write_io_callback(). Perhaps we should replace the debug assertions with some wait loops.
The debug assertions in mariabackup as well as create_log_file() and srv_start() are best replaced with calls to os_aio_wait_until_no_pending_writes(false), to ensure that the debug build would hang in case there is some problem. In non-debug builds, the buf_page_t::lock should be held until the write has been properly registered.
Marko Mäkelä
added a comment - I think that the following fixes the race condition between log checkpoint and page write completion:
diff --git a/storage/innobase/fil/fil0fil.cc b/storage/innobase/fil/fil0fil.cc
index 27f9a35fb4c..bcd18e73926 100644
--- a/storage/innobase/fil/fil0fil.cc
+++ b/storage/innobase/fil/fil0fil.cc
@@ -2804,6 +2804,7 @@ void IORequest::write_complete(int io_error) const
ut_ad(fil_validate_skip());
ut_ad(node);
ut_ad(is_write());
+ node->complete_write();
if (!bpage)
{
@@ -2816,7 +2817,6 @@ void IORequest::write_complete(int io_error) const
else
buf_page_write_complete(*this, io_error);
- node->complete_write();
node->space->release();
}
The debug assertions in mariabackup as well as create_log_file() and srv_start() are best replaced with calls to os_aio_wait_until_no_pending_writes(false) , to ensure that the debug build would hang in case there is some problem. In non-debug builds, the buf_page_t::lock should be held until the write has been properly registered.
The race condition on write completion affects 10.5 as well. The debug assertion is only present in 10.6 onwards, related to MDEV-26827.
Marko Mäkelä
added a comment - The race condition on write completion affects 10.5 as well. The debug assertion is only present in 10.6 onwards, related to MDEV-26827 .
People
Marko Mäkelä
Marko Mäkelä
Votes:
0Vote for this issue
Watchers:
1Start watching this issue
Dates
Created:
Updated:
Resolved:
Git Integration
Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.
{"report":{"fcp":903.7999997138977,"ttfb":267.59999990463257,"pageVisibility":"visible","entityId":125846,"key":"jira.project.issue.view-issue","isInitial":true,"threshold":1000,"elementTimings":{},"userDeviceMemory":8,"userDeviceProcessors":64,"apdex":1,"journeyId":"ebbd9f9a-a8fd-4ad4-b9d3-80b99d3b55de","navigationType":0,"readyForUser":997.5,"redirectCount":0,"resourceLoadedEnd":600.7999997138977,"resourceLoadedStart":275.90000009536743,"resourceTiming":[{"duration":86.2999997138977,"initiatorType":"link","name":"https://jira.mariadb.org/s/2c21342762a6a02add1c328bed317ffd-CDN/lu2cib/820016/12ta74/0a8bac35585be7fc6c9cc5a0464cd4cf/_/download/contextbatch/css/_super/batch.css","startTime":275.90000009536743,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":275.90000009536743,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":362.19999980926514,"responseStart":0,"secureConnectionStart":0},{"duration":86.40000009536743,"initiatorType":"link","name":"https://jira.mariadb.org/s/7ebd35e77e471bc30ff0eba799ebc151-CDN/lu2cib/820016/12ta74/494e4c556ecbb29f90a3d3b4f09cb99c/_/download/contextbatch/css/jira.browse.project,project.issue.navigator,jira.view.issue,jira.general,jira.global,atl.general,-_super/batch.css?agile_global_admin_condition=true&jag=true&jira.create.linked.issue=true&slack-enabled=true&whisper-enabled=true","startTime":276.19999980926514,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":276.19999980926514,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":362.59999990463257,"responseStart":0,"secureConnectionStart":0},{"duration":170.2999997138977,"initiatorType":"script","name":"https://jira.mariadb.org/s/0917945aaa57108d00c5076fea35e069-CDN/lu2cib/820016/12ta74/0a8bac35585be7fc6c9cc5a0464cd4cf/_/download/contextbatch/js/_super/batch.js?locale=en","startTime":276.40000009536743,"connectEnd":276.40000009536743,"connectStart":276.40000009536743,"domainLookupEnd":276.40000009536743,"domainLookupStart":276.40000009536743,"fetchStart":276.40000009536743,"redirectEnd":0,"redirectStart":0,"requestStart":364.5,"responseEnd":446.69999980926514,"responseStart":413.90000009536743,"secureConnectionStart":276.40000009536743},{"duration":323.7999997138977,"initiatorType":"script","name":"https://jira.mariadb.org/s/2d8175ec2fa4c816e8023260bd8c1786-CDN/lu2cib/820016/12ta74/494e4c556ecbb29f90a3d3b4f09cb99c/_/download/contextbatch/js/jira.browse.project,project.issue.navigator,jira.view.issue,jira.general,jira.global,atl.general,-_super/batch.js?agile_global_admin_condition=true&jag=true&jira.create.linked.issue=true&locale=en&slack-enabled=true&whisper-enabled=true","startTime":277,"connectEnd":365.40000009536743,"connectStart":365.40000009536743,"domainLookupEnd":365.40000009536743,"domainLookupStart":365.40000009536743,"fetchStart":277,"redirectEnd":0,"redirectStart":0,"requestStart":365.69999980926514,"responseEnd":600.7999997138977,"responseStart":382.59999990463257,"secureConnectionStart":365.40000009536743},{"duration":107.60000038146973,"initiatorType":"script","name":"https://jira.mariadb.org/s/a9324d6758d385eb45c462685ad88f1d-CDN/lu2cib/820016/12ta74/c92c0caa9a024ae85b0ebdbed7fb4bd7/_/download/contextbatch/js/atl.global,-_super/batch.js?locale=en","startTime":277.2999997138977,"connectEnd":277.2999997138977,"connectStart":277.2999997138977,"domainLookupEnd":277.2999997138977,"domainLookupStart":277.2999997138977,"fetchStart":277.2999997138977,"redirectEnd":0,"redirectStart":0,"requestStart":366.90000009536743,"responseEnd":384.90000009536743,"responseStart":381.90000009536743,"secureConnectionStart":277.2999997138977},{"duration":106.89999961853027,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/1.0/_/download/batch/jira.webresources:calendar-en/jira.webresources:calendar-en.js","startTime":277.40000009536743,"connectEnd":277.40000009536743,"connectStart":277.40000009536743,"domainLookupEnd":277.40000009536743,"domainLookupStart":277.40000009536743,"fetchStart":277.40000009536743,"redirectEnd":0,"redirectStart":0,"requestStart":366.59999990463257,"responseEnd":384.2999997138977,"responseStart":381.5,"secureConnectionStart":277.40000009536743},{"duration":111.30000019073486,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/1.0/_/download/batch/jira.webresources:calendar-localisation-moment/jira.webresources:calendar-localisation-moment.js","startTime":277.59999990463257,"connectEnd":277.59999990463257,"connectStart":277.59999990463257,"domainLookupEnd":277.59999990463257,"domainLookupStart":277.59999990463257,"fetchStart":277.59999990463257,"redirectEnd":0,"redirectStart":0,"requestStart":368.90000009536743,"responseEnd":388.90000009536743,"responseStart":387,"secureConnectionStart":277.59999990463257},{"duration":89.90000009536743,"initiatorType":"link","name":"https://jira.mariadb.org/s/b04b06a02d1959df322d9cded3aeecc1-CDN/lu2cib/820016/12ta74/a2ff6aa845ffc9a1d22fe23d9ee791fc/_/download/contextbatch/css/jira.global.look-and-feel,-_super/batch.css","startTime":277.69999980926514,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":277.69999980926514,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":367.59999990463257,"responseStart":0,"secureConnectionStart":0},{"duration":116.2999997138977,"initiatorType":"script","name":"https://jira.mariadb.org/rest/api/1.0/shortcuts/820016/47140b6e0a9bc2e4913da06536125810/shortcuts.js?context=issuenavigation&context=issueaction","startTime":277.90000009536743,"connectEnd":277.90000009536743,"connectStart":277.90000009536743,"domainLookupEnd":277.90000009536743,"domainLookupStart":277.90000009536743,"fetchStart":277.90000009536743,"redirectEnd":0,"redirectStart":0,"requestStart":374.59999990463257,"responseEnd":394.19999980926514,"responseStart":391.19999980926514,"secureConnectionStart":277.90000009536743},{"duration":91.19999980926514,"initiatorType":"link","name":"https://jira.mariadb.org/s/3ac36323ba5e4eb0af2aa7ac7211b4bb-CDN/lu2cib/820016/12ta74/d176f0986478cc64f24226b3d20c140d/_/download/contextbatch/css/com.atlassian.jira.projects.sidebar.init,-_super,-project.issue.navigator,-jira.view.issue/batch.css?jira.create.linked.issue=true","startTime":278,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":278,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":369.19999980926514,"responseStart":0,"secureConnectionStart":0},{"duration":123.59999990463257,"initiatorType":"script","name":"https://jira.mariadb.org/s/5d5e8fe91fbc506585e83ea3b62ccc4b-CDN/lu2cib/820016/12ta74/d176f0986478cc64f24226b3d20c140d/_/download/contextbatch/js/com.atlassian.jira.projects.sidebar.init,-_super,-project.issue.navigator,-jira.view.issue/batch.js?jira.create.linked.issue=true&locale=en","startTime":278.09999990463257,"connectEnd":278.09999990463257,"connectStart":278.09999990463257,"domainLookupEnd":278.09999990463257,"domainLookupStart":278.09999990463257,"fetchStart":278.09999990463257,"redirectEnd":0,"redirectStart":0,"requestStart":375,"responseEnd":401.69999980926514,"responseStart":393,"secureConnectionStart":278.09999990463257},{"duration":297.80000019073486,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/1.0/_/download/batch/jira.webresources:bigpipe-js/jira.webresources:bigpipe-js.js","startTime":289.2999997138977,"connectEnd":289.2999997138977,"connectStart":289.2999997138977,"domainLookupEnd":289.2999997138977,"domainLookupStart":289.2999997138977,"fetchStart":289.2999997138977,"redirectEnd":0,"redirectStart":0,"requestStart":433.09999990463257,"responseEnd":587.0999999046326,"responseStart":584.0999999046326,"secureConnectionStart":289.2999997138977},{"duration":296.40000009536743,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/1.0/_/download/batch/jira.webresources:bigpipe-init/jira.webresources:bigpipe-init.js","startTime":293,"connectEnd":293,"connectStart":293,"domainLookupEnd":293,"domainLookupStart":293,"fetchStart":293,"redirectEnd":0,"redirectStart":0,"requestStart":479.09999990463257,"responseEnd":589.4000000953674,"responseStart":586.5999999046326,"secureConnectionStart":293},{"duration":80,"initiatorType":"xmlhttprequest","name":"https://jira.mariadb.org/rest/webResources/1.0/resources","startTime":594.7999997138977,"connectEnd":594.7999997138977,"connectStart":594.7999997138977,"domainLookupEnd":594.7999997138977,"domainLookupStart":594.7999997138977,"fetchStart":594.7999997138977,"redirectEnd":0,"redirectStart":0,"requestStart":634,"responseEnd":674.7999997138977,"responseStart":674,"secureConnectionStart":594.7999997138977},{"duration":75.5,"initiatorType":"xmlhttprequest","name":"https://jira.mariadb.org/rest/webResources/1.0/resources","startTime":822,"connectEnd":822,"connectStart":822,"domainLookupEnd":822,"domainLookupStart":822,"fetchStart":822,"redirectEnd":0,"redirectStart":0,"requestStart":860.6999998092651,"responseEnd":897.5,"responseStart":896.1999998092651,"secureConnectionStart":822},{"duration":142.2999997138977,"initiatorType":"script","name":"https://www.google-analytics.com/analytics.js","startTime":897.4000000953674,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":897.4000000953674,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":1039.6999998092651,"responseStart":0,"secureConnectionStart":0},{"duration":148.19999980926514,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/e65b778d185daf5aee24936755b43da6/_/download/contextbatch/js/browser-metrics-plugin.contrib,-_super,-project.issue.navigator,-jira.view.issue,-atl.general/batch.js?agile_global_admin_condition=true&jag=true&jira.create.linked.issue=true&slack-enabled=true&whisper-enabled=true","startTime":921.0999999046326,"connectEnd":921.0999999046326,"connectStart":921.0999999046326,"domainLookupEnd":921.0999999046326,"domainLookupStart":921.0999999046326,"fetchStart":921.0999999046326,"redirectEnd":0,"redirectStart":0,"requestStart":1055.5999999046326,"responseEnd":1069.2999997138977,"responseStart":1068.5999999046326,"secureConnectionStart":921.0999999046326}],"fetchStart":0,"domainLookupStart":0,"domainLookupEnd":0,"connectStart":0,"connectEnd":0,"requestStart":79,"responseStart":268,"responseEnd":293,"domLoading":271,"domInteractive":1065,"domContentLoadedEventStart":1065,"domContentLoadedEventEnd":1123,"domComplete":1529,"loadEventStart":1529,"loadEventEnd":1530,"userAgent":"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)","marks":[{"name":"bigPipe.sidebar-id.start","time":1041.9000000953674},{"name":"bigPipe.sidebar-id.end","time":1042.6999998092651},{"name":"bigPipe.activity-panel-pipe-id.start","time":1042.7999997138977},{"name":"bigPipe.activity-panel-pipe-id.end","time":1044.4000000953674},{"name":"activityTabFullyLoaded","time":1138.2999997138977}],"measures":[],"correlationId":"bda7a18b389743","effectiveType":"4g","downlink":10,"rtt":0,"serverDuration":102,"dbReadsTimeInMs":14,"dbConnsTimeInMs":24,"applicationHash":"9d11dbea5f4be3d4cc21f03a88dd11d8c8687422","experiments":[]}}
I think that the following fixes the race condition between log checkpoint and page write completion:
diff --git a/storage/innobase/fil/fil0fil.cc b/storage/innobase/fil/fil0fil.cc
index 27f9a35fb4c..bcd18e73926 100644
--- a/storage/innobase/fil/fil0fil.cc
+++ b/storage/innobase/fil/fil0fil.cc
@@ -2804,6 +2804,7 @@ void IORequest::write_complete(int io_error) const
ut_ad(fil_validate_skip());
ut_ad(node);
ut_ad(is_write());
+ node->complete_write();
if (!bpage)
{
@@ -2816,7 +2817,6 @@ void IORequest::write_complete(int io_error) const
else
buf_page_write_complete(*this, io_error);
- node->complete_write();
node->space->release();
}
The debug assertions in mariabackup as well as create_log_file() and srv_start() are best replaced with calls to os_aio_wait_until_no_pending_writes(false), to ensure that the debug build would hang in case there is some problem. In non-debug builds, the buf_page_t::lock should be held until the write has been properly registered.