The issue is not repeatable but here is a snapshot of the stack of user case when it crashed.
1. srv_monitor_task: Raises fatal error as dict sys latch wait found to exceed 600 sec (10 minutes)
#5 0x000056327cfbad2f in ib::fatal::~fatal (this=<optimized out>, __in_chrg=<optimized out>)
|
#6 0x000056327cfb222f in srv_monitor_task.cold ()
|
#7 0x000056327d865268 in tpool::thread_pool_generic::timer_generic::run (this=0x56327f13be50)
|
2. DML threads waiting on dict sys latch while trying to open a table
#5 dict_sys_t::freeze (line=1039, file="storage/innobase/dict/dict0dict.cc" at /usr/src/debug/MariaDB-/src_0/storage/innobase/dict/dict0dict.cc:1006
|
#6 0x000056327d7e9f88 in dict_table_open_on_name (table_name="zabbix/applications", DICT_ERR_IGNORE_FK_NOKEY) at /usr/src/debug/MariaDB-/src_0/storage/innobase/dict/dict0dict.cc:1039
|
#7 0x000056327d699f65 in ha_innobase::open(char const*, int, unsigned int) () at /usr/src/debug/MariaDB-/src_0/storage/innobase/handler/ha_innodb.cc:6187
|
#9 0x000056327d20fa82 in open_table_from_share
|
3. Threads stuck waiting for free page in buffer pool.
#1 0x000056327d7dcd9f in buf_LRU_get_free_block (have_mutex=have_mutex@entry=false) at storage/innobase/buf/buf0lru.cc:467
|
3A. One of the above threads is fetching dict stat information holding X dictionary latch and hence blocking all dictionary operations.
#1 0x000056327d7dcd9f in buf_LRU_get_free_block (have_mutex=have_mutex@entry=false) at storage/innobase/buf/buf0lru.cc:467
|
#2 0x000056327d02425f in buf_page_init_for_read (mode=132, page_id={m_id = 37005438222352}, at storage/innobase/buf/buf0rea.cc:121
|
#3 0x000056327d7dd6f7 in buf_read_page_low (unzip=false, zip_size=0, page_id={m_id = 37005438222352}, mode=132, sync=true, space=0x56327efbbdb8) at storage/innobase/buf/buf0rea.cc:295
|
#10 0x000056327d76188b in row_sel_step (thr=thr@entry=0x7f9418606dd0) at /usr/src/debug/MariaDB-/src_0/storage/innobase/row/row0sel.cc:2433
|
#14 0x000056327d71dd1b in que_eval_sql (info=info@entry=0x7f941805d890,
|
#15 0x000056327d8060a2 in dict_stats_fetch_from_ps (table=0x7f9418600e40) at /usr/src/debug/MariaDB-/src_0/storage/innobase/dict/dict0stats.cc:3910
|
4. Page cleaner thread which should generate free pages, is busy flushing pages.
|
#0 0x00007fa721b3248c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
|
#1 0x000056327d7d14bc in buf_dblwr_t::flush_buffered_writes (this=0x56327eaa3520 <buf_dblwr>, size=64) at /usr/src/debug/MariaDB-/src_0/storage/innobase/buf/buf0dblwr.cc:588
|
#2 0x000056327d7d1b5b in buf_dblwr_t::add_to_batch (this=<buf_dblwr>, request= {bpage = 0x7f9e6800faa0, slot=0x0, type = IORequest::WRITE_ASYNC}, size=16384) at storage/innobase/buf/buf0dblwr.cc:647
|
#3 0x000056327d7d7655 in buf_page_t::flush (this=0x7f9e6800faa0, space=0x7f92e0072e78) at /usr/src/debug/MariaDB-/src_0/storage/innobase/include/os0file.h:209
|
#4 0x000056327d7d9397 in buf_do_flush_list_batch (lsn=<optimized out>, max_n=<optimized out>) at /usr/src/debug/MariaDB-/src_0/storage/innobase/buf/buf0flu.cc:1524
|
#5 buf_flush_list_holding_mutex (lsn=<optimized out>, max_n=<optimized out>) at /usr/src/debug/MariaDB-/src_0/storage/innobase/buf/buf0flu.cc:1598
|
#6 buf_flush_list_holding_mutex (max_n=<optimized out>, lsn=<optimized out>) at /usr/src/debug/MariaDB-/src_0/storage/innobase/buf/buf0flu.cc:1579
|
#7 0x000056327d7d9dc9 in buf_flush_page_cleaner () at /usr/src/debug/MariaDB-/src_0/storage/innobase/buf/buf0flu.cc:2503
|
The crash happened as page cleaner could not free up any page for 10 minutes. Now during regular iteration the page cleaner does flush from flush list for maximum of innodb_max_io_capacity capacity and then goes for generating free pages from LRU tail.
In the specific case innodb_max_io_capacity = 4000 4000x16k ~64M data page flush(128M write IO with double write) should not take that long. From the detailed code analysis I found one catch in some specific case where we may skip freeing any page from the LRU tail and go to the next iteration for flush list flush. The process could repeat for long time if the buffer pool size (in GBs) and the total number of dirty pages are large.
I could repeat the case where LRU flush/eviction step fails to free any free page in debugger with the following scenario.
1. Dirty pages that are modified and not used for long time (set-A) filling up the tail of the LRU.
2. At the same time other dirty pages modified earlier than the LRU tail pages(set-B) but used and modified repeatedly. These pages are in the head of the flush list and would be flushed prior to set-A while flushing from the flush list to reduce checkpoint age.
3. When asynchronous flush is triggered i.e. when 7/8 th of the LSN margin is filled in the redo log, we try to flush innodb_io_capacity_max number of pages from flush list in the iteration. after flushing innodb_io_capacity_max number of pages from flush list we try to do flush and eviction from the LRU (buf_flush_LRU) with max_n=0 i.e. the number of pages allowed to flush during LRU scan is zero because we have already flushed innodb_io_capacity_max amount of pages from flush list in the iteration.
4. We have all dirty pages (set-A) in LRU tail and we return not able to evict any page at all.
This scenario could happen with large buffer pool size. It should be fixed by ensuring at least some flush quota for LRU flush and eviction stage in a iteration.
{"report":{"fcp":4519,"ttfb":2897.5,"pageVisibility":"visible","entityId":133156,"key":"jira.project.issue.view-issue","isInitial":true,"threshold":1000,"elementTimings":{},"userDeviceMemory":8,"userDeviceProcessors":64,"apdex":0,"journeyId":"7534fdcf-7ee4-4de3-b77f-c5b08a17acb6","navigationType":0,"readyForUser":4649.699999809265,"redirectCount":0,"resourceLoadedEnd":5397,"resourceLoadedStart":2910.300000190735,"resourceTiming":[{"duration":521.1999998092651,"initiatorType":"link","name":"https://jira.mariadb.org/s/2c21342762a6a02add1c328bed317ffd-CDN/lu2cib/820016/12ta74/0a8bac35585be7fc6c9cc5a0464cd4cf/_/download/contextbatch/css/_super/batch.css","startTime":2910.300000190735,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":2910.300000190735,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":3431.5,"responseStart":0,"secureConnectionStart":0},{"duration":521.2000002861023,"initiatorType":"link","name":"https://jira.mariadb.org/s/7ebd35e77e471bc30ff0eba799ebc151-CDN/lu2cib/820016/12ta74/494e4c556ecbb29f90a3d3b4f09cb99c/_/download/contextbatch/css/jira.browse.project,project.issue.navigator,jira.view.issue,jira.general,jira.global,atl.general,-_super/batch.css?agile_global_admin_condition=true&jag=true&jira.create.linked.issue=true&slack-enabled=true&whisper-enabled=true","startTime":2910.699999809265,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":2910.699999809265,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":3431.9000000953674,"responseStart":0,"secureConnectionStart":0},{"duration":532.7999997138977,"initiatorType":"script","name":"https://jira.mariadb.org/s/0917945aaa57108d00c5076fea35e069-CDN/lu2cib/820016/12ta74/0a8bac35585be7fc6c9cc5a0464cd4cf/_/download/contextbatch/js/_super/batch.js?locale=en","startTime":2910.800000190735,"connectEnd":2910.800000190735,"connectStart":2910.800000190735,"domainLookupEnd":2910.800000190735,"domainLookupStart":2910.800000190735,"fetchStart":2910.800000190735,"redirectEnd":0,"redirectStart":0,"requestStart":2910.800000190735,"responseEnd":3443.5999999046326,"responseStart":3443.5999999046326,"secureConnectionStart":2910.800000190735},{"duration":589.5999999046326,"initiatorType":"script","name":"https://jira.mariadb.org/s/2d8175ec2fa4c816e8023260bd8c1786-CDN/lu2cib/820016/12ta74/494e4c556ecbb29f90a3d3b4f09cb99c/_/download/contextbatch/js/jira.browse.project,project.issue.navigator,jira.view.issue,jira.general,jira.global,atl.general,-_super/batch.js?agile_global_admin_condition=true&jag=true&jira.create.linked.issue=true&locale=en&slack-enabled=true&whisper-enabled=true","startTime":2911,"connectEnd":2911,"connectStart":2911,"domainLookupEnd":2911,"domainLookupStart":2911,"fetchStart":2911,"redirectEnd":0,"redirectStart":0,"requestStart":2911,"responseEnd":3500.5999999046326,"responseStart":3500.5999999046326,"secureConnectionStart":2911},{"duration":599.3000001907349,"initiatorType":"script","name":"https://jira.mariadb.org/s/a9324d6758d385eb45c462685ad88f1d-CDN/lu2cib/820016/12ta74/c92c0caa9a024ae85b0ebdbed7fb4bd7/_/download/contextbatch/js/atl.global,-_super/batch.js?locale=en","startTime":2911.199999809265,"connectEnd":2911.199999809265,"connectStart":2911.199999809265,"domainLookupEnd":2911.199999809265,"domainLookupStart":2911.199999809265,"fetchStart":2911.199999809265,"redirectEnd":0,"redirectStart":0,"requestStart":2911.199999809265,"responseEnd":3510.5,"responseStart":3510.5,"secureConnectionStart":2911.199999809265},{"duration":599.5,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/1.0/_/download/batch/jira.webresources:calendar-en/jira.webresources:calendar-en.js","startTime":2911.5,"connectEnd":2911.5,"connectStart":2911.5,"domainLookupEnd":2911.5,"domainLookupStart":2911.5,"fetchStart":2911.5,"redirectEnd":0,"redirectStart":0,"requestStart":2911.5,"responseEnd":3511,"responseStart":3511,"secureConnectionStart":2911.5},{"duration":599.7000002861023,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/1.0/_/download/batch/jira.webresources:calendar-localisation-moment/jira.webresources:calendar-localisation-moment.js","startTime":2911.699999809265,"connectEnd":2911.699999809265,"connectStart":2911.699999809265,"domainLookupEnd":2911.699999809265,"domainLookupStart":2911.699999809265,"fetchStart":2911.699999809265,"redirectEnd":0,"redirectStart":0,"requestStart":2911.699999809265,"responseEnd":3511.4000000953674,"responseStart":3511.4000000953674,"secureConnectionStart":2911.699999809265},{"duration":600.8999996185303,"initiatorType":"link","name":"https://jira.mariadb.org/s/b04b06a02d1959df322d9cded3aeecc1-CDN/lu2cib/820016/12ta74/a2ff6aa845ffc9a1d22fe23d9ee791fc/_/download/contextbatch/css/jira.global.look-and-feel,-_super/batch.css","startTime":2911.800000190735,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":2911.800000190735,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":3512.699999809265,"responseStart":0,"secureConnectionStart":0},{"duration":600.0999999046326,"initiatorType":"script","name":"https://jira.mariadb.org/rest/api/1.0/shortcuts/820016/47140b6e0a9bc2e4913da06536125810/shortcuts.js?context=issuenavigation&context=issueaction","startTime":2912,"connectEnd":2912,"connectStart":2912,"domainLookupEnd":2912,"domainLookupStart":2912,"fetchStart":2912,"redirectEnd":0,"redirectStart":0,"requestStart":2912,"responseEnd":3512.0999999046326,"responseStart":3512.0999999046326,"secureConnectionStart":2912},{"duration":600.9000000953674,"initiatorType":"link","name":"https://jira.mariadb.org/s/3ac36323ba5e4eb0af2aa7ac7211b4bb-CDN/lu2cib/820016/12ta74/d176f0986478cc64f24226b3d20c140d/_/download/contextbatch/css/com.atlassian.jira.projects.sidebar.init,-_super,-project.issue.navigator,-jira.view.issue/batch.css?jira.create.linked.issue=true","startTime":2912.199999809265,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":2912.199999809265,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":3513.0999999046326,"responseStart":0,"secureConnectionStart":0},{"duration":600.6999998092651,"initiatorType":"script","name":"https://jira.mariadb.org/s/5d5e8fe91fbc506585e83ea3b62ccc4b-CDN/lu2cib/820016/12ta74/d176f0986478cc64f24226b3d20c140d/_/download/contextbatch/js/com.atlassian.jira.projects.sidebar.init,-_super,-project.issue.navigator,-jira.view.issue/batch.js?jira.create.linked.issue=true&locale=en","startTime":2912.4000000953674,"connectEnd":2912.4000000953674,"connectStart":2912.4000000953674,"domainLookupEnd":2912.4000000953674,"domainLookupStart":2912.4000000953674,"fetchStart":2912.4000000953674,"redirectEnd":0,"redirectStart":0,"requestStart":2912.4000000953674,"responseEnd":3513.0999999046326,"responseStart":3513.0999999046326,"secureConnectionStart":2912.4000000953674},{"duration":872.3999996185303,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/1.0/_/download/batch/jira.webresources:bigpipe-js/jira.webresources:bigpipe-js.js","startTime":2913.300000190735,"connectEnd":2913.300000190735,"connectStart":2913.300000190735,"domainLookupEnd":2913.300000190735,"domainLookupStart":2913.300000190735,"fetchStart":2913.300000190735,"redirectEnd":0,"redirectStart":0,"requestStart":2913.300000190735,"responseEnd":3785.699999809265,"responseStart":3785.699999809265,"secureConnectionStart":2913.300000190735},{"duration":2483.5999999046326,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/1.0/_/download/batch/jira.webresources:bigpipe-init/jira.webresources:bigpipe-init.js","startTime":2913.4000000953674,"connectEnd":2913.4000000953674,"connectStart":2913.4000000953674,"domainLookupEnd":2913.4000000953674,"domainLookupStart":2913.4000000953674,"fetchStart":2913.4000000953674,"redirectEnd":0,"redirectStart":0,"requestStart":2913.4000000953674,"responseEnd":5397,"responseStart":5397,"secureConnectionStart":2913.4000000953674},{"duration":1469.2000002861023,"initiatorType":"xmlhttprequest","name":"https://jira.mariadb.org/rest/webResources/1.0/resources","startTime":3809.0999999046326,"connectEnd":3809.0999999046326,"connectStart":3809.0999999046326,"domainLookupEnd":3809.0999999046326,"domainLookupStart":3809.0999999046326,"fetchStart":3809.0999999046326,"redirectEnd":0,"redirectStart":0,"requestStart":3809.0999999046326,"responseEnd":5278.300000190735,"responseStart":5278.300000190735,"secureConnectionStart":3809.0999999046326}],"fetchStart":0,"domainLookupStart":0,"domainLookupEnd":0,"connectStart":0,"connectEnd":0,"requestStart":2724,"responseStart":2897,"responseEnd":2906,"domLoading":2908,"domInteractive":5450,"domContentLoadedEventStart":5450,"domContentLoadedEventEnd":5534,"domComplete":6308,"loadEventStart":6309,"loadEventEnd":6309,"userAgent":"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)","marks":[{"name":"bigPipe.sidebar-id.start","time":5403.900000095367},{"name":"bigPipe.sidebar-id.end","time":5404.900000095367},{"name":"bigPipe.activity-panel-pipe-id.start","time":5405.099999904633},{"name":"bigPipe.activity-panel-pipe-id.end","time":5406.300000190735},{"name":"activityTabFullyLoaded","time":5556.099999904633}],"measures":[],"correlationId":"47e6927d3358a3","effectiveType":"4g","downlink":10,"rtt":0,"serverDuration":103,"dbReadsTimeInMs":22,"dbConnsTimeInMs":31,"applicationHash":"9d11dbea5f4be3d4cc21f03a88dd11d8c8687422","experiments":[]}}