MariaDB Server / MDEV-35424

False alarm/crash: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch

Details

    Description

      Hello,

      I upgraded from 10.11.9 to 10.11.10 yesterday afternoon. Today right around noon I got "innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch" and my mariadb service restarted.

      I'm going to be uploading a stack trace to the MariaDB FTP/SFTP/WebDAV Server once I have an MDEV number so I know how to name it. If somebody could take a look I would appreciate it. Thank you!


          Activity

            I guess I should have followed up immediately to say that I did upload the stack trace. It's in private/MDEV-35424_stack_trace.txt.gz .

            xan@biblionix.com Xan Charbonnet added a comment

            It just happened again to a different machine in the cluster. I've uploaded MDEV-35424_stack_trace_2.txt.gz. This file is only 6KB compressed, whereas the other was over 700KB compressed. Either this new one will be easier to figure out, or I've done something wrong.

            I'm going to attempt to revert back to 10.11.9.

            xan@biblionix.com Xan Charbonnet added a comment

            The problem has not recurred since I reverted back to 10.11.9 on November 16. It does seem to be an issue with 10.11.10.

            xan@biblionix.com Xan Charbonnet added a comment

            In the second stack trace, there do not seem to be any actual waits, and the server is being shut down:

            mariadb-10.11.10

            #12 std::unique_ptr<tpool::timer, std::default_delete<tpool::timer> >::reset (__p=0x0, this=<optimized out>) at /usr/include/c++/12/bits/unique_ptr.h:501
            No locals.
            #13 logs_empty_and_mark_files_at_shutdown () at ./storage/innobase/log/log0log.cc:1319
                    loop = <optimized out>
                    lsn = <optimized out>
                    count = 0
                    do_srv_shutdown = true
                    thread_name = <optimized out>
            #14 0x00005560bf1ba0b0 in innodb_shutdown () at ./storage/innobase/srv/srv0start.cc:1990
            No locals.
            

            Did you upload the corresponding server error log as well?
            The source code line here is the following:

            srv_monitor_timer.reset();
            

            This seems to be directly related to the srv_monitor_task, which apparently triggered a false alarm in another thread:

            mariadb-10.11.10

            #7  0x00005560bf1c1bcd in ib::fatal::~fatal (this=this@entry=0x7f608470eb00, __in_chrg=<optimized out>) at ./storage/innobase/ut/ut0ut.cc:488
            No locals.
            #8  0x00005560bf1b9800 in srv_monitor_task () at ./storage/innobase/srv/srv0srv.cc:1091
                    waited = 32
                    start = <optimized out>
                    old_lsn = 72279359647574
                    new_lsn = <optimized out>
                    now = <optimized out>
                    threshold = <optimized out>
            #9  0x00005560bfa74038 in tpool::thread_pool_generic::timer_generic::run (this=0x5560fb816240) at ./tpool/tpool_generic.cc:344
            No locals.
            #10 tpool::thread_pool_generic::timer_generic::execute (arg=0x5560fb816240) at ./tpool/tpool_generic.cc:364
                    timer = 0x5560fb816240
            #11 0x00005560bfa74a6f in tpool::task::execute (this=0x5560fb816280) at ./tpool/task.cc:37
            No locals.
            #12 tpool::task::execute (this=0x5560fb816280) at ./tpool/task.cc:27
            No locals.
            #13 0x00005560bfa72cc7 in tpool::thread_pool_generic::worker_main (this=0x5560f8538700, thread_var=0x5560f8688180) at ./tpool/tpool_generic.cc:583
                    task = 0x5560fb816280
            

            I see that the watchdog is invoking my_hrtime_coarse(), which translates into my_hrtime(), which in turn will probably invoke clock_gettime(CLOCK_REALTIME) to read the timer. The Linux manual page mentions the following:

            This clock is affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the clock), and by frequency adjustments performed by NTP and similar applications via adjtime(3), adjtimex(2), clock_adjtime(2), and ntp_adjtime(3).

            Could this be applicable here?

            There also is a monotonic timer that should not be affected by this, but it would be more expensive to read.
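
            As an illustration of that difference, here is a minimal standalone POSIX sketch (not MariaDB code, and not how my_hrtime() is actually implemented): a jump of CLOCK_REALTIME between two watchdog samples would distort the apparent wait time, whereas CLOCK_MONOTONIC is immune to such adjustments.

            #include <stdio.h>
            #include <time.h>

            // Standalone POSIX sketch: CLOCK_REALTIME may jump when the wall clock is
            // adjusted (manually or by NTP), while CLOCK_MONOTONIC only moves forward.
            static unsigned long long clock_us(clockid_t id)
            {
              timespec ts;
              clock_gettime(id, &ts);
              return 1000000ULL * ts.tv_sec + ts.tv_nsec / 1000;
            }

            int main()
            {
              unsigned long long wall= clock_us(CLOCK_REALTIME);  // adjustable wall clock
              unsigned long long mono= clock_us(CLOCK_MONOTONIC); // adjustment-free clock
              printf("realtime=%llu us, monotonic=%llu us\n", wall, mono);
              return 0;
            }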

            For the record, I was not keen to retain this watchdog in MDEV-21452, but I did it on popular demand. I’ll need some more time to check if the longer set of stack traces is indicating a real problem.

            marko Marko Mäkelä added a comment

            In the longer stack traces, we have a periodic task that is invoking dict_sys_t::evict_table_LRU(), which will acquire an exclusive dict_sys.latch. This will block any subsequent shared latch requests until the exclusive latch has been acquired and released. I will need to check which threads could be holding a shared latch, and what is blocking them.

            A search for dict_sys in the output finds dict_sys_t::load_table(), which must be holding an exclusive dict_sys.latch. It is blocked by a pread system call, reading a page of the SYS_TABLES table. Is the system I/O bound by any chance? Could innodb_open_files and table_open_cache be configured to be larger? Also, would it be thinkable to set innodb_stats_auto_recalc=off?

            I also see that Thread 484 is reading a page that Thread 434 is waiting for. We seem to have at least two page read requests that are apparently taking a long time. Have there been any messages about file systems or storage in the system logs or in the S.M.A.R.T. logs? Does the storage pass the S.M.A.R.T. health checks?

            marko Marko Mäkelä added a comment

            Marko,

            I don't believe this is a hardware issue. I'm running two separate Galera clusters (a main and a backup) connected via asynchronous replication. Everything was running fine until I upgraded both clusters to 10.11.10, when this issue started happening. After downgrading the main cluster to 10.11.9, the issue stopped happening on the main cluster. It is still happening on the backup cluster. I've uploaded two more stack traces just now.

            None of the machines are having hardware issues.

            I truly appreciate your efforts looking at this!

            xan@biblionix.com Xan Charbonnet added a comment
            marko Marko Mäkelä added a comment - edited

            The crash during shutdown in MDEV-35424_stack_trace_2.txt.gz is what I referred to as "the second stack trace" in my first comment on 2024-12-18. Maybe wlad has some idea about that? It does not seem to me that anything can actually be holding dict_sys.latch there. Could the system time have been moved backwards?

            In MDEV-35424_stacktrace_njgamma_2024-11-29.txt we have a crash in ha_innobase::create() because the wait for an exclusive dict_sys.latch has apparently taken too long. There are several other threads blocked in a pread operation. One of them is Thread 75, which is waiting for a page to be read in dict_load_foreigns(). This is something that I would fix in MDEV-35154, by quickly creating a dict_table_t "stub" and then releasing dict_sys.latch before accessing the InnoDB data dictionary tables.


            xan@biblionix.com, thanks to your heroic effort in MDEV-35886, this could be a Linux kernel bug. Can you confirm if your kernel patch would make this hang go away?

            marko Marko Mäkelä added a comment
            xan@biblionix.com Xan Charbonnet added a comment - edited

            Thank you Marko. I have installed the patched kernel on the 10.11.10 Galera cluster where I've been observing problems. We'll see what happens.

            Taking another look at the issue now, however, I'm thinking it might actually be related to MDEV-34753, in particular, the issue that @Adelin is having.

            This is a cluster of 3 servers, one of which (and therefore the whole cluster) is an asynchronous slave of another cluster.

            I'll take one for example: just since Sunday, the log contains 4 "memory pressure" events:

            2025-01-26T02:11:03.535382-06:00 a-nj-gamma-1 mariadbd[557361]: 2025-01-26  2:11:03 0 [Note] InnoDB: Memory pressure event freed 1426426 pages
            2025-01-27T01:18:46.205575-06:00 a-nj-gamma-1 mariadbd[557361]: 2025-01-27  1:18:46 0 [Note] InnoDB: Memory pressure event freed 2823438 pages
            2025-01-28T02:01:51.009936-06:00 a-nj-gamma-1 mariadbd[557361]: 2025-01-28  2:01:51 0 [Note] InnoDB: Memory pressure event freed 4416513 pages
            2025-01-28T08:39:13.193821-06:00 a-nj-gamma-1 mariadbd[557361]: 2025-01-28  8:39:13 0 [Note] InnoDB: Memory pressure event freed 1888696 pages

            Two of those events were immediately followed by this:

            2025-01-26T02:11:03.535880-06:00 a-nj-gamma-1 mariadbd[557361]: 2025-01-26  2:11:03 38826579 [Warning] InnoDB: A long wait (11 seconds) was observed for dict_sys.latch
            2025-01-28T08:39:13.197187-06:00 a-nj-gamma-1 mariadbd[557361]: 2025-01-28  8:39:13 50356276 [Warning] InnoDB: A long wait (10 seconds) was observed for dict_sys.latch

            I have logging that runs once a minute and records a lot of different statuses of the system. Here's the snapshot that was taken at 2:11:01am, that is, two seconds before one of the "memory pressure" warnings:

            top - 02:11:01 up 52 days,  5:11,  0 user,  load average: 2.24, 1.68, 1.32
            Tasks: 767 total,   4 running, 763 sleeping,   0 stopped,   0 zombie
            %Cpu(s):  8.0 us,  3.2 sy,  0.0 ni, 88.0 id,  0.8 wa,  0.0 hi,  0.0 si,  0.0 st 
            MiB Mem : 257473.4 total,   1977.6 free,  25939.0 used, 139946.7 buff/cache     
            MiB Swap:      0.0 total,      0.0 free,      0.0 used. 231534.4 avail Mem 
             
                PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
             557361 mysql     20   0  135.0g 105.5g   5.0g S 100.0  42.0     39,33 /usr/sbin/mariadbd --wsrep_start_position=02841326-d132-11ef-9279-2f607e8f4b06:1415490

            As you can see, MariaDB was indeed taking a lot of memory, but only around half of what was available. This appears to be a memory-pressure false alarm.

            Based on this small sample, 50% of these memory pressure events cause long dict_sys.latch waits. I wonder if the hangs I'm observing happen to be times when the wait is longer than others for whatever reason.

            It sounds like the root cause of all of this might be the memory pressure false alarms.

            All three servers are getting a regular stream of memory pressure events when there is in fact no memory pressure.


            The memory pressure events (MDEV-24670) will trigger a cleanup of the InnoDB buffer pool, freeing all blocks that can be freed. If this happens on a busy server, many pages will have to be read back soon again. Possibly, danblack has an idea how the memory pressure interface could be disabled. Would we need a new configuration parameter for that?

            We have a known design issue that unnecessarily makes the dict_sys.latch a contention point during file I/O. We need to fix MDEV-34999 and the linked related tickets. The dict_sys.latch and its precursor dict_operation_lock always violated the design principle that a global mutex or rw-lock must not be held across potentially time consuming system calls, such as reading an index page.
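
            A generic sketch of that principle, independent of the InnoDB code (the names below are illustrative only): the slow pread() should happen outside the critical section, otherwise every other thread queuing on the global latch is stalled behind the I/O.

            #include <mutex>
            #include <unistd.h>

            // Illustration only, not InnoDB code.
            std::mutex global_latch;

            void bad_pattern(int fd, void *buf, size_t len, off_t off)
            {
              std::lock_guard<std::mutex> g(global_latch);
              pread(fd, buf, len, off);   // all other waiters stall behind this I/O
            }

            void better_pattern(int fd, void *buf, size_t len, off_t off)
            {
              pread(fd, buf, len, off);   // perform the slow read without the latch
              std::lock_guard<std::mutex> g(global_latch);
              // ...publish the result while holding the latch only briefly...
            }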

            marko Marko Mäkelä added a comment

            It certainly seems like the memory pressure events are faulty. I don't know what promises the kernel makes about sending them, but either it's sending them incorrectly or MariaDB needs a way to tell whether the server is actually having a memory issue when it gets one.

            Failing that, a way to turn it off would be required, and in fact disabling it should probably be the default, at least for older release series that had the fix for MDEV-34753 backported to them.

            I'll need to either stick with 10.11.9 or undo the MDEV-34753 change.

            xan@biblionix.com Xan Charbonnet added a comment

            To be clear, I'm not sure whether the issue that caused this report in the first place is related to memory pressure. Maybe it is. On my primary Galera cluster I'm still on 10.11.9, and my secondary Galera cluster (mirrored asynchronously from the primary) has been on 10.11.10 (just upgraded today to 10.11.11).

            The only weird behavior I'm seeing on the secondary cluster is these "InnoDB: Memory pressure event freed XXXX pages" notifications. Maybe those were the cause of the hanging I observed. Maybe they just cause some performance issue. Maybe they don't cause any noticeable problem at all.

            I read up on this whole memory pressure system: https://www.kernel.org/doc/html/latest/accounting/psi.html

            I was operating under the assumption that memory pressure meant that the system was running low on free memory, and since I'm not anywhere close to running low on memory, the events were firing incorrectly.

            However, it turns out that the notification system is based on "stalling": how much time a process (in this case MariaDB) has spent stalled waiting for a resource (in this case memory).

            Here are the contents of /proc/pressure/memory on a system that has had 11 MariaDB memory pressure events so far this week, with an uptime of 6 days 15 hours:

            some avg10=0.00 avg60=0.00 avg300=0.00 total=38026736
            full avg10=0.00 avg60=0.00 avg300=0.00 total=37812694
            

            I don't know whether those totals are abnormally high or not. It looks like all my systems are building that "total" number at around the same rate.
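
            For scale (the kernel documentation defines "total" as cumulative stall time in microseconds), those values correspond to roughly 38 seconds of stall accumulated over the 6 days 15 hours of uptime, i.e. well under 0.01% of wall-clock time:

            #include <cstdio>

            int main()
            {
              // "total" in /proc/pressure/memory is cumulative stall time in microseconds.
              const double some_total_us= 38026736;            // value quoted above
              const double uptime_s= (6 * 24 + 15) * 3600.0;   // 6 days 15 hours
              printf("some stall: %.1f s of %.0f s uptime (%.4f%%)\n",
                     some_total_us / 1e6, uptime_s,
                     100.0 * some_total_us / (1e6 * uptime_s));
              return 0;
            }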

            In storage/innobase/buf/buf0buf.cc is this:

            const char* const mem_pressure::m_triggers[]=
              {"some 5000000 10000000", /* 5s out of 10s */
               "full 10000 2000000"}; /* 10ms out of 2s */
            

            This means that MariaDB will do this garbage collection when "some" tasks have been stalled waiting for memory for 5 seconds out of any 10-second window, or when all tasks have been stalled for 10 milliseconds out of any 2-second window.

            I don't believe there's a way to tell whether I'm hitting the "some" or the "full" threshold. My guess is it's the "full" threshold. Do we know where the 10ms out of 2s threshold came from? Could the issue be that this threshold is too sensitive?
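
            For reference, this is roughly how such a trigger is registered according to the kernel PSI documentation linked above (an illustration of the kernel interface only, not the MariaDB implementation): the trigger string is written to /proc/pressure/memory, and the file descriptor is then polled for POLLPRI, one wakeup per event.

            #include <cstdio>
            #include <cstring>
            #include <fcntl.h>
            #include <poll.h>
            #include <unistd.h>

            // Illustration of the kernel PSI trigger interface, not the MariaDB code.
            int main()
            {
              // "full 10000 2000000": notify when all tasks were stalled on memory for a
              // total of 10 ms within any 2 s window (the "full" trigger quoted above).
              const char *trigger= "full 10000 2000000";
              int fd= open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
              if (fd < 0 || write(fd, trigger, strlen(trigger) + 1) < 0)
              {
                perror("psi trigger");
                return 1;
              }
              pollfd pfd= {fd, POLLPRI, 0};
              // Each POLLPRI wakeup corresponds to one "memory pressure event".
              while (poll(&pfd, 1, -1) > 0)
              {
                if (pfd.revents & POLLERR)
                  break;                        // the trigger is no longer valid
                if (pfd.revents & POLLPRI)
                  puts("memory pressure event");
              }
              close(fd);
              return 0;
            }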

            xan@biblionix.com Xan Charbonnet added a comment

            xan@biblionix.com, I’m planning to revise the memory pressure interface in MDEV-34863. I believe that we need to introduce some limits on how much of the buffer pool would be freed in response to a memory pressure event. We should also make the response compatible with large_pages, and invoke madvise(MADV_FREE) on larger chunks, instead of individual buffer blocks.

            marko Marko Mäkelä added a comment

            danblack, https://github.com/MariaDB/server/pull/3826 now includes a tentative fix of MDEV-34863, which would introduce a parameter innodb_buffer_pool_size_min. A single pressure event would attempt to shrink the buffer pool by 8 MiB (or on 32-bit systems by 2 MiB) unless the minimum size would be reached.

            This might not be ideal (the user would have to manually SET GLOBAL innodb_buffer_pool_size to a larger value), but this would be a more graceful response to the memory pressure events, and one that could be disabled by specifying a larger innodb_buffer_pool_size_min. I have only covered it with the debug-instrumented regression test innodb.mem_pressure so far. Can you please test whether it works in a reasonable way when actual memory pressure events are triggered?

            marko Marko Mäkelä added a comment

            A machine in my 10.11.11 Galera cluster had another "innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch", "got signal 6" event yesterday and dumped its core. It doesn't seem to have been related to a memory pressure event. There was a memory pressure event near it, but it was after the shutdown had started. The fatal event was at 2025-03-11 09:56:37 US Central, and the memory pressure event was at 09:56:41.

            I've generated a full stacktrace (hopefully correctly) and uploaded it to the "private" folder of the MariaDB SFTP server. It's called MDEV-35424_backtraces_nj-beta_2025-03-11.txt.

            xan@biblionix.com Xan Charbonnet added a comment

            Thank you. Meanwhile, I revised my proposed fix of MDEV-34863 so that a memory pressure event would attempt to shrink the buffer pool to halfway between the current size and a user-specified minimum limit, which by default would be set so that memory pressure events would not trigger any shrinking.
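
            In other words (a paraphrase; the exact semantics are defined by the pull request), the shrink target of a single event would be roughly:

            #include <cstddef>

            // Assumed semantics: aim halfway between the current buffer pool size and
            // the configured minimum, never shrinking below the minimum.
            size_t shrink_target(size_t cur_size, size_t min_size)
            {
              return cur_size <= min_size ? min_size : min_size + (cur_size - min_size) / 2;
            }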

            The stack traces look reasonable. We’re missing the debug information for libstdc++.so, but it does not really matter. The single-frame threads showing ?? are likely io_uring worker threads, which I believe are running code that belongs to the Linux kernel. We can see that the buf_flush_page_cleaner() thread is in an unbounded wait, that is, there is no buffer pool flushing going on. I also see that the shutdown is trying to reset the watchdog timer:

            	srv_shutdown_state = SRV_SHUTDOWN_CLEANUP;
             
            	if (srv_buffer_pool_dump_at_shutdown &&
            		!srv_read_only_mode && srv_fast_shutdown < 2) {
            		buf_dump_start();
            	}
            	srv_monitor_timer.reset();
             
            	if (do_srv_shutdown) {
            		srv_shutdown(srv_fast_shutdown == 0);
            	}
             
             
            loop:
            

            Only after the loop: label would the shutdown start the final buffer pool flush. At first I suspected some kind of glitch around the srv_monitor_timer.reset() call, but the cause seems to lie elsewhere:

            void dict_sys_t::lock_wait(SRW_LOCK_ARGS(const char *file, unsigned line)) noexcept
            {
              ulonglong now= my_hrtime_coarse().val, old= 0;
              if (latch_ex_wait_start.compare_exchange_strong
                  (old, now, std::memory_order_relaxed, std::memory_order_relaxed))
              {
                latch.wr_lock(SRW_LOCK_ARGS(file, line));
                latch_ex_wait_start.store(0, std::memory_order_relaxed);
                return;
              }
              // …
              latch.wr_lock(SRW_LOCK_ARGS(file, line));
            }
            

            This function is not always resetting the latch_ex_wait_start to disable the watchdog kill. If I remember correctly, my reasoning was that there could be multiple threads waiting for an exclusive dict_sys.latch, and we would want to preserve the time when the first pending wait started. The idea is that the thread that is blocked in the early return code path would reset the field. I can easily see that we could miss some hangs, but it is not obvious to me how we could get a false alarm like the one that you reproduced.

            To fix this, we will need a better data structure that can represent the start time of each pending wait. Maybe even a wait queue, to enforce FIFO ordering of the exclusive latch waits.
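
            A rough sketch of that idea (illustration only, not a proposed patch; all names below are made up): the watchdog would consult the oldest entry of a registry of per-waiter start times instead of a single shared timestamp.

            #include <cstdint>
            #include <mutex>
            #include <set>

            // Hypothetical sketch: track one start timestamp per pending exclusive-latch
            // waiter, so that one waiter leaving the queue cannot clear (or fail to
            // clear) the timestamp that belongs to another waiter.
            class latch_wait_registry
            {
              std::mutex m;
              std::multiset<uint64_t> waits;    // start times of all pending waits

            public:
              // Called right before blocking on latch.wr_lock().
              std::multiset<uint64_t>::iterator enter(uint64_t start_us)
              {
                std::lock_guard<std::mutex> g(m);
                return waits.insert(start_us);
              }

              // Called right after the latch has been acquired.
              void leave(std::multiset<uint64_t>::iterator it)
              {
                std::lock_guard<std::mutex> g(m);
                waits.erase(it);                // removes exactly this waiter's entry
              }

              // Called by the watchdog: 0 means "nobody is waiting".
              uint64_t oldest_wait_start()
              {
                std::lock_guard<std::mutex> g(m);
                return waits.empty() ? 0 : *waits.begin();
              }
            };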

            Maybe we should just allow innodb_fatal_semaphore_wait_threshold=0 to disable this watchdog altogether? After all, we want to make the dict_sys.latch less contended (MDEV-33594, MDEV-34988, MDEV-34999, MDEV-35436), and as a result of those changes, this watchdog would lose its effectiveness. Most recent hangs that we have analyzed have occurred because of the buffer pool. Maybe we should implement another watchdog there, such as making MDEV-36226 observe innodb_fatal_semaphore_wait_threshold?

            For the record, I was not keen to implement any replacement for the infamous "A long semaphore wait" watchdog. In fact, when I removed the sync_array in MDEV-21452, I initially removed the watchdog outright and only later implemented a replacement, which was revised in MDEV-24258. Some colleagues in QA and support insisted that we retain something to catch hangs, and I thought that covering only the dict_sys.latch could be a good enough approximation.

            marko Marko Mäkelä added a comment
            This was reported in Debian as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1101340

            otto Otto Kekäläinen added a comment

            For the record, MDEV-34863 disables the memory pressure event handling by default.

            marko Marko Mäkelä added a comment

            People

              Assignee: debarun Debarun Banerjee
              Reporter: xan@biblionix.com Xan Charbonnet
              Votes: 0
              Watchers: 5

