Thank you. Meanwhile, I revised my proposed fix of MDEV-34863 so that a memory pressure event would attempt to shrink the buffer pool to halfway between the current size and a user-specified minimum limit, which by default would be set so that memory pressure events would not trigger any shrinking.
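A minimal sketch of that arithmetic (the function name and parameters are illustrative, not taken from the patch): the target lands halfway between the current size and the configured minimum, so a minimum that by default equals the current size results in no shrinking at all.

#include <cstddef>

// Illustrative only: shrink target for one memory pressure event under the
// "halfway" rule; assumes cur_size >= min_size.
size_t pressure_shrink_target(size_t cur_size, size_t min_size)
{
  return min_size + (cur_size - min_size) / 2;
}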
The stack traces look reasonable. We are missing the debug information for libstdc++.so, but that does not really matter. The single-frame threads showing ?? are likely io_uring worker threads, which I believe execute code inside the Linux kernel. We can see that the buf_flush_page_cleaner() thread is in an unbounded wait, that is, no buffer pool flushing is going on. I also see that the shutdown is trying to reset the watchdog timer:
  srv_shutdown_state = SRV_SHUTDOWN_CLEANUP;

  if (srv_buffer_pool_dump_at_shutdown &&
      !srv_read_only_mode && srv_fast_shutdown < 2) {
    buf_dump_start();
  }
  srv_monitor_timer.reset();

  if (do_srv_shutdown) {
    srv_shutdown(srv_fast_shutdown == 0);
  }

loop:
Only after the loop: label would the shutdown start the final buffer pool flush. At first I suspected some kind of glitch around the srv_monitor_timer.reset() call, but the cause seems to lie elsewhere:
void dict_sys_t::lock_wait(SRW_LOCK_ARGS(const char *file, unsigned line)) noexcept
{
  ulonglong now= my_hrtime_coarse().val, old= 0;
  if (latch_ex_wait_start.compare_exchange_strong
      (old, now, std::memory_order_relaxed, std::memory_order_relaxed))
  {
    latch.wr_lock(SRW_LOCK_ARGS(file, line));
    latch_ex_wait_start.store(0, std::memory_order_relaxed);
    return;
  }
  // …
  latch.wr_lock(SRW_LOCK_ARGS(file, line));
}
This function does not always reset latch_ex_wait_start to disarm the watchdog kill. If I remember correctly, my reasoning was that there could be multiple threads waiting for an exclusive dict_sys.latch, and we would want to preserve the time when the first pending wait started. The idea is that the thread blocked in the early-return code path would reset the field. I can easily see how we could miss some hangs, but it is not obvious to me how we could get a false alarm like the one that you reproduced.
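For context, here is a simplified model of the watchdog side of this (my own sketch with stand-in names and a standard-library clock, not the actual server code): a periodic task samples the single shared start-of-wait timestamp, and a nonzero value older than the configured threshold is treated as a stuck exclusive dict_sys.latch wait.

#include <atomic>
#include <chrono>
#include <cstdint>

// Sketch only: plays the role of latch_ex_wait_start above.
static std::atomic<uint64_t> latch_ex_wait_start{0};  // microseconds; 0 = no pending wait

static uint64_t now_usec()  // stand-in for my_hrtime_coarse().val
{
  using namespace std::chrono;
  return duration_cast<microseconds>(system_clock::now().time_since_epoch()).count();
}

// Invoked periodically; returns true if the recorded wait exceeds the threshold.
static bool fatal_wait_exceeded(uint64_t threshold_usec)
{
  const uint64_t start= latch_ex_wait_start.load(std::memory_order_relaxed);
  return start != 0 && now_usec() - start > threshold_usec;
}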
To fix this, we will need a better data structure that can represent the start time of each pending wait. Maybe even a wait queue, to enforce FIFO ordering of the exclusive latch waits.
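One possible shape for such a structure (a sketch under my own naming, not a concrete design; an actual FIFO wait queue would go further and also control the order in which the latch is granted): each waiter registers its start time before blocking and removes it after acquiring the latch, and the watchdog only ever needs to look at the oldest entry.

#include <cstdint>
#include <mutex>
#include <set>

// Sketch only: start times of all pending exclusive-latch waits.
class latch_wait_registry
{
  std::mutex m;
  std::multiset<uint64_t> starts;  // wait start times in microseconds
public:
  std::multiset<uint64_t>::iterator enter(uint64_t now)
  {
    std::lock_guard<std::mutex> g{m};
    return starts.insert(now);
  }
  void leave(std::multiset<uint64_t>::iterator it)
  {
    std::lock_guard<std::mutex> g{m};
    starts.erase(it);
  }
  uint64_t oldest()  // 0 if nobody is waiting
  {
    std::lock_guard<std::mutex> g{m};
    return starts.empty() ? 0 : *starts.begin();
  }
};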
Maybe we should just allow innodb_fatal_semaphore_wait=0 to disable this watchdog altogether? After all, we want to make the dict_sys.latch less contended (MDEV-33594, MDEV-34988, MDEV-34999, MDEV-35436), and as a result of those changes, this watchdog would lose its effectiveness. Most of the recent hangs that we have analyzed have occurred because of the buffer pool. Maybe we should implement another watchdog there, such as making MDEV-36226 observe innodb_fatal_semaphore_wait?
For the record, I was not keen to implement any replacement for the infamous "A long semaphore wait" watchdog. In fact, when I removed the sync_array in MDEV-21452, I initially removed the watchdog along with it and only later implemented a replacement, which was revised in MDEV-24258. Some colleagues in QA and support insisted that we retain something to catch hangs, and I thought that covering only the dict_sys.latch could be a good enough approximation.
danblack, https://github.com/MariaDB/server/pull/3826 now includes a tentative fix of MDEV-34863, which would introduce a parameter innodb_buffer_pool_size_min. A single pressure event would attempt to shrink the buffer pool by 8 MiB (or by 2 MiB on 32-bit systems) unless the minimum size has been reached. This might not be ideal (the user would have to manually SET GLOBAL innodb_buffer_pool_size to a larger value afterwards), but it would be a more graceful response to memory pressure events, and one that could be disabled by specifying a larger innodb_buffer_pool_size_min. So far, I have only covered it with the debug-instrumented regression test innodb.mem_pressure. Can you please test whether it works in a reasonable way when actual memory pressure events are triggered?
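Roughly, the per-event step described above looks like this (illustrative code, not the contents of the pull request): shrink by a fixed step, 8 MiB on 64-bit or 2 MiB on 32-bit systems, but never below the configured minimum.

#include <cstddef>

// Illustrative only: target size after one memory pressure event;
// assumes cur_size >= min_size.
constexpr size_t shrink_step= sizeof(void*) >= 8
  ? size_t{8} << 20   // 8 MiB on 64-bit systems
  : size_t{2} << 20;  // 2 MiB on 32-bit systems

size_t shrink_target(size_t cur_size, size_t min_size)
{
  return cur_size - min_size > shrink_step ? cur_size - shrink_step : min_size;
}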