[MDEV-29843] Server hang in thd_decrement_pending_ops/pthread_cond_signal Created: 2022-10-21  Updated: 2023-06-08  Resolved: 2022-10-25

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6.8, 10.6.10
Fix Version/s: 10.6.11, 10.7.7, 10.8.6, 10.9.4, 10.10.2, 10.11.1

Type: Bug Priority: Major
Reporter: Nico Jehle Assignee: Vladislav Vaintroub
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 7
Systems are replicas


Attachments: MDEV-29843.full_backtrace.host_b.2022-10-24.log, MDEV-29843.full_backtrace.log, MDEV-29843.host_b.2022-10-24.arguments.log
Issue Links:
Duplicate
is duplicated by MDEV-31427 MariaDB replication server's SQL Thre... Closed
Problem/Incident
is caused by MDEV-24341 Innodb - do not block in foreground t... Closed
Relates
relates to MDEV-31293 Threads stuck on semaphore wait causi... Closed
relates to MDEV-29660 [ERROR] [FATAL] InnoDB: innodb_fatal_... Closed

 Description   

Since upgrading to MariaDB 10.6, our systems randomly abort due to "InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch".
I could not tell whether it's the same bug as e.g. MDEV-29590, so I'm filing this as a new issue.

We could not find steps to reproduce it; the aborts seem to happen randomly, sometimes after a few days of uptime, sometimes after a few hours.
The affected systems are all replicas, but I'm not sure whether that matters in this case.

error log:

2022-10-19 13:01:01 0 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/
221019 13:01:01 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.6.10-MariaDB-log
key_buffer_size=67108864
read_buffer_size=131072
max_used_connections=320
max_threads=1002
thread_count=308
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 732888 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
??:0(my_print_stacktrace)[0x5655215226ae]
??:0(handle_fatal_signal)[0x565520f71527]
sigaction.c:0(__restore_rt)[0x7f3c7f19b630]
:0(__GI_raise)[0x7f3c7e5e6387]
:0(__GI_abort)[0x7f3c7e5e7a78]
??:0(std::pair<std::_Rb_tree_iterator<std::pair<unsigned long const, bool> >, bool> std::_Rb_tree<unsigned long, std::pair<unsigned long const, bool>, std::_Select1st<std::pair<unsigned long const, bool> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, bool> > >::_M_emplace_unique<unsigned long&, bool>(unsigned long&, bool&&))[0x5655213a5780]
??:0(void std::__introsort_loop<unsigned char**, long>(unsigned char**, unsigned char**, long))[0x56552135cd29]
??:0(tpool::thread_pool_generic::timer_generic::execute(void*))[0x5655214ac390]
??:0(tpool::task::execute())[0x5655214ad29b]
??:0(tpool::thread_pool_generic::worker_main(tpool::worker_data*))[0x5655214aa6d1]
??:0(std::this_thread::__sleep_for(std::chrono::duration<long, std::ratio<1l, 1l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> >))[0x7f3c7ed35330]
pthread_create.c:0(start_thread)[0x7f3c7f193ea5]
??:0(__clone)[0x7f3c7e6aeb0d]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /srv/mysql/pdns-customer-slave/data
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             2061349              2061349              processes 
Max open files            50000                50000                files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       2061349              2061349              signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: core
 
Kernel version: Linux version 3.10.0-1160.76.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Wed Aug 10 16:21:17 UTC 2022

gdb backtrace:

#0  0x00007f3c7f198aa1 in __pthread_kill (threadid=<optimized out>, signo=6) at ../nptl/sysdeps/unix/sysv/linux/pthread_kill.c:59
#1  0x0000565520f7159e in handle_fatal_signal (sig=6) at /usr/src/debug/MariaDB-/src_0/sql/signal_handler.cc:355
#2  <signal handler called>
#3  0x00007f3c7e5e6387 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#4  0x00007f3c7e5e7a78 in __GI_abort () at abort.c:90
#5  0x00005655213a5780 in ib::fatal::~fatal (this=<optimized out>, __in_chrg=<optimized out>)
    at /usr/src/debug/MariaDB-/src_0/storage/innobase/ut/ut0ut.cc:515
#6  0x000056552135cd29 in srv_monitor_task () at /usr/src/debug/MariaDB-/src_0/storage/innobase/srv/srv0srv.cc:1300
#7  0x00005655214ac390 in run (this=0x565522c552a0) at /usr/src/debug/MariaDB-/src_0/tpool/tpool_generic.cc:343
#8  tpool::thread_pool_generic::timer_generic::execute (arg=0x565522c552a0) at /usr/src/debug/MariaDB-/src_0/tpool/tpool_generic.cc:363
#9  0x00005655214ad29b in tpool::task::execute (this=0x565522c552e0) at /usr/src/debug/MariaDB-/src_0/tpool/task.cc:37
#10 0x00005655214aa6d1 in tpool::thread_pool_generic::worker_main (this=0x565522a65670, thread_var=0x565522a67fa0)
    at /usr/src/debug/MariaDB-/src_0/tpool/tpool_generic.cc:580
#11 0x00007f3c7ed35330 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>)
    at ../../../../../libstdc++-v3/src/c++11/thread.cc:84
#12 0x00007f3c7f193ea5 in start_thread (arg=0x7f3ac3fff700) at pthread_create.c:307
#13 0x00007f3c7e6aeb0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111



 Comments   
Comment by Marko Mäkelä [ 2022-10-21 ]

njehle, thank you for producing a stack trace with GDB. However, that crashing thread is only the messenger. Can you please post the output of thread apply all backtrace so that I can see which threads are actually involved in the deadlock?

I may need some further information for a particular crash, so please save the core dump and be prepared to run more GDB commands. Some special steps may be necessary to counter MDEV-10814: these hangs usually involve the InnoDB buffer pool, and the block descriptors would otherwise be excluded from core dumps.

Comment by Nico Jehle [ 2022-10-21 ]

MDEV-29843.full_backtrace.log
Thanks for the quick reply.

I've attached a full backtrace of the same core dump, created by following the steps at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/#getting-backtraces-with-gdb-on-linux

set logging on
set pagination off
set print frame-arguments all
thread apply all bt full
set logging off

We've moved the core dump to a separate system with the debuginfo packages installed, so we can run further commands.

If there is a need to adjust the configuration to include more data in the core dump, please let me know.

Comment by Marko Mäkelä [ 2022-10-21 ]

For the record, this does not look at all related to MDEV-29835.

On a quick look, very many threads are waiting for dict_sys.latch. I do not think that we explicitly keep track of the writer thread ID in that type of lock, so it will take time to figure out which thread is holding it exclusively, and what that thread is doing.

By accident, I may have found the dict_sys.latch owner thread: Thread 13 is waiting inside log_write_up_to() for an update of InnoDB persistent statistics to become durable. To me, it does not look blocked at the moment. wlad wrote group_commit_lock::release(); I hope that he can check whether anything could get stuck there.

It could help if you could collect multiple sets of stack traces when the server appears to be hung.

Does the storage hardware work flawlessly? Have you seen any log messages (including Linux dmesg) about trouble? Does smartctl report anything of interest?

Comment by Nico Jehle [ 2022-10-21 ]

Does the storage hardware work flawlessly? Have you seen any log messages (including Linux dmesg) about trouble? Does smartctl report anything of interest?

Forgot to mention that in the initial description; it's not a single server: we have ~80 hosts, each with ~10 instances of MariaDB, and there does not seem to be a pattern as to which server or DB instance the issue occurs on. So far it has only affected single instances on different servers. And nothing unusual in the server logs.

It could help if you could collect multiple sets of stack traces when the server appears to be hung.

The core dump in this ticket is from a system that was rebooted to enable core dumps.
It has been up since 2022-10-19 02:56:29 CEST; MariaDB aborted roughly 10 hours later, at 2022-10-19 13:01:01.
There were no prior indicators of an issue; nothing unusual in the journal, dmesg, or the MariaDB error log this time.
So I'm not sure how we can get an additional core dump while the server is in the hung state, before the abort kicks in.

For the past crashes we unfortunately don't have core dumps, as we enabled them only this week,
but I'll check the logs for all instances next week; I recall a warning that the wait threshold would soon be reached, not just that it had already been reached.
We could potentially watch for those lines to trigger core dumps, but they apparently do not appear every time it hangs.

Comment by Marko Mäkelä [ 2022-10-21 ]

What are your configuration parameters, such as innodb_buffer_pool_size, innodb_log_file_size, innodb_flush_log_at_trx_commit? On what file system and type of storage is the ib_logfile0 located?

Comment by Nico Jehle [ 2022-10-22 ]

In the config there are only those for innodb:

instance my.cnf

innodb_stats_on_metadata                 = OFF
innodb_buffer_pool_size                  = 2000M
innodb_data_home_dir                     = /srv/mysql/<instance>/binlog/innodb
innodb_data_file_path                    = ibdata1:128M:autoextend
innodb_autoextend_increment              = 256
innodb_file_per_table                    = 1
innodb_log_group_home_dir                = /srv/mysql/<instance>/binlog/innodb
innodb_flush_log_at_trx_commit           = 2
innodb_flush_method                      = O_DIRECT
innodb_log_buffer_size                   = 8M
innodb_log_file_size                     = 128M
innodb_purge_threads                     = 1
innodb_read_io_threads                   = 8
innodb_write_io_threads                  = 8
innodb_fast_shutdown                     = 1
innodb_buffer_pool_filename              = ib_buffer_pool
innodb_buffer_pool_load_at_startup       = 1
innodb_buffer_pool_dump_at_shutdown      = 1

We're only passing that configuration to MariaDB via ExecStart=/usr/sbin/mysqld --defaults-file=/srv/mysql/%i/my.cnf in the systemd service, so everything else is at the defaults.

Storage is XFS on SATA SSDs; 2x SSDSC2KG960G8R as RAID1 on a PERC H330 Mini.

Comment by Nico Jehle [ 2022-10-24 ]

Happened on another host and instance (same config, just different data); attaching the backtrace of that, a bit smaller this time.
MDEV-29843.full_backtrace.host_b.2022-10-24.log

Comment by Marko Mäkelä [ 2022-10-24 ]

I see that you have a very small redo log: 128M is only 6.4% of the 2000M buffer pool size. That may cause frequent log checkpointing and excessive page writing.

Also in MDEV-29843.full_backtrace.host_b.2022-10-24.log we can see that dict_stats_save() is waiting inside log_write_up_to(). The buf_flush_page_cleaner() thread is idle.

It feels very strange that pthread_cond_signal() appears to be waiting for something. I checked the source code of GNU libc 2.35 (hopefully not too different from the one that you are using), and found the following in glibc-2.35/nptl/pthread_cond_signal.c:

___pthread_cond_signal (pthread_cond_t *cond)
{
  LIBC_PROBE (cond_signal, 1, cond);
 
  /* First check whether there are waiters.  Relaxed MO is fine for that for
     the same reasons that relaxed MO is fine when observing __wseq (see
     below).  */
  unsigned int wrefs = atomic_load_relaxed (&cond->__data.__wrefs);
  if (wrefs >> 3 == 0)
    return 0;
  int private = __condvar_get_private (wrefs);
 
  __condvar_acquire_lock (cond, private);

The __condvar_acquire_lock() in nptl/pthread_cond_common.c includes a spin loop and a futex call. It might be that your GNU libc version included a direct call to __lll_lock_wait() here.

In the core dump, can you try to dump the arguments of the following:

Thread 33 (Thread 0x7fc422ffd700 (LWP 31492)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
No locals.
#1  0x00007fd75822114d in pthread_cond_signal@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_signal.S:144
No locals.
#2  0x000055a5679579cd in thd_decrement_pending_ops (thd=0x7fc6c40009b8) at /usr/src/debug/MariaDB-/src_0/include/mysql/psi/mysql_thread.h:1099
No locals.
#3  0x000055a567f21036 in group_commit_lock::release (this=this@entry=0x55a5694fd2c0 <write_lock>, num=num@entry=1996704773451) at /usr/src/debug/MariaDB-/src_0/storage/innobase/log/log0sync.cc:388

That is:

print *(group_commit_lock*)0x55a5694fd2c0
print *(THD*)0x7fc6c40009b8

Comment by Nico Jehle [ 2022-10-24 ]

Attaching log with arguments.
MDEV-29843.host_b.2022-10-24.arguments.log

glibc on CentOS 7 is kind of old, it's based on 2.17: https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_cond_signal.c;h=063dcbc32a25d0e36ba04de01581530424c47912;hb=c758a686

Comment by Marko Mäkelä [ 2022-10-24 ]

I think that wlad knows this part of our code best.

Comment by Vladislav Vaintroub [ 2022-10-24 ]

Even when looking most attentively, I do not see how the tiny condition variable in the
tiny thd_async_state class can become corrupted by its use within the class itself. It is used once for signaling and once for waiting, it is properly created and destroyed, and its use is properly protected by a mutex (although the academic discussion about whether to signal a condition variable inside or outside the mutex does not seem to come to an end).

Though I admittedly have no deep knowledge of libc or Linux internals (marko is our subject expert), I'm sure that the fact that no other thread executes anything from THD::async_state, and yet there is a hang on the condition variable signal, tells one thing: the condvar is corrupted by code outside of the THD::async_state class, i.e. by a random memory overwrite.

I tend to blame the replication code, first because THD::async_state sits in the middle of replication-related variables, and also because this code executes on every InnoDB update, yet would now suddenly hang only under replication.

If this is any help, I could try to move the variable outside of the replication area in THD; then something else would be affected by the (alleged, by myself) memory overwrite.

Comment by Marko Mäkelä [ 2022-10-25 ]

wlad, I think that your hypothesis is plausible. Please implement that change. I see that you enhanced the data structure in two phases: first MDEV-21534 in 10.5.2, and then MDEV-24341 in 10.6.0.

njehle, are you using replication? XA transactions? temporary tables? Galera?

Comment by Vladislav Vaintroub [ 2022-10-25 ]

The code is in bb-10.6-MDEV-29843

Comment by Nico Jehle [ 2022-10-25 ]

marko:
All our systems are using replication; the primary systems are maintained by another team in the company.
So yes, we're using replication; no XA transactions, temporary tables, or Galera.

Comment by Marko Mäkelä [ 2022-10-25 ]

wlad, moving the field to an earlier position in THD would have been a work-around. It does not look like your other changes would be a real fix; they could only reduce the probability of this hang. I am not familiar with that code, but the changes look reasonable to me, other than misspelling the function name log_write_up_to.

We have a similar example in MDEV-25611 where a rather frequent hang was “fixed” (in reality, the probability of the hang was drastically reduced) by MDEV-25948.

I guess that we will need some feedback from njehle to see if this really worked. Unfortunately, most users who report hangs would never produce stack traces of all threads, so we can’t know how common this hang is or was.

Comment by Nico Jehle [ 2022-10-25 ]

if this really worked

I rebooted all our servers last week; so far, only 5 MariaDB instances have aborted due to hangs.
With our ~650 instances and how randomly the issue surfaces, we won't be able to reliably check whether the fix helps without rolling it out on all servers and waiting one or two weeks.
The issue has not occurred in any non-Live environments either, and switching to a prerelease MariaDB build with the (potential) fix in our Live environment is probably not something I'll be allowed to do, considering the criticality of the services here.

I'll try to reproduce the issue in our QA with MariaDB 10.6.10, by replicating the queries from Live. Depending on how that goes, I'll look into other options here.

Are there automatic builds/packages for the change from this ticket's git branch? If there are none for CentOS 7, I'll have to look into building those as well.

Comment by Marko Mäkelä [ 2022-10-25 ]

njehle, packages for the latest development snapshots are available at addresses like http://hasky.askmonty.org/archive/10.6/. That could be an option once this fix has been pushed to the main 10.6 branch.

The 10.6.11 release should be out soon (within a week, I hope), and you might simply upgrade to 10.6.11 once it becomes available. This work-around should still be in time for that release.

Comment by Nico Jehle [ 2022-10-26 ]

marko Thanks for the info, the fix has been pushed to the 10.6 branch already.
With this, we'll just update to 10.6.11 once it's been released.

I'll let you know whether we still run into this, something like 1 or 2 weeks after it has been released.

Thanks for the support,
much appreciated.

Comment by Marko Mäkelä [ 2022-11-09 ]

I see that this hang involves a field wrefs inside pthread_cond_t, just like in MDEV-29843 where mariadb-backup appears to be misusing condition variables from the GNU libc developers’ perspective. I must admit that before I implemented MDEV-21452, I did not know much about condition variables; I just assumed that only pthread_cond_wait() or pthread_cond_timedwait() can block.

Because I am not familiar with this part of code, I can’t say if any misuse of pthread_cond_t is possible here, but it could be more plausible than the alternative explanation that the memory was corrupted by some write through a wild pointer (without corrupting anything else).

Comment by Marko Mäkelä [ 2022-11-09 ]

In MDEV-29843.host_b.2022-10-24.arguments.log, I found the following:

async_state = {
    m_state = thd_async_state::NONE, m_command = COM_SLEEP, m_packet = {str = 0x0, length = 0}, m_mtx = {m_mutex = {__data = {__lock = 1, __count = 0, __owner = 31492, 
          __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
        __size = "\001\000\000\000\000\000\000\000\004{\000\000\001", '\000' <repeats 26 times>, __align = 1}, m_psi = 0x0}, m_cond = {m_cond = {__data = {__lock = 2, 
          __futex = 0, __total_seq = 18446744073709551615, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, 
        __size = "\002\000\000\000\000\000\000\000\377\377\377\377\377\377\377\377", '\000' <repeats 31 times>, __align = 2}, m_psi = 0x0}, m_pending_ops = {
      m_counter = {<std::__atomic_base<int>> = {_M_i = 0}, <No data fields>}}}

The m_mtx seems to be owned by Thread 33 (Thread 0x7fc422ffd700 (LWP 31492)) of MDEV-29843.full_backtrace.host_b.2022-10-24.log, as expected. The wrefs was previously known as nwaiters, and it is 0 here. I did not yet study the glibc 2.17 code to see what its expectations would be here.

Comment by Kristian Nielsen [ 2023-06-08 ]

The fix for this is to not use the async completion mechanism for replication worker threads, and I think that makes sense.

I'm thinking the problem could be that the replication worker threads re-use the THD for each transaction. I assume the async completion needs each pending commit completion / OK packet to have its own THD?
If there is no mechanism to pause the replication worker until the THD has gone through completion, it seems plausible that a problem can occur when several transactions collide on completion on the same THD.

And it doesn't make much sense to have such a mechanism, as then the async completion becomes kinda pointless - so the fix seems correct.

Generated at Thu Feb 08 10:11:42 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.