[MDEV-29346] update_rows_log_event hung causing galera cluster failure Created: 2022-08-22  Updated: 2023-12-12

Status: Stalled
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.6.5
Fix Version/s: 10.6

Type: Bug Priority: Critical
Reporter: Khai Ping Assignee: Julius Goryavsky
Resolution: Unresolved Votes: 1
Labels: None
Environment:

3 Node Galera Cluster


Attachments: PNG File image-2022-08-22-12-29-26-874.png     Text File mariadbd_full_bt_all_threads.txt     Text File mariadbd_full_bt_all_threads_11feb246.txt    
Issue Links:
Relates
relates to MDEV-27689 Node hangs and complete galera cluste... Open

 Description   

We have multiple galera clusters working in a multi-master setup. And noticed that a "sleeping" system thread could hung the whole cluster.

When this system thread hung as shown in the screenshot, the whole galera cluster goes into a stand still. Nothing an be written into the database

We have a log that print the "wsrep_last_committed", it shows that one of the node 's wsrep_last_commited is not moving. Did the wsrep plugin in Galera hung?

The h5 server is the one that stuck. There is nothing in the mysql.err showing any stacktrace

2022-08-18 06:10:04,862 INFO galera_alert line:93 galerastats on node xxx-h4: 
2022-08-18 06:10:04,861 INFO galera_alert line:94 {'error': 0, 'payload': {'output': '{"Threads_connected": "150", "wsrep_last_committed": "21383020", 
2022-08-18 06:10:04,862 INFO galera_alert line:93 galerastats on node xxx-h5: 
2022-08-18 06:10:04,862 INFO galera_alert line:94 {'error': 0, 'payload': {'output': '{"Threads_connected": "590", "wsrep_last_committed": "21382990", 
2022-08-18 06:10:04,863 INFO galera_alert line:93 galerastats on node xxx-h6: 
2022-08-18 06:10:04,863 INFO galera_alert line:94 {'error': 0, 'payload': {'output': '{"Threads_connected": "204", "wsrep_last_committed": "21383020", 
....
....
2022-08-18 06:30:04,996 INFO galera_alert line:93 galerastats on node xxx-h4: 
2022-08-18 06:30:04,996 INFO galera_alert line:94 {'error': 0, 'payload': {'output': '{"Threads_connected": "170", "wsrep_last_committed": "21383020",
2022-08-18 06:30:04,997 INFO galera_alert line:93 galerastats on node xxx-h5: 
2022-08-18 06:30:04,997 INFO galera_alert line:94 {'error': 0, 'payload': {'output': '{"Threads_connected": "643", "wsrep_last_committed": "21382990", 
2022-08-18 06:30:04,997 INFO galera_alert line:93 galerastats on node xxx-h6: 
2022-08-18 06:30:04,997 INFO galera_alert line:94 {'error': 0, 'payload': {'output': '{"Threads_connected": "228", "wsrep_last_committed": "21383020", 

The only solution to "unbreak" it is to stop the hung node, kill mariadb and start the mariadb service



 Comments   
Comment by Daniel Black [ 2022-08-22 ]

Can you:

Comment by Khai Ping [ 2022-08-23 ]

@daniel, does installing the debug-info packages have any performance impact ?

Comment by Daniel Black [ 2022-08-23 ]

No, they are information only and used by gdb. Small bit of storage but no impacts to the running server or any replacement of code.

Comment by Khai Ping [ 2022-08-23 ]

thank you,i will come back with more information

Comment by Khai Ping [ 2022-08-24 ]

@daniel,

Does this means i do not need to install the debug info packages? As my binary is not stripped.

/opt/sbin/mariadbd: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=893a0b4698fc39d184df3f3c32df693dfa008884, not stripped

When i tried to gdb attach <pid>, i get these lines. Does that mean i need to install the debuginfo?

Reading symbols from /usr/lib64/libgssapi_krb5.so.2...Reading symbols from /usr/lib64/libgssapi_krb5.so.2...(no debugging symbols found)...done.

Comment by Daniel Black [ 2022-08-24 ]

The binary is not technically stripped however a split-debug technique commonly used means that the debug info isn't in the binary, but in separate files, hence the debuginfo packages are still needed.

Missing debug information from the libraries mariadb uses isn't a large impediment as the fault is unlikely to be in these libraries. If in doubt, just include the generated gdb information.

If for some reason you feel uncomfortable with the detail in the gdb output you can upload it privately to the ftp server.

Comment by Khai Ping [ 2022-09-06 ]

@daniel, we are building our own mariadb using the spec file , however the debuginfo rpm is not getting generated for 10.6.5 , however it is getting generated for 10.6.9

any idea what could be causing it?

Comment by Daniel Black [ 2022-09-26 ]

> we are building our own mariadb using the spec file ,

Why? What is it?

> however the debuginfo rpm is not getting generated for 10.6.5 , however it is getting generated for 10.6.9
> any idea what could be causing it?

No. I could guess the cmake version is different. But I can't think of a code change that made this difference.

Comment by Khai Ping [ 2022-10-18 ]

hi daniel, the command provided by the doc does not seems to work in my system.

sudo gdb --batch --eval-command="thread apply all bt  -frame-arguments all full" /usr/sbin/mariadbd $(pgrep -xn mariadbd)  > mariadbd_full_bt_all_threads.txt

The command above give me output like this

[New LWP 27304]
[New LWP 27303]
[New LWP 27110]
...
...
[New LWP 29265]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
0x00007f4727783ccd in poll () from /usr/lib64/libc.so.6
 
Thread 100 (Thread 0x7f4729b7c700 (LWP 29265)):
[Inferior 1 (process 29263) detached]

However, this alternate command seems to be working

sudo gdb --batch --eval-command="thread apply all bt full" /usr/sbin/mariadbd $(pgrep -xn mariadbd)  > mariadbd_full_bt_all_threads.txt

Sample output looks like this, is this something you guys are looking for?

Thread 1 (Thread 0x7f4729cfc8c0 (LWP 29263)):
#0  0x00007f4727783ccd in poll () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x000055b20755e7ca in poll (__timeout=-1, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:46
No locals.
#2  handle_connections_sockets() () at /usr/src/debug/MariaDB-Galera-/src_0/sql/mysqld.cc:6112
        sock = {fd = 33, is_unix_domain_socket = <optimized out>, is_extra_port = <optimized out>, address_family = <optimized out>, m_psi = 0x0}
        error_count = 0
        cAddr = {ss_family = 1,
          __ss_padding = "\263$\177\000\000\001\000\000\000\000\000\000\000\000p\017\340\v\262U\000\000h\210\220\304\376\177\000\000\200%c\b\262U\000\000G*\344\a\262U\000\000\352\f\000\000\000\000\000\000\330+\344\a\262U\000\000\270\nc\b\262U\000\000\060Bc\b\262U\000\000\000\000\000\000\000\000\000\000!\000\000\000\000\000\000\000\300s\220\304\376\177\000\000I\002X\a\262U\000\000\023p\372\a\262U\000", __ss_align = 0}
        retval = <optimized out>
        fds = {array = {buffer = 0x55b20bd5d808 "\037", elements = 3, max_element = 16, alloc_increment = 16, size_of_element = 8, m_psi_key = 0, malloc_flags = 0}}
#3  0x000055b20755f739 in mysqld_main(int, char**) () at /usr/src/debug/MariaDB-Galera-/src_0/sql/mysqld.cc:5817
        please_close_stdin = true
        ho_error = <optimized out>
        new_thread_stack_size = <optimized out>
        user = <optimized out>
---Type <return> to continue, or q <return> to quit---
#4  0x00007f47276b2555 in __libc_start_main () from /usr/lib64/libc.so.6
No symbol table info available.
#5  0x000055b207553ec4 in _start () at /usr/src/debug/MariaDB-Galera-/src_0/sql/sql_array.h:129

Comment by Jan Lindström (Inactive) [ 2023-01-02 ]

khaiping.loh Yes, that output would be more than useful. Please provide also full error log. Can you try with more recent version of MariaDB and Galera library.

Comment by king [ 2023-01-30 ]

10.6.8 have the same problem

Comment by Daniel Black [ 2023-01-30 ]

> Please provide also full error log

and the full output of the sudo gdb --batch ....

Comment by Khai Ping [ 2023-02-10 ]

@daniel , i have attached mariadbd_full_bt_all_threads.txt .

Is this issue resolve in mariadb 10.6.12? I am referencing this ticket https://jira.mariadb.org/browse/MDEV-29684, it seems like it is fixed?

Comment by Daniel Black [ 2023-02-10 ]

Thank you. What analysis have you done that makes you think it is MDEV-29684?

This does have killed threads holding locks so it potentially the same, but a more complete look than what I have time for now is required to be more definate.

Comment by Khai Ping [ 2023-02-10 ]
  1. Due to the release notes , https://mariadb.com/kb/en/mariadb-10-6-12-release-notes/ . It mention "Fixes for cluster wide write conflict resolving"
  2. In my environment, we only see this problem in multi master galera nodes. Our application can send the writes to any of the galera nodes. In high concurrency, there is bound to have galera write conflicts. In MDEV-29684, it mention this line "This requires multi-master testing"
  3. And also in MDEV-29684, the first sentence mention this "There are a number of bug reports of cluster wide conflict resolving related crashes or hangs." When this issue happens in our environment, nothing can be written anymore, it is as tho the cluster hung

Appreciate your prompt response, i hope the bt thread logs is helpful. That log is retrieve from the node that hung.

Comment by Khai Ping [ 2023-02-11 ]

uploaded another logs mariadbd_full_bt_all_threads_11feb246.txt

Comment by Khai Ping [ 2023-02-15 ]

@Julius Goryavsky , any idea if the logs were useful in helping to find out if it is related to MDEV-29684?

Comment by Julien Fritsch [ 2023-12-05 ]

Automated message:
----------------------------
Since this issue has not been updated since 6 weeks, it's time to move it back to Stalled.

Comment by JiraAutomate [ 2023-12-05 ]

Automated message:
----------------------------
Since this issue has not been updated since 6 weeks, it's time to move it back to Stalled.

Generated at Thu Feb 08 10:07:49 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.