Details
Description
We run multiple Galera clusters in a multi-master setup and have noticed that a "sleeping" system thread can hang the whole cluster.
When this system thread hangs, as shown in the screenshot, the whole Galera cluster comes to a standstill and nothing can be written to the database.
We have a log that prints "wsrep_last_committed"; it shows that one node's wsrep_last_committed is not advancing. Did the wsrep plugin in Galera hang?
The h5 server is the one that is stuck. There is no stack trace in mysql.err.
2022-08-18 06:10:04,862 INFO galera_alert line:93 galerastats on node xxx-h4:
2022-08-18 06:10:04,861 INFO galera_alert line:94 {'error': 0, 'payload': {'output': '{"Threads_connected": "150", "wsrep_last_committed": "21383020",
2022-08-18 06:10:04,862 INFO galera_alert line:93 galerastats on node xxx-h5:
2022-08-18 06:10:04,862 INFO galera_alert line:94 {'error': 0, 'payload': {'output': '{"Threads_connected": "590", "wsrep_last_committed": "21382990",
2022-08-18 06:10:04,863 INFO galera_alert line:93 galerastats on node xxx-h6:
2022-08-18 06:10:04,863 INFO galera_alert line:94 {'error': 0, 'payload': {'output': '{"Threads_connected": "204", "wsrep_last_committed": "21383020",
....
2022-08-18 06:30:04,996 INFO galera_alert line:93 galerastats on node xxx-h4:
2022-08-18 06:30:04,996 INFO galera_alert line:94 {'error': 0, 'payload': {'output': '{"Threads_connected": "170", "wsrep_last_committed": "21383020",
2022-08-18 06:30:04,997 INFO galera_alert line:93 galerastats on node xxx-h5:
2022-08-18 06:30:04,997 INFO galera_alert line:94 {'error': 0, 'payload': {'output': '{"Threads_connected": "643", "wsrep_last_committed": "21382990",
2022-08-18 06:30:04,997 INFO galera_alert line:93 galerastats on node xxx-h6:
2022-08-18 06:30:04,997 INFO galera_alert line:94 {'error': 0, 'payload': {'output': '{"Threads_connected": "228", "wsrep_last_committed": "21383020",
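The comparison between the 06:10 and 06:30 samples can be done mechanically. The following is only a minimal sketch, assuming the per-node wsrep_last_committed values have already been extracted into dictionaries keyed by node name. The function name find_stalled_nodes and the dictionary layout are illustrative and are not part of the galera_alert script.

```python
def find_stalled_nodes(earlier, later):
    """Return nodes whose wsrep_last_committed did not advance between two
    samples and is behind the highest value seen in the later sample."""
    cluster_max = max(later.values())
    stalled = []
    for node, seqno in sorted(later.items()):
        # The node was sampled before and its counter has not moved forward.
        not_advancing = node in earlier and seqno <= earlier[node]
        # The node is behind the most advanced node in the cluster.
        behind = seqno < cluster_max
        if not_advancing and behind:
            stalled.append(node)
    return stalled


# Values copied from the 06:10 and 06:30 log samples in this report.
sample_0610 = {"xxx-h4": 21383020, "xxx-h5": 21382990, "xxx-h6": 21383020}
sample_0630 = {"xxx-h4": 21383020, "xxx-h5": 21382990, "xxx-h6": 21383020}

print(find_stalled_nodes(sample_0610, sample_0630))  # ['xxx-h5']
```

Because writes stop on every node once the cluster is blocked, no counter advances on any of them; the "behind the cluster maximum" condition is what singles out h5 here. This is a heuristic for spotting the lagging node, not proof of which node caused the hang.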
The only way to "unbreak" the cluster is to stop the hung node, kill mariadb and start the mariadb service again.
Issue Links
- is caused by
  - MDEV-29293 MariaDB stuck on starting commit state (waiting on commit order critical section) (Closed)
- relates to
  - MDEV-27689 Node hangs and complete galera cluster freezes (Closed)
  - MDEV-30718 Cluster hanging regularly on Update_rows_log_event (Closed)