Details
- Bug
- Status: Open
- Critical
- Resolution: Unresolved
- 10.11.9
- None
- None
- Rocky Linux 9.4
Description
Dear all,
we are running a MariaDB Galera Cluster consisting of three nodes, with an additional garbd (Galera Arbitrator) node.
Occasionally, the cluster becomes stuck and the MariaDB processes enter the state "Waiting for certification". When this happens, the cluster does not recover automatically and normal operation cannot continue. The only way to restore the service has been to follow the crash recovery procedure and bootstrap a new Galera cluster.
The behavior appears to be similar to the issue reported in MDEV-34784. It also seems related to a fix that has already been implemented in Percona’s Galera fork:
https://github.com/percona/galera/pull/214/files
Could you please confirm whether this issue is already known in MariaDB Galera and whether the corresponding fix is planned to be included?
Below I have attached details of our environment where the problem occurs.
Best regards
- Version Information:
MariaDB [(none)]> show variables like "%version%";
+-----------------------------------+------------------------------------------+
| Variable_name | Value |
+-----------------------------------+------------------------------------------+
| in_predicate_conversion_threshold | 1000 |
| protocol_version | 10 |
| slave_type_conversions | |
| system_versioning_alter_history | ERROR |
| system_versioning_asof | DEFAULT |
| system_versioning_insert_history | OFF |
| tls_version | TLSv1.2,TLSv1.3 |
| version | 10.11.9-MariaDB-log |
| version_comment | MariaDB Server |
| version_compile_machine | x86_64 |
| version_compile_os | Linux |
| version_malloc_library | system |
| version_source_revision | 0e8fb977b00983d98c4c35e39bc1f36463095938 |
| version_ssl_library | OpenSSL 3.0.7 1 Nov 2022 |
| wsrep_patch_version | wsrep_26.22 |
+-----------------------------------+------------------------------------------+
- Part of the processlist from the time when the issue occurred. Many connections are in the "Waiting for certification" state.
node0
| 4756856 | xxxx | xxx.xxx.xxx.x:44586 | xxxx | Query | 5562 | Waiting for certification | COMMIT | 0.000 | |
| 4756858 | xxxx | xxx.xxx.xxx.x:44594 | xxxx | Sleep | 105 | | NULL | 0.000 | |
| 4756863 | xxxx | xxx.xxx.xxx.x:44634 | xxxx | Sleep | 224 | | NULL | 0.000 | |
| 4756867 | xxxx | xxx.xxx.xxx.x:44666 | xxxx | Query | 5592 | Waiting for certification | COMMIT | 0.000 | |
| 4756870 | xxxx | xxx.xxx.xxx.x:44700 | xxxx | Sleep | 898 | | NULL | 0.000 | |
| 4756871 | xxxx | xxx.xxx.xxx.x:44716 | xxxx | Sleep | 54 | | NULL | 0.000 | |
| 4756875 | xxxx | xxx.xxx.xxx.x:49596 | xxxx | Query | 5560 | Waiting for certification | COMMIT | 0.000 | |
| 4756880 | xxxx | xxx.xxx.xxx.x:49646 | xxxx | Query | 5553 | Waiting for certification | COMMIT | 0.000 | |
| 4756881 | xxxx | xxx.xxx.xxx.x:49660 | xxxx | Query | 5473 | Waiting for certification | COMMIT | 0.000 | |
| 4756882 | xxxx | xxx.xxx.xxx.x:49670 | xxxx | Query | 5592 | Waiting for certification | COMMIT | 0.000 | |
| 4756883 | xxxx | xxx.xxx.xxx.x:49684 | xxxx | Query | 5230 | Waiting for certification | COMMIT | 0.000 | |
| 4756885 | xxxx | xxx.xxx.xxx.x:49690 | xxxx | Query | 5136 | Waiting for certification | COMMIT | 0.000 | |
| 4756886 | xxxx | xxx.xxx.xxx.x:49696 | xxxx | Query | 5592 | Waiting for certification | COMMIT | 0.000 | |
| 4756887 | xxxx | xxx.xxx.xxx.x:49710 | xxxx | Query | 5559 | Waiting for certification | COMMIT | 0.000 | |
| 4756888 | xxxx | xxx.xxx.xxx.x:49724 | xxxx | Query | 5462 | Waiting for certification | COMMIT | 0.000 | |
| 4756889 | xxxx | xxx.xxx.xxx.x:49728 | xxxx | Sleep | 874 | | NULL | 0.000 | |
| 4756893 | xxxx | xxx.xxx.xxx.x:49768 | xxxx | Query | 5592 | Waiting for certification | COMMIT | 0.000 | |
| 4756895 | xxxx | xxx.xxx.xxx.x:49780 | xxxx | Sleep | 2938 | | NULL | 0.000 | |
| 4756901 | xxxx | xxx.xxx.xxx.x:49828 | xxxx | Query | 4748 | Commit | COMMIT | 0.000 | |
| 4756902 | xxxx | xxx.xxx.xxx.x:49836 | xxxx | Query | 5592 | Waiting for certification | COMMIT | 0.000 | |
| 4756903 | xxxx | xxx.xxx.xxx.x:49846 | xxxx | Query | 5592 | Waiting for certification | COMMIT | 0.000 | |
| 4756906 | xxxx | xxx.xxx.xxx.x:49874 | xxxx | Query | 5592 | Waiting for certification | COMMIT | 0.000 | |
| 4756907 | xxxx | xxx.xxx.xxx.x:49886 | xxxx | Query | 5592 | Waiting for certification | COMMIT | 0.000 | |
| 4756908 | xxxx | xxx.xxx.xxx.x:49888 | xxxx | Query | 5592 | Waiting for certification | COMMIT | 0.000 | |
| 4756912 | xxxx | xxx.xxx.xxx.x:49926 | xxxx | Query | 5594 | Waiting for certification | COMMIT | 0.000 | |
| 4756914 | xxxx | xxx.xxx.xxx.x:49948 | xxxx | Query | 5531 | Waiting for certification | COMMIT | 0.000 | |
| 4756917 | xxxx | xxx.xxx.xxx.x:49970 | xxxx | Query | 5592 | Waiting for certification | COMMIT | 0.000 | |
| 4756919 | xxxx | xxx.xxx.xxx.x:49982 | xxxx | Query | 2291 | Commit | COMMIT | 0.000 | |
| 4756921 | xxxx | xxx.xxx.xxx.x:50000 | xxxx | Query | 5560 | Waiting for certification | COMMIT | 0.000 | |
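To quantify how many sessions are stuck, rows like the ones above can be tallied with a short script. The following is a minimal sketch in Python, assuming the pipe-delimited layout shown in this report (Id, User, Host, db, Command, Time, State, Info, Progress). The function name `stuck_sessions` and the 60-second threshold are illustrative choices and not part of MariaDB or Galera.

```python
def stuck_sessions(processlist_text, state="Waiting for certification", min_seconds=60):
    """Return (id, seconds) for sessions that have been in `state` for at least min_seconds."""
    stuck = []
    for line in processlist_text.splitlines():
        # Split on '|' and drop the empty cell in front of the leading pipe.
        cells = [c.strip() for c in line.split("|")][1:]
        # Data rows need the Id..State columns, with numeric Id and Time.
        if len(cells) < 7 or not cells[0].isdigit() or not cells[5].isdigit():
            continue
        if cells[6] == state and int(cells[5]) >= min_seconds:
            stuck.append((int(cells[0]), int(cells[5])))
    return stuck


# Three rows copied from the node0 excerpt in this report.
sample = """\
| 4756856 | xxxx | xxx.xxx.xxx.x:44586 | xxxx | Query | 5562 | Waiting for certification | COMMIT | 0.000 | |
| 4756858 | xxxx | xxx.xxx.xxx.x:44594 | xxxx | Sleep | 105 | | NULL | 0.000 | |
| 4756901 | xxxx | xxx.xxx.xxx.x:49828 | xxxx | Query | 4748 | Commit | COMMIT | 0.000 | |
"""
print(stuck_sessions(sample))
```

Passing `state="Waiting to execute in isolation"` would tally the blocked TRUNCATE statements seen on node1 in the same way.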
node1
MariaDB [vpabx]> show processlist;
+--------+-------------+----------------------+-------+---------+----------+---------------------------------+------------------------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+--------+-------------+----------------------+-------+---------+----------+---------------------------------+------------------------------------+----------+
| 1 | system user | | NULL | Sleep | 11361749 | wsrep aborter idle | NULL | 0.000 | |
| 2 | system user | | NULL | Sleep | 1856 | wsrep applier committed | NULL | 0.000 | |
| 6 | system user | | NULL | Sleep | 1878 | wsrep applier committed | NULL | 0.000 | |
| 8 | system user | | NULL | Sleep | 1874 | After apply log event | NULL | 0.000 | |
| 7 | system user | | NULL | Sleep | 1868 | wsrep applier committed | NULL | 0.000 | |
| 12 | system user | | NULL | Sleep | 1853 | wsrep applier committed | NULL | 0.000 | |
| 11 | system user | | NULL | Sleep | 1867 | wsrep applier committed | NULL | 0.000 | |
| 9 | system user | | NULL | Sleep | 1832 | wsrep applier committed | NULL | 0.000 | |
| 13 | system user | | NULL | Sleep | 1868 | wsrep applier committed | NULL | 0.000 | |
| 16 | system user | | NULL | Sleep | 1829 | wsrep applier committed | NULL | 0.000 | |
| 15 | system user | | NULL | Sleep | 1834 | wsrep applier committed | NULL | 0.000 | |
| 14 | system user | | NULL | Sleep | 1868 | wsrep applier committed | NULL | 0.000 | |
| 17 | system user | | NULL | Sleep | 1857 | wsrep applier committed | NULL | 0.000 | |
| 18 | system user | | NULL | Sleep | 1868 | wsrep applier committed | NULL | 0.000 | |
| 19 | system user | | NULL | Sleep | 1868 | wsrep applier committed | NULL | 0.000 | |
| 20 | system user | | NULL | Sleep | 1868 | wsrep applier committed | NULL | 0.000 | |
| 22 | system user | | NULL | Sleep | 1880 | wsrep applier committed | NULL | 0.000 | |
| 176776 | monitor | xxx.xxx.xxx.xx:40088 | NULL | Sleep | 8 | | NULL | 0.000 | |
| 176777 | monitor | xxx.xxx.xxx.xx:40092 | NULL | Sleep | 3 | | NULL | 0.000 | |
| 347847 | monitor | xxx.xxx.xxx.x:57252 | NULL | Sleep | 0 | | NULL | 0.000 | |
| 359331 | monitor | xxx.xxx.xxx.x:43282 | NULL | Sleep | 5 | | NULL | 0.000 | |
| 474190 | root | localhost | xxxx | Sleep | 353 | | NULL | 0.000 | |
| 474213 | xxxx | % | xxxx | Query | 1695 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx | 0.000 | |
| 474218 | xxxx | % | xxxx | Query | 1575 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx | 0.000 | |
| 474223 | xxxx | % | xxxx | Query | 1455 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx | 0.000 | |
| 474225 | root | localhost | xxxx | Sleep | 1408 | | NULL | 0.000 | |
| 474229 | xxxx | % | xxxx | Query | 1335 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx | 0.000 | |
| 474234 | xxxx | % | xxxx | Query | 1215 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx | 0.000 | |
| 474239 | xxxx | % | xxxx | Query | 1095 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx | 0.000 | |
| 474241 | root | localhost | NULL | Sleep | 1044 | | NULL | 0.000 | |
| 474245 | xxxx | % | xxxx | Query | 975 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx | 0.000 | |
| 474250 | xxxx | % | xxxx | Query | 855 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx | 0.000 | |
| 474255 | xxxx | % | xxxx | Query | 735 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx | 0.000 | |
| 474260 | xxxx | % | xxxx | Query | 615 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx | 0.000 | |
| 474265 | xxxx | % | xxxx | Query | 495 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx | 0.000 | |
| 474270 | xxxx | % | xxxx | Query | 375 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx | 0.000 | |
| 474275 | xxxx | % | xxxx | Query | 255 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx | 0.000 | |
| 474280 | xxxx | % | xxxx | Query | 135 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx | 0.000 | |
| 474281 | root | localhost | xxxx | Query | 0 | starting | show processlist | 0.000 | |
| 474286 | xxxx | % | xxxx | Query | 15 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx | 0.000 | |
+--------+-------------+----------------------+-------+---------+----------+---------------------------------+------------------------------------+----------+
- Wsrep status:
MariaDB [(none)]> show global status like 'wsrep%';
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| wsrep_local_state_uuid | 5f682e6f-957d-11f0-97e1-7ec9f3bc7bdc |
| wsrep_protocol_version | 11 |
| wsrep_last_committed | 599998991 |
| wsrep_replicated | 373431 |
| wsrep_replicated_bytes | 765580936 |
| wsrep_repl_keys | 23011604 |
| wsrep_repl_keys_bytes | 193055184 |
| wsrep_repl_data_bytes | 547291425 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 488032019 |
| wsrep_received_bytes | 635836905928 |
| wsrep_local_commits | 184116 |
| wsrep_local_cert_failures | 1 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 37 |
| wsrep_local_send_queue_max | 37 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0.00019345 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 299 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0.163409 |
| wsrep_local_cached_downto | 598051212 |
| wsrep_flow_control_paused_ns | 4900326180382 |
| wsrep_flow_control_paused | 0.000431198 |
| wsrep_flow_control_sent | 5871 |
| wsrep_flow_control_recv | 5865 |
| wsrep_flow_control_active | true |
| wsrep_flow_control_requested | false |
| wsrep_cert_deps_distance | 66.6429 |
| wsrep_apply_oooe | 0.121689 |
| wsrep_apply_oool | 0.00223408 |
| wsrep_apply_window | 1.2887 |
| wsrep_apply_waits | 49267 |
| wsrep_commit_oooe | 0 |
| wsrep_commit_oool | 0 |
| wsrep_commit_window | 1.10472 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 10479 |
| wsrep_causal_reads | 1 |
| wsrep_cert_interval | 100.176 |
| wsrep_open_transactions | 0 |
| wsrep_open_connections | 37 |
| wsrep_incoming_addresses | ,xxx.xxx.xx.x:0,xxx.xxx.xxx.xx:0 |
| wsrep_cluster_weight | 3 |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | f0b72391-aec0-11f0-be31-be8343836e8e |
| wsrep_gmcast_segment | 0 |
| wsrep_applier_thread_count | 16 |
| wsrep_cluster_capabilities | |
| wsrep_cluster_conf_id | 626 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 5f682e6f-957d-11f0-97e1-7ec9f3bc7bdc |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 2 |
| wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 26.4.19(r5db72dad) |
| wsrep_ready | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 17 |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
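Because the node still reports Synced and Primary while commits hang, it may help to look at the send queue and flow-control counters instead. The following is a rough Python sketch that parses status rows in the `| name | value |` form, one row per line, and flags a few conditions. The helper names `parse_status` and `warning_signs` and the choice of checks are assumptions made for illustration; they are heuristics, not an official Galera diagnostic.

```python
def parse_status(text):
    """Parse '| name | value |' rows (one per line) into a dict of strings."""
    status = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # A data row splits into: '', name, value, ''
        if len(cells) >= 4 and cells[1].startswith("wsrep_"):
            status[cells[1]] = cells[2]
    return status


def warning_signs(status):
    """Return human-readable hints; the checks are illustrative heuristics only."""
    signs = []
    if status.get("wsrep_cluster_status") != "Primary":
        signs.append("node is not in the primary component")
    if status.get("wsrep_local_state_comment") != "Synced":
        signs.append("node is not Synced")
    queue = int(status.get("wsrep_local_send_queue", 0))
    if queue > 0 and queue == int(status.get("wsrep_local_send_queue_max", 0)):
        signs.append("send queue is at its recorded maximum (%d)" % queue)
    if status.get("wsrep_flow_control_active") == "true":
        signs.append("flow control is active")
    return signs


# Values taken from the status output in this report.
sample = """\
| wsrep_local_send_queue | 37 |
| wsrep_local_send_queue_max | 37 |
| wsrep_flow_control_active | true |
| wsrep_local_state_comment | Synced |
| wsrep_cluster_status | Primary |
"""
for sign in warning_signs(parse_status(sample)):
    print(sign)
```

With the values from this report, the sketch flags the send queue sitting at its maximum of 37 and active flow control, while the state and cluster status checks pass.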
- Errors in the MariaDB log. As you can see, there were network problems on node0.
2026-03-02 8:47:11 0 [Note] WSREP: Unable to report last applied write-set to cluster. Will try later. (gcs_sm_enter(): 110 seqno: 599795896)
2026-03-02 9:03:23 4752610 [Warning] Aborted connection 4752610 to db: 'xxxxt' user: 'xxxx' host: 'xxx.xxx.xx.x' (Got an error reading communication packets)
2026-03-02 9:03:52 0 [Note] WSREP: Unable to report last applied write-set to cluster. Will try later. (gcs_sm_enter(): 110 seqno: 599996251)
2026-03-02 9:04:16 4646307 [Warning] Aborted connection 4646307 to db: 'xxxx' user: 'xxxx' host: 'xxx.xxx.xx.x' (Got an error writing communication packets)
2026-03-02 9:04:18 0 [Note] WSREP: Unable to report last applied write-set to cluster. Will try later. (gcs_sm_enter(): 110 seqno: 599996261)
2026-03-02 9:05:56 4755507 [Warning] Aborted connection 4755507 to db: 'xxxx' user: 'xxxx' host: 'xxx.xxx.xx.x' (Got an error reading communication packets)
2026-03-02 9:06:16 4758103 [Note] InnoDB: Number of transaction pools: 3
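To line these messages up with the time the commits started hanging, the "Unable to report last applied write-set" notes can be pulled out of the log together with their seqno. The following is a minimal Python sketch, assuming the line format shown in the excerpt above; the name `report_failures` is hypothetical. If the 110 printed after `gcs_sm_enter()` is an errno value, which is an assumption here, it would correspond to ETIMEDOUT on Linux.

```python
import re

# Matches the WSREP note as it appears in the log excerpt of this report.
PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2})\s+(\d{1,2}:\d{2}:\d{2})\s+\d+\s+\[Note\] WSREP: "
    r"Unable to report last applied write-set to cluster\..*"
    r"\(gcs_sm_enter\(\): (\d+) seqno: (\d+)\)"
)


def report_failures(log_text):
    """Return (date, time, code, seqno) for each matching note in the log text."""
    events = []
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            date, time, code, seqno = match.groups()
            events.append((date, time, int(code), int(seqno)))
    return events


# Lines copied from the log excerpt in this report.
sample = """\
2026-03-02 8:47:11 0 [Note] WSREP: Unable to report last applied write-set to cluster. Will try later. (gcs_sm_enter(): 110 seqno: 599795896)
2026-03-02 9:04:16 4646307 [Warning] Aborted connection 4646307 to db: 'xxxx' user: 'xxxx' host: 'xxx.xxx.xx.x' (Got an error writing communication packets)
2026-03-02 9:03:52 0 [Note] WSREP: Unable to report last applied write-set to cluster. Will try later. (gcs_sm_enter(): 110 seqno: 599996251)
"""
for event in report_failures(sample):
    print(event)
```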
Issue Links
- relates to MDEV-34784 MariaDB Stuck in "Waiting for certification" (Open)