MariaDB Server / MDEV-38985

MariaDB Galera Cluster Stuck in "Waiting for certification"

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 10.11.9
    • Fix Version/s: None
    • Component/s: Galera, wsrep
    • Labels: None
    • Environment: Rocky Linux 9.4

    Description

      Dear all,

      we are running a MariaDB Galera Cluster consisting of three nodes, with an additional garbd (Galera Arbitrator) node.

      Occasionally, the cluster becomes stuck and the MariaDB processes enter the state "Waiting for certification". When this happens, the cluster does not recover automatically and normal operation cannot continue. The only way to restore the service has been to follow the crash recovery procedure and bootstrap a new Galera cluster.
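
      For reference, the bootstrap step of that recovery procedure usually starts by finding the node with the most advanced state. Below is a minimal Python sketch of that selection, assuming the text of each node's grastate.dat has been collected and holds a valid seqno (after an unclean shutdown it is often -1, and the position has to be recovered separately first). The function names are hypothetical helpers, not part of any MariaDB or Galera tooling.

```python
def parse_grastate(text):
    """Parse the key/value lines of a grastate.dat file into a dict."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "# GALERA saved state" header
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def pick_bootstrap_node(grastates):
    """Return the name of the node with the highest seqno, or None.

    grastates maps a node name (hypothetical labels such as "node0")
    to the text of that node's grastate.dat. Nodes whose seqno is -1
    (position unknown) are never selected.
    """
    best_name, best_seqno = None, -1
    for name, text in grastates.items():
        seqno = int(parse_grastate(text).get("seqno", "-1"))
        if seqno > best_seqno:
            best_name, best_seqno = name, seqno
    return best_name
```

      If every node reports seqno -1, the function returns None, which signals that the positions still need to be recovered before a bootstrap node can be chosen.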

      The behavior appears to be similar to the issue reported in MDEV-34784. It also seems related to a fix that has already been implemented in Percona’s Galera fork:

      https://github.com/percona/galera/pull/214/files

      Could you please confirm whether this issue is already known in MariaDB Galera and whether the corresponding fix is planned to be included?

      Below I have attached details of our environment where the problem occurs.

      Best regards

      • Version Information:

        MariaDB [(none)]> show variables like "%version%";
        +-----------------------------------+------------------------------------------+
        | Variable_name                     | Value                                    |
        +-----------------------------------+------------------------------------------+
        | in_predicate_conversion_threshold | 1000                                     |
        | protocol_version                  | 10                                       |
        | slave_type_conversions            |                                          |
        | system_versioning_alter_history   | ERROR                                    |
        | system_versioning_asof            | DEFAULT                                  |
        | system_versioning_insert_history  | OFF                                      |
        | tls_version                       | TLSv1.2,TLSv1.3                          |
        | version                           | 10.11.9-MariaDB-log                      |
        | version_comment                   | MariaDB Server                           |
        | version_compile_machine           | x86_64                                   |
        | version_compile_os                | Linux                                    |
        | version_malloc_library            | system                                   |
        | version_source_revision           | 0e8fb977b00983d98c4c35e39bc1f36463095938 |
        | version_ssl_library               | OpenSSL 3.0.7 1 Nov 2022                 |
        | wsrep_patch_version               | wsrep_26.22                              |
        +-----------------------------------+------------------------------------------+
        

      Part of the processlist captured while the issue was occurring. Many sessions are in the "Waiting for certification" state.

      node0

      | 4756856 | xxxx        | xxx.xxx.xxx.x:44586  | xxxx | Query   |     5562 | Waiting for certification       | COMMIT |    0.000 |
      | 4756858 | xxxx        | xxx.xxx.xxx.x:44594  | xxxx | Sleep   |      105 |                                 | NULL   |    0.000 |
      | 4756863 | xxxx        | xxx.xxx.xxx.x:44634  | xxxx | Sleep   |      224 |                                 | NULL   |    0.000 |
      | 4756867 | xxxx        | xxx.xxx.xxx.x:44666  | xxxx | Query   |     5592 | Waiting for certification       | COMMIT |    0.000 |
      | 4756870 | xxxx        | xxx.xxx.xxx.x:44700  | xxxx | Sleep   |      898 |                                 | NULL   |    0.000 |
      | 4756871 | xxxx        | xxx.xxx.xxx.x:44716  | xxxx | Sleep   |       54 |                                 | NULL   |    0.000 |
      | 4756875 | xxxx        | xxx.xxx.xxx.x:49596  | xxxx | Query   |     5560 | Waiting for certification       | COMMIT |    0.000 |
      | 4756880 | xxxx        | xxx.xxx.xxx.x:49646  | xxxx | Query   |     5553 | Waiting for certification       | COMMIT |    0.000 |
      | 4756881 | xxxx        | xxx.xxx.xxx.x:49660  | xxxx | Query   |     5473 | Waiting for certification       | COMMIT |    0.000 |
      | 4756882 | xxxx        | xxx.xxx.xxx.x:49670  | xxxx | Query   |     5592 | Waiting for certification       | COMMIT |    0.000 |
      | 4756883 | xxxx        | xxx.xxx.xxx.x:49684  | xxxx | Query   |     5230 | Waiting for certification       | COMMIT |    0.000 |
      | 4756885 | xxxx        | xxx.xxx.xxx.x:49690  | xxxx | Query   |     5136 | Waiting for certification       | COMMIT |    0.000 |
      | 4756886 | xxxx        | xxx.xxx.xxx.x:49696  | xxxx | Query   |     5592 | Waiting for certification       | COMMIT |    0.000 |
      | 4756887 | xxxx        | xxx.xxx.xxx.x:49710  | xxxx | Query   |     5559 | Waiting for certification       | COMMIT |    0.000 |
      | 4756888 | xxxx        | xxx.xxx.xxx.x:49724  | xxxx | Query   |     5462 | Waiting for certification       | COMMIT |    0.000 |
      | 4756889 | xxxx        | xxx.xxx.xxx.x:49728  | xxxx | Sleep   |      874 |                                 | NULL   |    0.000 |
      | 4756893 | xxxx        | xxx.xxx.xxx.x:49768  | xxxx | Query   |     5592 | Waiting for certification       | COMMIT |    0.000 |
      | 4756895 | xxxx        | xxx.xxx.xxx.x:49780  | xxxx | Sleep   |     2938 |                                 | NULL   |    0.000 |
      | 4756901 | xxxx        | xxx.xxx.xxx.x:49828  | xxxx | Query   |     4748 | Commit                          | COMMIT |    0.000 |
      | 4756902 | xxxx        | xxx.xxx.xxx.x:49836  | xxxx | Query   |     5592 | Waiting for certification       | COMMIT |    0.000 |
      | 4756903 | xxxx        | xxx.xxx.xxx.x:49846  | xxxx | Query   |     5592 | Waiting for certification       | COMMIT |    0.000 |
      | 4756906 | xxxx        | xxx.xxx.xxx.x:49874  | xxxx | Query   |     5592 | Waiting for certification       | COMMIT |    0.000 |
      | 4756907 | xxxx        | xxx.xxx.xxx.x:49886  | xxxx | Query   |     5592 | Waiting for certification       | COMMIT |    0.000 |
      | 4756908 | xxxx        | xxx.xxx.xxx.x:49888  | xxxx | Query   |     5592 | Waiting for certification       | COMMIT |    0.000 |
      | 4756912 | xxxx        | xxx.xxx.xxx.x:49926  | xxxx | Query   |     5594 | Waiting for certification       | COMMIT |    0.000 |
      | 4756914 | xxxx        | xxx.xxx.xxx.x:49948  | xxxx | Query   |     5531 | Waiting for certification       | COMMIT |    0.000 |
      | 4756917 | xxxx        | xxx.xxx.xxx.x:49970  | xxxx | Query   |     5592 | Waiting for certification       | COMMIT |    0.000 |
      | 4756919 | xxxx        | xxx.xxx.xxx.x:49982  | xxxx | Query   |     2291 | Commit                          | COMMIT |    0.000 |
      | 4756921 | xxxx        | xxx.xxx.xxx.x:50000  | xxxx | Query   |     5560 | Waiting for certification       | COMMIT |    0.000 |
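
      To quantify how many sessions are stuck and for how long, output like the above can be summarized per state. Below is a minimal Python sketch, assuming the nine-column SHOW PROCESSLIST layout (Id, User, Host, db, Command, Time, State, Info, Progress) and that the Info column contains no "|" character. The function names are hypothetical.

```python
from collections import Counter


def parse_processlist(text):
    """Yield (time_seconds, state) for each data row of SHOW PROCESSLIST output."""
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Expected columns: Id, User, Host, db, Command, Time, State, Info, Progress
        if len(cells) != 9 or not cells[0].isdigit():
            continue  # table borders, header rows, or rows that do not fit the layout
        yield int(cells[5]), cells[6]


def summarize_states(text):
    """Return {state: (session_count, longest_time_seconds)}."""
    counts, longest = Counter(), {}
    for seconds, state in parse_processlist(text):
        counts[state] += 1
        longest[state] = max(longest.get(state, 0), seconds)
    return {state: (counts[state], longest[state]) for state in counts}
```

      Sessions with an empty State column (idle connections) are grouped under the empty string.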
      

      node1

      MariaDB [vpabx]> show processlist;
      +--------+-------------+----------------------+-------+---------+----------+---------------------------------+------------------------------------+----------+
      | Id     | User        | Host                 | db    | Command | Time     | State                           | Info                               | Progress |
      +--------+-------------+----------------------+-------+---------+----------+---------------------------------+------------------------------------+----------+
      |      1 | system user |                      | NULL  | Sleep   | 11361749 | wsrep aborter idle              | NULL                               |    0.000 |
      |      2 | system user |                      | NULL  | Sleep   |     1856 | wsrep applier committed         | NULL                               |    0.000 |
      |      6 | system user |                      | NULL  | Sleep   |     1878 | wsrep applier committed         | NULL                               |    0.000 |
      |      8 | system user |                      | NULL  | Sleep   |     1874 | After apply log event           | NULL                               |    0.000 |
      |      7 | system user |                      | NULL  | Sleep   |     1868 | wsrep applier committed         | NULL                               |    0.000 |
      |     12 | system user |                      | NULL  | Sleep   |     1853 | wsrep applier committed         | NULL                               |    0.000 |
      |     11 | system user |                      | NULL  | Sleep   |     1867 | wsrep applier committed         | NULL                               |    0.000 |
      |      9 | system user |                      | NULL  | Sleep   |     1832 | wsrep applier committed         | NULL                               |    0.000 |
      |     13 | system user |                      | NULL  | Sleep   |     1868 | wsrep applier committed         | NULL                               |    0.000 |
      |     16 | system user |                      | NULL  | Sleep   |     1829 | wsrep applier committed         | NULL                               |    0.000 |
      |     15 | system user |                      | NULL  | Sleep   |     1834 | wsrep applier committed         | NULL                               |    0.000 |
      |     14 | system user |                      | NULL  | Sleep   |     1868 | wsrep applier committed         | NULL                               |    0.000 |
      |     17 | system user |                      | NULL  | Sleep   |     1857 | wsrep applier committed         | NULL                               |    0.000 |
      |     18 | system user |                      | NULL  | Sleep   |     1868 | wsrep applier committed         | NULL                               |    0.000 |
      |     19 | system user |                      | NULL  | Sleep   |     1868 | wsrep applier committed         | NULL                               |    0.000 |
      |     20 | system user |                      | NULL  | Sleep   |     1868 | wsrep applier committed         | NULL                               |    0.000 |
      |     22 | system user |                      | NULL  | Sleep   |     1880 | wsrep applier committed         | NULL                               |    0.000 |
      | 176776 | monitor     | xxx.xxx.xxx.xx:40088 | NULL  | Sleep   |        8 |                                 | NULL                               |    0.000 |
      | 176777 | monitor     | xxx.xxx.xxx.xx:40092 | NULL  | Sleep   |        3 |                                 | NULL                               |    0.000 |
      | 347847 | monitor     | xxx.xxx.xxx.x:57252  | NULL  | Sleep   |        0 |                                 | NULL                               |    0.000 |
      | 359331 | monitor     | xxx.xxx.xxx.x:43282  | NULL  | Sleep   |        5 |                                 | NULL                               |    0.000 |
      | 474190 | root        | localhost            | xxxx  | Sleep   |      353 |                                 | NULL                               |    0.000 |
      | 474213 | xxxx        | %                    | xxxx  | Query   |     1695 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx     |    0.000 |
      | 474218 | xxxx        | %                    | xxxx  | Query   |     1575 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx     |    0.000 |
      | 474223 | xxxx        | %                    | xxxx  | Query   |     1455 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx     |    0.000 |
      | 474225 | root        | localhost            | xxxx  | Sleep   |     1408 |                                 | NULL                               |    0.000 |
      | 474229 | xxxx        | %                    | xxxx  | Query   |     1335 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx     |    0.000 |
      | 474234 | xxxx        | %                    | xxxx  | Query   |     1215 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx     |    0.000 |
      | 474239 | xxxx        | %                    | xxxx  | Query   |     1095 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx     |    0.000 |
      | 474241 | root        | localhost            | NULL  | Sleep   |     1044 |                                 | NULL                               |    0.000 |
      | 474245 | xxxx        | %                    | xxxx  | Query   |      975 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx     |    0.000 |
      | 474250 | xxxx        | %                    | xxxx  | Query   |      855 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx     |    0.000 |
      | 474255 | xxxx        | %                    | xxxx  | Query   |      735 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx     |    0.000 |
      | 474260 | xxxx        | %                    | xxxx  | Query   |      615 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx     |    0.000 |
      | 474265 | xxxx        | %                    | xxxx  | Query   |      495 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx     |    0.000 |
      | 474270 | xxxx        | %                    | xxxx  | Query   |      375 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx     |    0.000 |
      | 474275 | xxxx        | %                    | xxxx  | Query   |      255 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx     |    0.000 |
      | 474280 | xxxx        | %                    | xxxx  | Query   |      135 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx     |    0.000 |
      | 474281 | root        | localhost            | xxxx  | Query   |        0 | starting                        | show processlist                   |    0.000 |
      | 474286 | xxxx        | %                    | xxxx  | Query   |       15 | Waiting to execute in isolation | TRUNCATE xxxxxxxxxxxxxxxxxxxxx     |    0.000 |
      +--------+-------------+----------------------+-------+---------+----------+---------------------------------+------------------------------------+----------+
      

      • Wsrep status:

        MariaDB [(none)]> show global status like 'wsrep%';
        +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
        | Variable_name                 | Value                                                                                                                                          |
        +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
        | wsrep_local_state_uuid        | 5f682e6f-957d-11f0-97e1-7ec9f3bc7bdc                                                                                                           |
        | wsrep_protocol_version        | 11                                                                                                                                             |
        | wsrep_last_committed          | 599998991                                                                                                                                      |
        | wsrep_replicated              | 373431                                                                                                                                         |
        | wsrep_replicated_bytes        | 765580936                                                                                                                                      |
        | wsrep_repl_keys               | 23011604                                                                                                                                       |
        | wsrep_repl_keys_bytes         | 193055184                                                                                                                                      |
        | wsrep_repl_data_bytes         | 547291425                                                                                                                                      |
        | wsrep_repl_other_bytes        | 0                                                                                                                                              |
        | wsrep_received                | 488032019                                                                                                                                      |
        | wsrep_received_bytes          | 635836905928                                                                                                                                   |
        | wsrep_local_commits           | 184116                                                                                                                                         |
        | wsrep_local_cert_failures     | 1                                                                                                                                              |
        | wsrep_local_replays           | 0                                                                                                                                              |
        | wsrep_local_send_queue        | 37                                                                                                                                             |
        | wsrep_local_send_queue_max    | 37                                                                                                                                             |
        | wsrep_local_send_queue_min    | 0                                                                                                                                              |
        | wsrep_local_send_queue_avg    | 0.00019345                                                                                                                                     |
        | wsrep_local_recv_queue        | 0                                                                                                                                              |
        | wsrep_local_recv_queue_max    | 299                                                                                                                                            |
        | wsrep_local_recv_queue_min    | 0                                                                                                                                              |
        | wsrep_local_recv_queue_avg    | 0.163409                                                                                                                                       |
        | wsrep_local_cached_downto     | 598051212                                                                                                                                      |
        | wsrep_flow_control_paused_ns  | 4900326180382                                                                                                                                  |
        | wsrep_flow_control_paused     | 0.000431198                                                                                                                                    |
        | wsrep_flow_control_sent       | 5871                                                                                                                                           |
        | wsrep_flow_control_recv       | 5865                                                                                                                                           |
        | wsrep_flow_control_active     | true                                                                                                                                           |
        | wsrep_flow_control_requested  | false                                                                                                                                          |
        | wsrep_cert_deps_distance      | 66.6429                                                                                                                                        |
        | wsrep_apply_oooe              | 0.121689                                                                                                                                       |
        | wsrep_apply_oool              | 0.00223408                                                                                                                                     |
        | wsrep_apply_window            | 1.2887                                                                                                                                         |
        | wsrep_apply_waits             | 49267                                                                                                                                          |
        | wsrep_commit_oooe             | 0                                                                                                                                              |
        | wsrep_commit_oool             | 0                                                                                                                                              |
        | wsrep_commit_window           | 1.10472                                                                                                                                        |
        | wsrep_local_state             | 4                                                                                                                                              |
        | wsrep_local_state_comment     | Synced                                                                                                                                         |
        | wsrep_cert_index_size         | 10479                                                                                                                                          |
        | wsrep_causal_reads            | 1                                                                                                                                              |
        | wsrep_cert_interval           | 100.176                                                                                                                                        |
        | wsrep_open_transactions       | 0                                                                                                                                              |
        | wsrep_open_connections        | 37                                                                                                                                             |
        | wsrep_incoming_addresses      | ,xxx.xxx.xx.x:0,xxx.xxx.xxx.xx:0                                                                                                              |
        | wsrep_cluster_weight          | 3                                                                                                                                              |
        | wsrep_desync_count            | 0                                                                                                                                              |
        | wsrep_evs_delayed             |                                                                                                                                                |
        | wsrep_evs_evict_list          |                                                                                                                                                |
        | wsrep_evs_repl_latency        | 0/0/0/0/0                                                                                                                                      |
        | wsrep_evs_state               | OPERATIONAL                                                                                                                                    |
        | wsrep_gcomm_uuid              | f0b72391-aec0-11f0-be31-be8343836e8e                                                                                                           |
        | wsrep_gmcast_segment          | 0                                                                                                                                              |
        | wsrep_applier_thread_count    | 16                                                                                                                                             |
        | wsrep_cluster_capabilities    |                                                                                                                                                |
        | wsrep_cluster_conf_id         | 626                                                                                                                                            |
        | wsrep_cluster_size            | 3                                                                                                                                              |
        | wsrep_cluster_state_uuid      | 5f682e6f-957d-11f0-97e1-7ec9f3bc7bdc                                                                                                           |
        | wsrep_cluster_status          | Primary                                                                                                                                        |
        | wsrep_connected               | ON                                                                                                                                             |
        | wsrep_local_bf_aborts         | 0                                                                                                                                              |
        | wsrep_local_index             | 2                                                                                                                                              |
        | wsrep_provider_capabilities   | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
        | wsrep_provider_name           | Galera                                                                                                                                         |
        | wsrep_provider_vendor         | Codership Oy <info@codership.com>                                                                                                              |
        | wsrep_provider_version        | 26.4.19(r5db72dad)                                                                                                                             |
        | wsrep_ready                   | ON                                                                                                                                             |
        | wsrep_rollbacker_thread_count | 1                                                                                                                                              |
        | wsrep_thread_count            | 17                                                                                                                                             |
        +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
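
        A few of these values can be checked automatically when monitoring for the stall. Below is a minimal Python sketch that parses the table above and flags three conditions. The choice of checks is an illustrative assumption, not an official Galera health rule, and the function names are hypothetical.

```python
def parse_status(text):
    """Parse "SHOW GLOBAL STATUS LIKE 'wsrep%'" table output into a dict."""
    status = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) == 2 and cells[0].startswith("wsrep_"):
            status[cells[0]] = cells[1]
    return status


def replication_warnings(status):
    """Return notes for values that may indicate stalled replication."""
    notes = []
    queue = int(status.get("wsrep_local_send_queue", 0))
    queue_max = int(status.get("wsrep_local_send_queue_max", 0))
    if queue > 0 and queue >= queue_max:
        notes.append(f"send queue at its recorded maximum ({queue})")
    if status.get("wsrep_flow_control_active") == "true":
        notes.append("flow control is active")
    if status.get("wsrep_cluster_status") != "Primary":
        notes.append("node is not in the primary component")
    return notes
```

        With the values posted above, the first two checks would fire: wsrep_local_send_queue equals wsrep_local_send_queue_max (37) and wsrep_flow_control_active is true, while wsrep_cluster_status is Primary.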
        

      • Errors from the MariaDB log. As you can see, there were network problems on node0.

        2026-03-02  8:47:11 0 [Note] WSREP: Unable to report last applied write-set to cluster. Will try later. (gcs_sm_enter(): 110 seqno: 599795896)
        2026-03-02  9:03:23 4752610 [Warning] Aborted connection 4752610 to db: 'xxxxt' user: 'xxxx' host: 'xxx.xxx.xx.x' (Got an error reading communication packets)
        2026-03-02  9:03:52 0 [Note] WSREP: Unable to report last applied write-set to cluster. Will try later. (gcs_sm_enter(): 110 seqno: 599996251)
        2026-03-02  9:04:16 4646307 [Warning] Aborted connection 4646307 to db: 'xxxx' user: 'xxxx' host: 'xxx.xxx.xx.x' (Got an error writing communication packets)
        2026-03-02  9:04:18 0 [Note] WSREP: Unable to report last applied write-set to cluster. Will try later. (gcs_sm_enter(): 110 seqno: 599996261)
        2026-03-02  9:05:56 4755507 [Warning] Aborted connection 4755507 to db: 'xxxx' user: 'xxxx' host: 'xxx.xxx.xx.x' (Got an error reading communication packets)
        2026-03-02  9:06:16 4758103 [Note] InnoDB: Number of transaction pools: 3
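
        To correlate these notes with the time of the stall, the "Unable to report last applied write-set" entries can be extracted from the log. Below is a minimal Python sketch, assuming the message format shown above. The pattern and function name are hypothetical and would need adjusting if the wording differs between Galera versions.

```python
import re

# Matches lines such as:
# 2026-03-02  8:47:11 0 [Note] WSREP: Unable to report last applied write-set
# to cluster. Will try later. (gcs_sm_enter(): 110 seqno: 599795896)
REPORT_FAILED = re.compile(
    r"^(\d{4}-\d{2}-\d{2})\s+(\d{1,2}:\d{2}:\d{2}).*"
    r"Unable to report last applied write-set.*"
    r"\(gcs_sm_enter\(\): (\d+) seqno: (\d+)\)"
)


def report_failures(log_text):
    """Return (date, time, code, seqno) for each failed last-applied report."""
    found = []
    for line in log_text.splitlines():
        match = REPORT_FAILED.search(line.strip())
        if match:
            date, time, code, seqno = match.groups()
            found.append((date, time, int(code), int(seqno)))
    return found
```

        Applied to the excerpt above, this yields three entries, with seqno values 599795896, 599996251 and 599996261.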
        

            People

              Assignee: Seppo Jaakola (seppo)
              Reporter: Sebastian Zolnik (szolnik)