Details
- Bug
- Status: Open
- Major
- Resolution: Unresolved
- 11.0.3
- None
- None
Description
Dear all,
we run a MariaDB Galera Cluster in Kubernetes. Once in a while the Galera cluster gets stuck and no process makes any progress. Any command I run then waits for the "Waiting for certification" state to finish, so all 3 nodes are effectively unusable.
Version Information:
MariaDB [(none)]> show variables like "%version%";
+-----------------------------------+------------------------------------------+
| Variable_name | Value |
+-----------------------------------+------------------------------------------+
| in_predicate_conversion_threshold | 1000 |
| protocol_version | 10 |
| slave_type_conversions | |
| system_versioning_alter_history | ERROR |
| system_versioning_asof | DEFAULT |
| system_versioning_insert_history | OFF |
| tls_version | TLSv1.1,TLSv1.2,TLSv1.3 |
| version | 11.0.3-MariaDB-log |
| version_comment | Source distribution |
| version_compile_machine | x86_64 |
| version_compile_os | Linux |
| version_malloc_library | system |
| version_source_revision | 70905bcb9059dcc40db3b73bc46a36c7d40f1e10 |
| version_ssl_library | OpenSSL 1.1.1n 15 Mar 2022 |
| wsrep_patch_version | wsrep_26.22 |
| wsrep_provider_evs_version | 1 |
| wsrep_provider_gmcast_version | 0 |
| wsrep_provider_pc_version | 0 |
| wsrep_provider_protonet_version | 0 |
+-----------------------------------+------------------------------------------+
I checked the process list on every node. One node (node2) shows threads in the "Waiting for certification" state, and that state never changes.
SHOW PROCESSLIST State:
node0:
MariaDB [(none)]> SHOW PROCESSLIST;
+--------+-------------+-------------------+------------+---------+---------+-------------------------+------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+--------+-------------+-------------------+------------+---------+---------+-------------------------+------------------+----------+
| 1 | system user | | NULL | Sleep | 1017475 | wsrep aborter idle | NULL | 0.000 |
| 2 | system user | | NULL | Sleep | 3654 | wsrep applier committed | NULL | 0.000 |
| 7 | system user | | NULL | Sleep | 3654 | wsrep applier committed | NULL | 0.000 |
| 9 | system user | | NULL | Sleep | 3590 | wsrep applier committed | NULL | 0.000 |
| 10 | system user | | NULL | Sleep | 1790 | wsrep applier committed | NULL | 0.000 |
| 203061 | root | 10.42.20.85:38412 | <masked-text> | Sleep | 3654 | | NULL | 0.000 |
| 203253 | root | 10.42.20.85:48306 | <masked-text> | Sleep | 3654 | | NULL | 0.000 |
| 203469 | root | 10.42.20.85:50558 | <masked-text> | Sleep | 3654 | | NULL | 0.000 |
| 203593 | root | localhost | NULL | Sleep | 7797 | | NULL | 0.000 |
| 204246 | root | localhost | NULL | Sleep | 3125 | | NULL | 0.000 |
| 204438 | root | 10.42.20.85:59826 | <masked-text> | Sleep | 3654 | | NULL | 0.000 |
| 204584 | root | localhost | NULL | Query | 0 | starting | SHOW PROCESSLIST | 0.000 |
| 204829 | root | 10.42.2.105:53198 | camunda | Sleep | 0 | | NULL | 0.000 |
| 205058 | root | 10.42.2.105:56408 | camunda | Sleep | 590 | | NULL | 0.000 |
node1:
MariaDB [(none)]> SHOW PROCESSLIST;
+--------+-------------+--------------------+--------------------+---------+---------+-------------------------+------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+--------+-------------+--------------------+--------------------+---------+---------+-------------------------+------------------+----------+
| 1 | system user | | NULL | Sleep | 2835 | wsrep applier committed | NULL | 0.000 |
| 2 | system user | | NULL | Sleep | 2946115 | wsrep aborter idle | NULL | 0.000 |
| 7 | system user | | NULL | Sleep | 1035 | wsrep applier committed | NULL | 0.000 |
| 8 | system user | | NULL | Sleep | 135 | wsrep applier committed | NULL | 0.000 |
| 10 | system user | | NULL | Sleep | 3735 | wsrep applier committed | NULL | 0.000 |
| 389044 | <masked-text> | 10.42.21.144:39916 | <masked-text> | Sleep | 137 | | NULL | 0.000 |
| 592531 | root | 10.42.20.85:36294 | <masked-text> | Sleep | 3799 | | NULL | 0.000 |
| 592840 | root | localhost | NULL | Sleep | 6451 | | NULL | 0.000 |
| 593457 | root | 10.42.20.85:54570 | <masked-text> | Sleep | 3799 | | NULL | 0.000 |
| 593538 | root | localhost | NULL | Sleep | 3280 | | NULL | 0.000 |
| 593726 | root | 10.42.20.85:59822 | <masked-text> | Sleep | 3799 | | NULL | 0.000 |
| 594363 | root | 10.42.2.105:60570 | camunda | Sleep | 638 | | NULL | 0.000 |
| 594376 | root | 10.42.2.105:36094 | camunda | Sleep | 580 | | NULL | 0.000 |
| 594381 | root | 10.42.2.105:38352 | camunda | Sleep | 557 | | NULL | 0.000 |
| 594388 | root | 10.42.2.105:49956 | camunda | Sleep | 526 | | NULL | 0.000 |
| 594493 | root | localhost | NULL | Query | 0 | starting | SHOW PROCESSLIST | 0.000 |
+--------+-------------+--------------------+--------------------+---------+---------+-------------------------+------------------+----------+
16 rows in set (0.000 sec)
node2:
MariaDB [(none)]> SHOW PROCESSLIST;
+-------+-------------+--------------------+------------+---------+-------+-----------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+-------+-------------+--------------------+------------+---------+-------+-----------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
| 1 | system user | | NULL | Sleep | 88846 | wsrep aborter idle | NULL | 0.000 |
| 2 | system user | | NULL | Sleep | 1735 | Waiting for certification | NULL | 0.000 |
| 6 | system user | | NULL | Sleep | 1666 | Waiting for certification | NULL | 0.000 |
| 7 | system user | | <masked-text> | Sleep | 2 | Delete_rows_log_event::ha_delete_row(24399) on table `H_TRANSACTIONS` | DELETE FROM `H_TRANSACTIONS` WHERE `H_ACCOUNT_ID` IN ('89510002515', '81816370621', '81819150657', ' | 0.000 |
| 9 | system user | | NULL | Sleep | 1735 | Waiting for certification | NULL | 0.000 |
| 832 | root | 10.42.19.119:52826 | <masked-text> | Query | 13310 | Commit | INSERT INTO `IMPORT_LOG`(`ID`, `LEGACY_SOURCE_SYSTEM`, `MINIO_PATH`, `FILE_NAME`, `CHANGES_OF_DAY`, | 0.000 |
| 833 | root | 10.42.19.119:52824 | <masked-text> | Query | 13310 | Commit | INSERT INTO `IMPORT_LOG`(`ID`, `LEGACY_SOURCE_SYSTEM`, `MINIO_PATH`, `FILE_NAME`, `CHANGES_OF_DAY`, | 0.000 |
| 16480 | root | 10.42.20.85:34928 | <masked-text> | Query | 5085 | starting | COMMIT | 0.000 |
| 16537 | root | 10.42.20.85:56224 | <masked-text> | Sleep | 101 | | NULL | 0.000 |
| 16550 | root | 10.42.20.85:34616 | <masked-text> | Query | 3860 | starting | COMMIT | 0.000 |
| 16792 | root | localhost | NULL | Query | 3574 | Waiting to execute in isolation | create database test20 | 0.000 |
| 17170 | root | 10.42.2.105:43124 | camunda | Sleep | 0 | | NULL | 0.000 |
| 17239 | root | 10.42.2.105:51736 | camunda | Sleep | 1401 | | NULL | 0.000 |
| 17368 | root | 10.42.2.105:54384 | camunda | Sleep | 752 | | NULL | 0.000 |
| 17515 | root | localhost | NULL | Query | 0 | starting | SHOW PROCESSLIST | 0.000 |
+-------+-------------+--------------------+------------+---------+-------+-----------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
15 rows in set (0.001 sec)
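The manual check of the three process lists can be automated. The following is a minimal sketch, not part of the original report: it assumes the output was captured as text from the `mariadb` client, and the names `parse_processlist`, `stuck_threads`, and the 600-second threshold are hypothetical choices for illustration.

```python
from typing import NamedTuple


class Thread(NamedTuple):
    id: int
    user: str
    command: str
    time_s: int
    state: str


def parse_processlist(text: str) -> list[Thread]:
    """Parse the ASCII table the mariadb client prints for SHOW PROCESSLIST."""
    threads = []
    for line in text.splitlines():
        # Drop the empty fields before the first and after the last "|".
        cells = [cell.strip() for cell in line.strip().split("|")[1:-1]]
        # Columns: Id, User, Host, db, Command, Time, State, Info, Progress.
        # Borders, the header row, and the "N rows in set" footer are skipped.
        # Limitation: an Info value that itself contains "|" is skipped too.
        if len(cells) != 9 or not cells[0].isdigit():
            continue
        threads.append(
            Thread(int(cells[0]), cells[1], cells[4], int(cells[5]), cells[6])
        )
    return threads


def stuck_threads(threads, state="Waiting for certification", min_seconds=600):
    """Return threads that have been in `state` for at least `min_seconds`."""
    return [t for t in threads if t.state == state and t.time_s >= min_seconds]


# Excerpt of the node2 output above.
NODE2_SAMPLE = """
| 1 | system user | | NULL | Sleep | 88846 | wsrep aborter idle | NULL | 0.000 |
| 2 | system user | | NULL | Sleep | 1735 | Waiting for certification | NULL | 0.000 |
| 6 | system user | | NULL | Sleep | 1666 | Waiting for certification | NULL | 0.000 |
| 9 | system user | | NULL | Sleep | 1735 | Waiting for certification | NULL | 0.000 |
| 16792 | root | localhost | NULL | Query | 3574 | Waiting to execute in isolation | create database test20 | 0.000 |
"""

if __name__ == "__main__":
    for t in stuck_threads(parse_processlist(NODE2_SAMPLE)):
        print(f"thread {t.id} ({t.user}): {t.state} for {t.time_s}s")
```

On the node2 excerpt this flags threads 2, 6, and 9, the three applier threads that have been waiting for more than 1600 seconds.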
The wsrep status looks OK to me on all nodes:
node0:
MariaDB [(none)]> SHOW GLOBAL STATUS LIKE 'wsrep_%';
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| wsrep_local_state_uuid | 37c66242-3f71-11ef-b0a6-debd1bb3b324 |
| wsrep_protocol_version | 10 |
| wsrep_last_committed | 24502 |
| wsrep_replicated | 2064 |
| wsrep_replicated_bytes | 263079328 |
| wsrep_repl_keys | 2771687 |
| wsrep_repl_keys_bytes | 22223400 |
| wsrep_repl_data_bytes | 78246653 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 3424 |
| wsrep_received_bytes | 1396591210 |
| wsrep_local_commits | 2062 |
| wsrep_local_cert_failures | 1 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 4 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0.0106792 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 32 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 1.8455 |
| wsrep_local_cached_downto | 24385 |
| wsrep_flow_control_paused_ns | 40703822696367 |
| wsrep_flow_control_paused | 0.0401197 |
| wsrep_flow_control_sent | 19 |
| wsrep_flow_control_recv | 49 |
| wsrep_flow_control_active | false |
| wsrep_flow_control_requested | false |
| wsrep_cert_deps_distance | 48.7026 |
| wsrep_apply_oooe | 0.231066 |
| wsrep_apply_oool | 0.0150318 |
| wsrep_apply_window | 1.63789 |
| wsrep_apply_waits | 93 |
| wsrep_commit_oooe | 0 |
| wsrep_commit_oool | 0 |
| wsrep_commit_window | 1.28984 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 30924 |
| wsrep_causal_reads | 1 |
| wsrep_cert_interval | 133.325 |
| wsrep_open_transactions | 0 |
| wsrep_open_connections | 0 |
| wsrep_incoming_addresses | 10.42.2.37:0,10.42.0.120:0,10.42.4.36:0 |
| wsrep_cluster_weight | 3 |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | 1afb841a-440f-11ef-8946-47d7221eea33 |
| wsrep_gmcast_segment | 0 |
| wsrep_applier_thread_count | 4 |
| wsrep_cluster_capabilities | |
| wsrep_cluster_conf_id | 23 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 37c66242-3f71-11ef-b0a6-debd1bb3b324 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 1 |
| wsrep_local_index | 0 |
| wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 4.16(rc333b19) |
| wsrep_ready | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 5 |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
node1:
MariaDB [(none)]> SHOW GLOBAL STATUS LIKE 'wsrep_%';
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| wsrep_local_state_uuid | 37c66242-3f71-11ef-b0a6-debd1bb3b324 |
| wsrep_protocol_version | 10 |
| wsrep_last_committed | 24502 |
| wsrep_replicated | 7291 |
| wsrep_replicated_bytes | 2414952720 |
| wsrep_repl_keys | 25587851 |
| wsrep_repl_keys_bytes | 204880520 |
| wsrep_repl_data_bytes | 615445311 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 14362 |
| wsrep_received_bytes | 4408608008 |
| wsrep_local_commits | 7275 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 5 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0.00843388 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 31 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0.580281 |
| wsrep_local_cached_downto | 24385 |
| wsrep_flow_control_paused_ns | 87648040329592 |
| wsrep_flow_control_paused | 0.0297835 |
| wsrep_flow_control_sent | 21 |
| wsrep_flow_control_recv | 90 |
| wsrep_flow_control_active | false |
| wsrep_flow_control_requested | false |
| wsrep_cert_deps_distance | 113.213 |
| wsrep_apply_oooe | 0.0966473 |
| wsrep_apply_oool | 0.0094386 |
| wsrep_apply_window | 1.27937 |
| wsrep_apply_waits | 324 |
| wsrep_commit_oooe | 0 |
| wsrep_commit_oool | 0 |
| wsrep_commit_window | 1.13696 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 30924 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 270.552 |
| wsrep_open_transactions | 0 |
| wsrep_open_connections | 0 |
| wsrep_incoming_addresses | 10.42.2.37:0,10.42.0.120:0,10.42.4.36:0 |
| wsrep_cluster_weight | 3 |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | e0e9365e-440f-11ef-9c8d-4ac33822081a |
| wsrep_gmcast_segment | 0 |
| wsrep_applier_thread_count | 4 |
| wsrep_cluster_capabilities | |
| wsrep_cluster_conf_id | 23 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 37c66242-3f71-11ef-b0a6-debd1bb3b324 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 1 |
| wsrep_local_index | 2 |
| wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 4.16(rc333b19) |
| wsrep_ready | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 5 |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
node2:
MariaDB [(none)]> SHOW GLOBAL STATUS LIKE 'wsrep_%';
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| wsrep_local_state_uuid | 37c66242-3f71-11ef-b0a6-debd1bb3b324 |
| wsrep_protocol_version | 10 |
| wsrep_last_committed | 24502 |
| wsrep_replicated | 326 |
| wsrep_replicated_bytes | 576701336 |
| wsrep_repl_keys | 6132471 |
| wsrep_repl_keys_bytes | 49068208 |
| wsrep_repl_data_bytes | 143674284 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 1657 |
| wsrep_received_bytes | 614555552 |
| wsrep_local_commits | 325 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 4 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0.0217822 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 36 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 5.16295 |
| wsrep_local_cached_downto | 24385 |
| wsrep_flow_control_paused_ns | 38706322913812 |
| wsrep_flow_control_paused | 0.428917 |
| wsrep_flow_control_sent | 26 |
| wsrep_flow_control_recv | 43 |
| wsrep_flow_control_active | false |
| wsrep_flow_control_requested | false |
| wsrep_cert_deps_distance | 46.0429 |
| wsrep_apply_oooe | 0.60321 |
| wsrep_apply_oool | 0.0243498 |
| wsrep_apply_window | 2.58273 |
| wsrep_apply_waits | 61 |
| wsrep_commit_oooe | 0 |
| wsrep_commit_oool | 0 |
| wsrep_commit_window | 1.6497 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 30924 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 444.097 |
| wsrep_open_transactions | 0 |
| wsrep_open_connections | 0 |
| wsrep_incoming_addresses | 10.42.2.37:0,10.42.0.120:0,10.42.4.36:0 |
| wsrep_cluster_weight | 3 |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | 94989bc2-5e01-11ef-bb7b-db95587363ac |
| wsrep_gmcast_segment | 0 |
| wsrep_applier_thread_count | 4 |
| wsrep_cluster_capabilities | |
| wsrep_cluster_conf_id | 23 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 37c66242-3f71-11ef-b0a6-debd1bb3b324 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 2 |
| wsrep_local_index | 1 |
| wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 4.16(rc333b19) |
| wsrep_ready | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 5 |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
69 rows in set (0.001 sec)
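Comparing three 69-row status dumps by eye is error-prone, so a small script can help. The sketch below is illustrative only: the values are copied from the outputs above, while the helper names `disagreements` and `outliers`, the list of keys in `MUST_MATCH`, and the "3 x median" threshold are assumptions made for this example, not Galera recommendations.

```python
from statistics import median

# Values copied from the three SHOW GLOBAL STATUS outputs above.
COMMON = {
    "wsrep_cluster_state_uuid": "37c66242-3f71-11ef-b0a6-debd1bb3b324",
    "wsrep_cluster_conf_id": "23",
    "wsrep_cluster_size": "3",
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state_comment": "Synced",
    "wsrep_last_committed": "24502",
}
NODES = {
    "node0": {**COMMON, "wsrep_flow_control_paused": "0.0401197"},
    "node1": {**COMMON, "wsrep_flow_control_paused": "0.0297835"},
    "node2": {**COMMON, "wsrep_flow_control_paused": "0.428917"},
}

# Hypothetical selection of keys that should be identical on every node.
MUST_MATCH = tuple(COMMON)


def disagreements(nodes, keys):
    """Map each key to its per-node values when the nodes do not all agree."""
    result = {}
    for key in keys:
        values = {name: status[key] for name, status in nodes.items()}
        if len(set(values.values())) > 1:
            result[key] = values
    return result


def outliers(nodes, key, factor=3.0):
    """Return nodes whose numeric value for `key` exceeds factor x the median."""
    values = {name: float(status[key]) for name, status in nodes.items()}
    limit = factor * median(values.values())
    return sorted(name for name, value in values.items() if value > limit)


if __name__ == "__main__":
    print("disagreements:", disagreements(NODES, MUST_MATCH))
    print("flow control outliers:", outliers(NODES, "wsrep_flow_control_paused"))
```

With the values from this report, the first check finds no disagreement, which matches the impression that the cluster state looks healthy. The second check singles out node2, whose `wsrep_flow_control_paused` (0.428917) is roughly ten times that of the other nodes. That is only a hint about where to look, not evidence of the root cause.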
The logs do not show much. I only found these lines in the DB pods, which indicate at least some warning or error behavior:
[Warning] WSREP: Failed to report last committed 37c66242-3f71-11ef-b0a6-<masked-text>:24192, -110 (Connection timed out)
2024-08-20 7:07:06 185410 [Warning] Aborted connection 185410 to db: '<masked-text>' user: 'root' host: '<masked-text>' (Got an error reading communication packets)
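To relate the WSREP warning to the status output, the seqno and error code can be pulled out of the log line. This is a rough sketch: the regular expression is inferred from the single line quoted above, and `parse_report_failure` is a hypothetical helper, so other Galera versions may format the message differently.

```python
import re

# Pattern inferred from the one warning quoted above (an assumption).
REPORT_FAILURE = re.compile(
    r"WSREP: Failed to report last committed "
    r"(?P<uuid>\S+):(?P<seqno>\d+), (?P<code>-?\d+) \((?P<message>[^)]*)\)"
)


def parse_report_failure(line):
    """Return uuid, seqno, code, and message from the warning, or None."""
    match = REPORT_FAILURE.search(line)
    if match is None:
        return None
    return {
        "uuid": match["uuid"],
        "seqno": int(match["seqno"]),
        "code": int(match["code"]),
        "message": match["message"],
    }


LINE = (
    "[Warning] WSREP: Failed to report last committed "
    "37c66242-3f71-11ef-b0a6-<masked-text>:24192, -110 (Connection timed out)"
)
LAST_COMMITTED = 24502  # wsrep_last_committed from the status output above

if __name__ == "__main__":
    event = parse_report_failure(LINE)
    print("seqno:", event["seqno"], "code:", event["code"])
    print("seqnos committed since the warning:", LAST_COMMITTED - event["seqno"])
```

The code -110 is the negated Linux errno for ETIMEDOUT, which agrees with the message text. The warning refers to seqno 24192, while all three nodes report `wsrep_last_committed` 24502, so the cluster committed further writesets after that warning was logged.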
Any advice, reading material, or help is appreciated. This happens about once every 1-2 months, and the only way we have found to fix it is to restart and re-bootstrap the whole cluster.
Issue Links
- relates to MDEV-33252 One Node in galera cluster stucks with a query in state "Waiting for certification" (Open)