Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Not a Bug
Affects Version/s: 10.5.11, 10.5
Fix Version/s: None
Description
We unexpectedly hit a disaster scenario on our multi-master Galera setup running version 10.5.
During recovery we were able to bring the first node up without issues. It always started as Primary with quorum, was marked safe to bootstrap, and allowed another node to join the cluster.
root@mysql-0:/var/lib/mysql# cat grastate.dat
# GALERA saved state
version: 2.1
uuid: 676e5a48-fbd8-11ea-a02d-8f1e7fb61efb
seqno: -1
safe_to_bootstrap: 1

MariaDB [(none)]> SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_status | Primary |
+----------------------+---------+
1 row in set (0.001 sec)

MariaDB [(none)]> SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
1 row in set (0.002 sec)
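The "safe to bootstrap" check above can also be scripted. The following is only a minimal sketch, assuming grastate.dat keeps the simple `key: value` layout shown in the output; the function names `parse_grastate` and `is_safe_to_bootstrap` are made up for illustration and are not part of MariaDB or Galera.

```python
def parse_grastate(text: str) -> dict:
    """Parse the 'key: value' lines of a grastate.dat file, skipping comments."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and the '# GALERA saved state' header
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def is_safe_to_bootstrap(state: dict) -> bool:
    """True only when the file explicitly says safe_to_bootstrap: 1."""
    return state.get("safe_to_bootstrap") == "1"


# Contents copied from the grastate.dat output in this report.
SAMPLE = """# GALERA saved state
version: 2.1
uuid: 676e5a48-fbd8-11ea-a02d-8f1e7fb61efb
seqno: -1
safe_to_bootstrap: 1
"""

if __name__ == "__main__":
    state = parse_grastate(SAMPLE)
    print(state["uuid"], is_safe_to_bootstrap(state))
```

On a real node the text would be read from /var/lib/mysql/grastate.dat instead of the embedded sample.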
To prevent any data mismatch and make the restore faster, we recreated the disk so that node-1 was blank and joined the cluster as a new node. This worked at first: the full sync started, but after roughly 19-25 GB of data had been transferred the whole node crashed. The logs showed that quorum was lost and that neither node knew which one was primary. The sync stopped and the nodes never rejoined to form any kind of cluster.
Sometimes the sync ended sooner with the same quorum break, sometimes later. After many retries, made in the hope that one attempt would eventually run to completion, we successfully synced the node.
The error log where this was thrown is attached; the relevant part is:
2021-08-06T07:15:00.0000000Z 2021-08-06 7:15:45 0 [Warning] WSREP: Quorum: No node with complete state:
2021-08-06T07:15:00.0000000Z
2021-08-06T07:15:00.0000000Z   Version : 6
2021-08-06T07:15:00.0000000Z   Flags : 0x1
2021-08-06T07:15:00.0000000Z   Protocols : 2 / 10 / 4
2021-08-06T07:15:00.0000000Z   State : JOINER
2021-08-06T07:15:00.0000000Z   Desync count : 0
2021-08-06T07:15:00.0000000Z   Prim state : JOINER
2021-08-06T07:15:00.0000000Z   Prim UUID : 610fc4ce-f685-11eb-9e4f-725b850de788
2021-08-06T07:15:00.0000000Z   Prim seqno : 2
2021-08-06T07:15:00.0000000Z   First seqno : -1
2021-08-06T07:15:00.0000000Z   Last seqno : 245421524
2021-08-06T07:15:00.0000000Z   Commit cut : 245421512
2021-08-06T07:15:00.0000000Z   Last vote : -1.0
2021-08-06T07:15:00.0000000Z   Vote policy : 0
2021-08-06T07:15:00.0000000Z   Prim JOINED : 1
2021-08-06T07:15:00.0000000Z   State UUID : 1dda2d24-f686-11eb-8c64-5ab16981001e
2021-08-06T07:15:00.0000000Z   Group UUID : 676e5a48-fbd8-11ea-a02d-8f1e7fb61efb
2021-08-06T07:15:00.0000000Z   Name : 'mysql-1'
2021-08-06T07:15:00.0000000Z   Incoming addr: 'AUTO'
2021-08-06T07:15:00.0000000Z
2021-08-06T07:15:00.0000000Z 2021-08-06 7:15:45 0 [Warning] WSREP: No re-merged primary component found.
2021-08-06T07:15:00.0000000Z 2021-08-06 7:15:45 0 [Warning] WSREP: No bootstrapped primary component found.
2021-08-06T07:15:00.0000000Z 2021-08-06 7:15:45 0 [ERROR] WSREP: /home/buildbot/buildbot/build/gcs/src/gcs_state_msg.cpp:gcs_state_msg_get_quorum():947: Failed to establish quorum.
2021-08-06T07:15:00.0000000Z 2021-08-06 7:15:45 0 [Note] WSREP: Quorum results:
2021-08-06T07:15:00.0000000Z   version = 6,
2021-08-06T07:15:00.0000000Z   component = NON-PRIMARY,
2021-08-06T07:15:00.0000000Z   conf_id = -1,
2021-08-06T07:15:00.0000000Z   members = 1/1 (joined/total),
2021-08-06T07:15:00.0000000Z   act_id = -1,
2021-08-06T07:15:00.0000000Z   last_appl. = 245421512,
2021-08-06T07:15:00.0000000Z   protocols = -1/-1/-1 (gcs/repl/appl),
2021-08-06T07:15:00.0000000Z   vote policy= 1,
2021-08-06T07:15:00.0000000Z   group UUID = 00000000-0000-0000-0000-000000000000
2021-08-06T07:15:00.0000000Z 2021-08-06 7:15:45 0 [Note] WSREP: Flow-control interval: [16, 16]

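For anyone triaging a similar log, the two lines that matter most above are the `Failed to establish quorum` error and the `component = NON-PRIMARY` result. The sketch below shows one way to pull them out of a log excerpt. It assumes only the message wording seen in this report; the function name `summarize_quorum` is hypothetical, and other Galera versions may word these messages differently.

```python
import re

# Patterns based only on the messages quoted in this report.
QUORUM_FAILURE = re.compile(r"\[ERROR\] WSREP: .*Failed to establish quorum")
COMPONENT = re.compile(r"component\s*=\s*([A-Z-]+)")


def summarize_quorum(lines):
    """Report whether a quorum failure was logged and the last component seen."""
    failed = False
    component = None
    for line in lines:
        if QUORUM_FAILURE.search(line):
            failed = True
        match = COMPONENT.search(line)
        if match:
            component = match.group(1)  # later entries overwrite earlier ones
    return {"quorum_failed": failed, "component": component}


# Three lines copied from the log in this report.
SAMPLE_LINES = [
    "2021-08-06 7:15:45 0 [ERROR] WSREP: /home/buildbot/buildbot/build/gcs/src/"
    "gcs_state_msg.cpp:gcs_state_msg_get_quorum():947: Failed to establish quorum.",
    "2021-08-06 7:15:45 0 [Note] WSREP: Quorum results:",
    " component = NON-PRIMARY,",
]

if __name__ == "__main__":
    print(summarize_quorum(SAMPLE_LINES))
```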
This can be closed (deleted). It was caused by our liveness probe kicking in too soon.
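To illustrate the root cause: a probe that judges a joining node by its health before the full state transfer has finished will restart it mid-sync. The sketch below is not our actual probe; it only shows the idea of a startup grace period. `probe_reports_alive` and `grace_seconds` are hypothetical names, and the 3600-second default is an arbitrary placeholder that would need to exceed the longest expected sync. If the probe is a Kubernetes liveness probe, `initialDelaySeconds` or a `startupProbe` plays this role.

```python
def probe_reports_alive(started_at: float, now: float, healthy: bool,
                        grace_seconds: float = 3600.0) -> bool:
    """Decide what a liveness check should report for a node.

    During the grace period the node is always reported alive, so a long
    initial sync is not interrupted. Afterwards the real health result is used.
    """
    if now - started_at < grace_seconds:
        return True
    return healthy


if __name__ == "__main__":
    # Ten minutes into a sync the node is not yet healthy but must not be killed.
    print(probe_reports_alive(started_at=0.0, now=600.0, healthy=False))
    # Well past the grace period an unhealthy node is reported as such.
    print(probe_reports_alive(started_at=0.0, now=7200.0, healthy=False))
```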