[MDEV-26316] quorum is lost on full or bigger resync using wsrep on galera multi-master Created: 2021-08-06  Updated: 2021-12-09  Resolved: 2021-12-09

Status: Closed
Project: MariaDB Server
Component/s: Galera, wsrep
Affects Version/s: 10.5.11, 10.5
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Jaroslav Assignee: Jan Lindström (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None

Attachments: Text File wsrep.log    

 Description   

We unexpectedly hit a disaster scenario on our multi-master Galera setup running version 10.5.

During recovery we were able to bring the first node up without issues. It always started as Primary with quorum, was safe to bootstrap, and allowed another node to join the cluster:

root@mysql-0:/var/lib/mysql# cat grastate.dat
# GALERA saved state
version: 2.1
uuid:    676e5a48-fbd8-11ea-a02d-8f1e7fb61efb
seqno:   -1
safe_to_bootstrap: 1
 
MariaDB [(none)]> SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_status | Primary |
+----------------------+---------+
1 row in set (0.001 sec)
 
MariaDB [(none)]> SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
1 row in set (0.002 sec)
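
A minimal sketch of how the grastate.dat check above could be scripted before deciding which node to bootstrap. The function names are hypothetical, and the sample text is the file content shown above; it assumes the simple `key: value` layout seen there.

```python
def parse_grastate(text: str) -> dict:
    """Parse the 'key: value' lines of a grastate.dat file into a dict of strings."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '# GALERA saved state' comment header.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def is_safe_to_bootstrap(state: dict) -> bool:
    """True only when the file explicitly marks the node as safe to bootstrap."""
    return state.get("safe_to_bootstrap") == "1"


# Content copied from the grastate.dat shown in this report.
SAMPLE = """# GALERA saved state
version: 2.1
uuid:    676e5a48-fbd8-11ea-a02d-8f1e7fb61efb
seqno:   -1
safe_to_bootstrap: 1
"""

state = parse_grastate(SAMPLE)
print(state["uuid"], state["seqno"], is_safe_to_bootstrap(state))
```

Values are kept as strings on purpose, so a seqno of -1 is reported exactly as written in the file.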

To prevent any data mismatch and speed up the restore, we recreated the disk so that node-1 was blank and joined the cluster as a new node. This worked at first: a full sync started, but after roughly 19-25 GB of data had been synced the whole node crashed. The logs showed that quorum was lost and neither node knew which one was the master. The sync stopped and the nodes never rejoined to form any kind of cluster.
Sometimes the sync ended sooner with the same quorum loss, sometimes later. After many retries, made in the hope that one attempt would eventually complete, we successfully synced the node.

The error log is attached; this is where the error was thrown:

2021-08-06  7:15:45 0 [Warning] WSREP: Quorum: No node with complete state:

	Version      : 6
	Flags        : 0x1
	Protocols    : 2 / 10 / 4
	State        : JOINER
	Desync count : 0
	Prim state   : JOINER
	Prim UUID    : 610fc4ce-f685-11eb-9e4f-725b850de788
	Prim  seqno  : 2
	First seqno  : -1
	Last  seqno  : 245421524
	Commit cut   : 245421512
	Last vote    : -1.0
	Vote policy  : 0
	Prim JOINED  : 1
	State UUID   : 1dda2d24-f686-11eb-8c64-5ab16981001e
	Group UUID   : 676e5a48-fbd8-11ea-a02d-8f1e7fb61efb
	Name         : 'mysql-1'
	Incoming addr: 'AUTO'

2021-08-06  7:15:45 0 [Warning] WSREP: No re-merged primary component found.
2021-08-06  7:15:45 0 [Warning] WSREP: No bootstrapped primary component found.
2021-08-06  7:15:45 0 [ERROR] WSREP: /home/buildbot/buildbot/build/gcs/src/gcs_state_msg.cpp:gcs_state_msg_get_quorum():947: Failed to establish quorum.
2021-08-06  7:15:45 0 [Note] WSREP: Quorum results:
	version    = 6,
	component  = NON-PRIMARY,
	conf_id    = -1,
	members    = 1/1 (joined/total),
	act_id     = -1,
	last_appl. = 245421512,
	protocols  = -1/-1/-1 (gcs/repl/appl),
	vote policy= 1,
	group UUID = 00000000-0000-0000-0000-000000000000
2021-08-06  7:15:45 0 [Note] WSREP: Flow-control interval: [16, 16]



 Comments   
Comment by Jaroslav [ 2021-08-11 ]

This can be closed (deleted). It was caused by our liveness probe kicking in too soon.
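
The report does not say how the probe was configured, so the following is only an illustrative sketch of one way to keep a liveness check from restarting a node partway through a long state transfer. It assumes a Kubernetes-style setup with separate liveness and readiness checks; the function names, parameters, and the grace period are all hypothetical. The status strings compared against are the `wsrep_cluster_status` and `wsrep_local_state_comment` values shown in the description.

```python
from typing import Optional


def readiness_ok(cluster_status: Optional[str], local_state: Optional[str]) -> bool:
    """Ready for client traffic only when in the Primary component and Synced."""
    return cluster_status == "Primary" and local_state == "Synced"


def liveness_ok(
    process_running: bool,
    seconds_since_start: float,
    sql_reachable: bool,
    startup_grace_seconds: float,
) -> bool:
    """Decide whether the node should be left running.

    Assumption: a joining node may not answer SQL queries while it receives
    a full state transfer, so an unreachable SQL port is tolerated until the
    (hypothetical) grace period has elapsed.
    """
    if not process_running:
        return False
    if sql_reachable:
        return True
    return seconds_since_start < startup_grace_seconds


# A node 30 minutes into a transfer, not yet answering queries, with a
# 2-hour grace period: alive, but not ready for traffic.
print(liveness_ok(True, 1800.0, False, 7200.0))
print(readiness_ok(None, None))
```

Keeping the two decisions separate means a node that is still joining is held out of rotation by the readiness check without being killed by the liveness check.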
