[MDEV-26735] unexpected Galera desync when running backup stage block_commit at MariaDB 10.5.12 Created: 2021-09-30  Updated: 2021-10-01

Status: Open
Project: MariaDB Server
Component/s: Galera, Server
Affects Version/s: 10.5.12
Fix Version/s: None

Type: Bug Priority: Major
Reporter: William Wong Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Environment:

redhat 7



 Description   

Hi,

We just upgraded our DB cluster to MariaDB 10.5.12 and found the unexpected Galera desync shown below when running the "backup stage block_commit" command. We use storage snapshots for backup, and the backup stage commands are used to ensure data consistency in the snapshot.

The node rejoins the cluster when "backup stage end" is run.

mysql> backup stage start ;

<<no message in error log which is normal>>

mysql> backup stage block_commit ;

2021-09-30 11:59:16 7 [Note] WSREP: Desyncing and pausing the provider
2021-09-30 11:59:16 0 [Note] WSREP: Member 0.0 (node1) desyncs itself from group
2021-09-30 11:59:16 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 51)
2021-09-30 11:59:16 7 [Note] WSREP: pause
2021-09-30 11:59:16 7 [Note] WSREP: Provider paused at b738a6ce-081b-11ec-86f1-ba40f7f51758:51 (8)
2021-09-30 11:59:16 7 [Note] WSREP: Provider paused at: 51

mysql> backup stage end ;

2021-09-30 11:59:16 7 [Note] WSREP: Resuming and resyncing the provider
2021-09-30 11:59:16 7 [Note] WSREP: resume
2021-09-30 11:59:16 7 [Note] WSREP: resuming provider at 8
2021-09-30 11:59:16 7 [Note] WSREP: Provider resumed.
2021-09-30 11:59:16 0 [Note] WSREP: Member 0.0 (node1) resyncs itself to group.
2021-09-30 11:59:16 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 51)
2021-09-30 11:59:16 0 [Note] WSREP: Member 0.0 (node1) synced with group.
2021-09-30 11:59:16 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 51)
2021-09-30 11:59:16 2 [Note] WSREP: Server node1 synced with group
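
The state changes in the log excerpts above can be pulled out mechanically, which makes the SYNCED -> DONOR/DESYNCED -> JOINED -> SYNCED cycle easier to see. The following is only a minimal sketch: the helper name `state_transitions` and the regular expression are illustrative assumptions based on the message format shown in this report, not part of MariaDB or Galera.

```python
import re

# Matches lines such as:
#   "... [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 51)"
# The pattern is an assumption derived from the log lines in this report.
SHIFT_RE = re.compile(r"WSREP: Shifting (\S+) -> (\S+) \(TO: (\d+)\)")


def state_transitions(log_text):
    """Return (old_state, new_state, seqno) tuples found in an error log."""
    return [
        (m.group(1), m.group(2), int(m.group(3)))
        for m in SHIFT_RE.finditer(log_text)
    ]


# The three "Shifting" lines copied from the log excerpts in this report.
LOG = """\
2021-09-30 11:59:16 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 51)
2021-09-30 11:59:16 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 51)
2021-09-30 11:59:16 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 51)
"""

if __name__ == "__main__":
    for old, new, seqno in state_transitions(LOG):
        print(f"{old} -> {new} at seqno {seqno}")
```

Run against the excerpt, this reports three transitions, all at seqno 51: the node leaves SYNCED on "backup stage block_commit" and only returns to SYNCED after "backup stage end".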



 Comments   
Comment by William Wong [ 2021-10-01 ]

This Galera desync behavior change comes from the PR for MDEV-23080 (https://github.com/MariaDB/server/pull/1877).

Our production system relies on snapshot technology to take backups, following https://mariadb.com/kb/en/storage-snapshots-and-backup-stage-commands/. At the same time, our system requires RPO=0.

Our backup procedure is:
1. backup stage start + backup stage block_commit
2. take OS LVM snapshot
3. backup stage end
4. take VM snapshot later
5. remove OS LVM snapshot
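
Steps 1 to 3 of the procedure above can be sketched as a small script generator. This is a hedged illustration, not our actual tooling: the function name `build_snapshot_script` and the volume group, logical volume, and snapshot names are hypothetical. It assumes the statements are run in one session of the mariadb/mysql command-line client, using the client's `system` command to call `lvcreate`, because as far as I understand the backup stage is released when the connection that started it ends.

```python
def build_snapshot_script(vg, lv, snap_name, snap_size="1G"):
    """Build the statements for steps 1-3 as one client session.

    vg, lv, snap_name and snap_size are hypothetical placeholders for the
    LVM volume group, the logical volume holding the data directory, the
    snapshot name, and the snapshot size.
    """
    lines = [
        # Step 1: start the backup and block commits.
        "BACKUP STAGE START;",
        "BACKUP STAGE BLOCK_COMMIT;",
        # Step 2: take the LVM snapshot from inside the same session,
        # via the command-line client's "system" command.
        f"system lvcreate --snapshot --size {snap_size} "
        f"--name {snap_name} /dev/{vg}/{lv}",
        # Step 3: release the backup stage.
        "BACKUP STAGE END;",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example names only; adjust to the local LVM layout.
    print(build_snapshot_script("vg0", "mysql", "mysql_snap"), end="")
```

Steps 4 and 5 (the VM snapshot and removal of the LVM snapshot) happen outside the database session and are not covered here. The window between BLOCK_COMMIT and END is the period during which the node is desynced in 10.5.12.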

In our system (and some other RPO=0 systems), the few seconds for which Galera replication is paused during block_commit are acceptable, but a desync is not.

However, the new "backup stage block_commit" behavior results in RPO>0 until "backup stage end".

Is it possible to add a new backup stage command that keeps the original behavior of "backup stage block_commit"?
