This is most probably a result of the fixes related to MDEV-23080, as this behavior appeared after updating to version 10.4.21.
We run a Galera cluster of 3 nodes. After the update, every backup performed with mariabackup produces "Member desyncs itself from group" during the final phase of the backup.
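For context, the backups in question are streamed incremental mariabackup runs; the exact command line is not part of this report, so the user, paths, stream format and LSN placeholder below are only illustrative:

# hypothetical invocation of a streamed incremental backup (all values are placeholders)
mariabackup --backup \
  --user=backup_user --password='...' \
  --target-dir=/xtrabackup_backupfiles \
  --stream=xbstream \
  --incremental-lsn=<checkpoint_lsn_of_previous_backup> \
  > /backups/inc_$(date +%F).xb

The mariabackup log itself completes without errors: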
[00] 2021-08-17 02:57:56 mariabackup: The latest check point (for incremental): '32481906568593'
[00] 2021-08-17 02:57:56 Executing BACKUP STAGE END
[00] 2021-08-17 02:57:56 All tables unlocked
[00] 2021-08-17 02:57:56 Streaming ib_buffer_pool to <STDOUT>
[00] 2021-08-17 02:57:56 Backup created in directory '/xtrabackup_backupfiles/'
[00] 2021-08-17 02:57:56 MySQL binlog position: filename 'mariadb-bin.019684', position '421', GTID of the last change ''
[00] 2021-08-17 02:57:56 Streaming backup-my.cnf
[00] 2021-08-17 02:57:56 Streaming xtrabackup_info
[00] 2021-08-17 02:57:56 Redo log (from LSN 32481655310193 to 32481906568602) was copied.
[00] 2021-08-17 02:57:56 completed OK!
Within this phase, .TRG, .PAR, .FRM and other metadata files are copied. But since the last update the node starts to report self-desync immediately when the backup reaches this phase.
The node remains desynced until the backup finishes and then synchronizes with the others.
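For reference, this phase is driven by the BACKUP STAGE statements; a rough manual sketch of the sequence mariabackup goes through (the copy step is a placeholder, not the actual implementation) looks like this:

# simplified sketch of the BACKUP STAGE sequence; the copy command is a placeholder
mariadb -e "BACKUP STAGE START"         # backup begins, InnoDB data files can be copied
mariadb -e "BACKUP STAGE FLUSH"         # tables are flushed so on-disk copies are consistent
mariadb -e "BACKUP STAGE BLOCK_DDL"     # DDL is blocked; .frm/.TRG/.PAR and other metadata are copied under this stage
# ... copy of non-InnoDB and metadata files happens here ...
mariadb -e "BACKUP STAGE BLOCK_COMMIT"  # commits are blocked briefly; final redo is copied
mariadb -e "BACKUP STAGE END"           # everything is released ("All tables unlocked" in the log above)

The desync described below spans this metadata copy, from the moment it starts until BACKUP STAGE END.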
mysqld log:
2021-08-17 2:54:34 18009831 [Note] WSREP: Desyncing and pausing the provider
2021-08-17 2:54:34 0 [Note] WSREP: Member 0.0 (node2.localdomain) desyncs itself from group
2021-08-17 2:57:57 24 [Note] WSREP: Server node2.localdomain synced with group
Note the desync event just after streaming of non-InnoDB files begins!
As the dataset includes a significant number of databases (several thousand), the phase of streaming non-InnoDB data can take several minutes; the real length of this desynced phase depends on the number of tables the cluster handles and can be even longer than what we hit. For all this time the node remains desynced.
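A quick way to gauge how much metadata that phase has to stream (purely illustrative; the excluded schema names are just the usual system ones) is to count tables per engine:

# every table contributes at least a .frm file to the metadata copy phase
mariadb -N -e "SELECT engine, COUNT(*) FROM information_schema.tables
               WHERE table_schema NOT IN ('mysql','information_schema','performance_schema')
               GROUP BY engine"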
There are two questions:
Is it really necessary to put the node into the Desynced state while streaming non-InnoDB data (considering the fact that WSREP only replicates InnoDB transactions), and
is there any safe workaround for this that wouldn't render the backup an inconsistent set of data?
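For what it's worth, the state change is easy to observe from the outside while a backup runs; a minimal sketch using the standard Galera status variables:

# poll the node state once per second during the backup;
# it shows Synced -> Donor/Desynced for the whole non-InnoDB copy phase
while true; do
  mariadb -N -e "SHOW GLOBAL STATUS WHERE Variable_name IN
                 ('wsrep_local_state_comment','wsrep_cluster_size')"
  sleep 1
done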
We have been using mariabackup for a long time already, but there was no such behavior before.
Have to say that this issue neutralizes the main mariabackup advantage of being non-blocking and artificially decreases cluster availability. Can this be fixed without breaking the previous fixes for MDEV-23080?
Eric Hontz
added a comment - @janlindstrom,
I will open a new bug and provide details.
I'm able to reliably reproduce using a docker-compose environment.
Jan Lindström
added a comment - ehontz I looked at it and do not yet see a problem. Can you open a new bug and provide error logs from all nodes, the node configuration and, if you can, the processlist from the donor.
Eric Hontz
added a comment - I tried out the new wsrep_mode = BF_ABORT_MARIABACKUP setting and found that, when using wsrep_sst_method = mariabackup, a node never returns from Donor/Desynced to Synced after serving as an SST donor.
Please see my comment on the commit: https://github.com/MariaDB/server/commit/95de5248c7f59f96039f96f5442142c79da27b20#r121407370
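For reference, the configuration under test looked roughly like this; only the two wsrep settings come from the comment above, while the file path and the rest of the snippet are assumptions:

# append the options under test to the node's server config (path is an assumption), then restart
cat >> /etc/my.cnf.d/server.cnf <<'EOF'
[mysqld]
wsrep_mode = BF_ABORT_MARIABACKUP
wsrep_sst_method = mariabackup
EOF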
Eugene
added a comment - We remain on default settings, so a backup still causes a node desync for ~5 minutes:
2023-06-16 4:07:44 20156184 [Note] WSREP: Desyncing and pausing the provider
2023-06-16 4:07:44 0 [Note] WSREP: Member 0.0 (host_c) desyncs itself from group
2023-06-16 4:07:44 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 1056620529)
2023-06-16 4:07:44 20156184 [Note] WSREP: pause
2023-06-16 4:07:44 20156184 [Note] WSREP: Provider paused at eebbcc99-e2e2-1111-8484-ffdd99eb2a58:1056620529 (51043020)
2023-06-16 4:07:44 20156184 [Note] WSREP: Provider paused at: 1056620529
2023-06-16 4:12:50 20156184 [Note] WSREP: Resuming and resyncing the provider
2023-06-16 4:12:50 20156184 [Note] WSREP: resume
2023-06-16 4:12:50 20156184 [Note] WSREP: resuming provider at 51043020
2023-06-16 4:12:50 20156184 [Note] WSREP: Provider resumed.
2023-06-16 4:12:50 0 [Note] WSREP: Member 0.0 (host_c) resyncs itself to group.
2023-06-16 4:12:50 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 1056652398)
2023-06-16 4:12:50 0 [Note] WSREP: Processing event queue:... 0.0% (0/31822 events) complete.
2023-06-16 4:13:00 30 [Note] WSREP: Processing event queue:... 48.6% (16720/34391 events) complete.
2023-06-16 4:13:05 0 [Note] WSREP: Member 0.0 (host_c) synced with group.
2023-06-16 4:13:05 0 [Note] WSREP: Processing event queue:... 100.0% (35590/35590 events) complete.
2023-06-16 4:13:05 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 1056655266)
2023-06-16 4:13:05 2 [Note] WSREP: Server host_c synced with group
Fortunately, there has been no need to run the cluster on a single node for long periods in the last months, so the other nodes simply handle application requests while one is performing the backup.
However, an option to stop the backup if there's only one node running and a DDL query is issued might be good, so we will probably use it in the future.
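To measure that desync window without reading the whole log by hand (the log path is an assumption), grepping for the pause/resume pair is enough:

# the gap between these two timestamps is roughly how long the provider stayed paused
grep -E 'Desyncing and pausing the provider|Resuming and resyncing the provider' \
  /var/log/mysql/mysqld.log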
@janlindstrom: I opened MDEV-31737.