MariaDB Server / MDEV-26391

mariabackup always triggers node desync

Details

    Description

      This is most probably a result of the fixes related to MDEV-23080, as this behavior appeared after updating to version 10.4.21.

      We run a Galera cluster of 3 nodes. Since the update, every backup performed with mariabackup produces "Member desyncs itself from group" during the final phase of the backup.

      Mariabackup is started with:

      mariabackup -u root -p PASSWORD --backup --galera-info --stream=xbstream --parallel 8 --use-memory=16G --socket=/var/run/mysqld/mysqld.sock --datadir=/var/lib/mysql 2>>/var/log/mariabackup_copy.log| /usr/bin/zstd --fast -T8 -q -o /home/mariabackup/backup.zst
      

      While creating the backup, the phase of streaming InnoDB data is followed by the phase of streaming non-InnoDB data. For example, from the mariabackup log:

      [00] 2021-08-17 02:54:31 Acquiring BACKUP LOCKS...
      [00] 2021-08-17 02:54:34 Starting to backup non-InnoDB tables and files
      ...
      [00] 2021-08-17 02:57:53 Finished backing up non-InnoDB tables and files
      [00] 2021-08-17 02:57:53 Waiting for log copy thread to read lsn 32481906458304
      ...
      [00] 2021-08-17 02:57:56 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
      [00] 2021-08-17 02:57:56 mariabackup: The latest check point (for incremental): '32481906568593'
      [00] 2021-08-17 02:57:56 Executing BACKUP STAGE END
      [00] 2021-08-17 02:57:56 All tables unlocked
      [00] 2021-08-17 02:57:56 Streaming ib_buffer_pool to <STDOUT>
      [00] 2021-08-17 02:57:56 Backup created in directory '/xtrabackup_backupfiles/'
      [00] 2021-08-17 02:57:56 MySQL binlog position: filename 'mariadb-bin.019684', position '421', GTID of the last change ''
      [00] 2021-08-17 02:57:56 Streaming backup-my.cnf
      [00] 2021-08-17 02:57:56 Streaming xtrabackup_info
      [00] 2021-08-17 02:57:56 Redo log (from LSN 32481655310193 to 32481906568602) was copied.
      [00] 2021-08-17 02:57:56 completed OK!
      

      Within this phase, .TRG, .PAR, .FRM and other metadata files are copied. Since the last update, the node starts to report self-desync as soon as the backup reaches this phase.
      The node remains desynced until the backup finishes and then synchronizes with the others.
      mysqld log:

      2021-08-17  2:54:34 18009831 [Note] WSREP: Desyncing and pausing the provider
      2021-08-17  2:54:34 0 [Note] WSREP: Member 0.0 (node2.localdomain) desyncs itself from group
      2021-08-17  2:54:34 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 3821784561)
      2021-08-17  2:54:34 18009831 [Note] WSREP: pause
      2021-08-17  2:54:34 18009831 [Note] WSREP: Provider paused at 0ca12340-****-****-****-******ed4dfb:3821784561 (30463042)
      2021-08-17  2:54:34 18009831 [Note] WSREP: Provider paused at: 3821784561
      2021-08-17  2:57:56 18009831 [Note] WSREP: Resuming and resyncing the provider
      2021-08-17  2:57:56 18009831 [Note] WSREP: resume
      2021-08-17  2:57:56 18009831 [Note] WSREP: resuming provider at 30463042
      2021-08-17  2:57:56 18009831 [Note] WSREP: Provider resumed.
      2021-08-17  2:57:56 0 [Note] WSREP: Member 0.0 (node2.localdomain) resyncs itself to group.
      2021-08-17  2:57:56 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 3821789518)
      2021-08-17  2:57:57 0 [Note] WSREP: Member 0.0 (node2.localdomain) synced with group.
      2021-08-17  2:57:57 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 3821789533)
      2021-08-17  2:57:57 24 [Note] WSREP: Server node2.localdomain synced with group
      

      Note the desync event right after the streaming of non-InnoDB files starts!

      As the dataset includes a significant number of databases (several thousand), the phase of streaming non-InnoDB data can take several minutes. The real length of this desynced phase depends on the number of tables the cluster handles and can be even longer than what we hit. For all this time the node remains desynced.

      There are two questions:

      • Is it really necessary to put the node into the Desynced state while streaming non-InnoDB data (considering that WSREP only replicates InnoDB transactions), and
      • is there any safe workaround that wouldn't turn the backup into an inconsistent set of data?

      We have been using mariabackup for a long time already, but there was no such behavior before.
      This issue neutralizes mariabackup's main advantage of being non-blocking and artificially decreases cluster availability. Can this be fixed without breaking the previous fixes for MDEV-23080?
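
      For reference, a minimal sketch for watching the state change from a second terminal while the backup runs (a root login and the default socket are assumed; the password is a placeholder). During the affected phase, wsrep_local_state_comment flips from Synced to Donor/Desynced and back:

          # poll the node state once per second while mariabackup is running
          while true; do
              mysql -u root -p"$PASSWORD" -N -e \
                  "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"
              sleep 1
          done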

      Attachments

        1. bt_all.txt
          78 kB
        2. error.log
          79 kB


          Activity

            euglorg Eugene created issue -
            euglorg Eugene made changes -
            stephanvos Stephan Vos added a comment -

            As far as I am aware, it's recommended to put the node on which the backup is taken into "desync/donor" mode before taking the backup.
            mysql -u admin --password=JUjmf5M6HNB8L4yy -e "SET GLOBAL wsrep_desync = ON;"

            That is what we do when we run our backup process.
            I was not aware that mariabackup would automatically do this?

            My question then is:
            Is it, or isn't it, recommended to desync the node before taking the backup?
            I would prefer it not to be required, because then we can stream and compress our backup all in one step instead of having to:
            1. desync
            2. backup
            3. resync
            4. compress and copy the backup

            We are doing compress and copy separately to minimize desync period.
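
            A minimal sketch of those four steps as a script (credentials, paths and thread counts are placeholders; the compression runs only after the node has resynced, which keeps the desync window short):

                # 1. desync the node so flow control does not hold back the rest of the cluster
                mysql -u admin -p"$PASSWORD" -e "SET GLOBAL wsrep_desync = ON;"

                # 2. back up to a local directory
                mariabackup --backup --user=root --password="$PASSWORD" --galera-info \
                    --target-dir=/home/mariabackup/full 2>>/var/log/mariabackup_copy.log

                # 3. resync as soon as the backup itself is done
                mysql -u admin -p"$PASSWORD" -e "SET GLOBAL wsrep_desync = OFF;"

                # 4. compress and copy the backup after the node is back in sync
                tar -C /home/mariabackup -cf - full | zstd --fast -T8 -q -o /home/mariabackup/backup.tar.zst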

            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 124361 ] MariaDB v4 [ 143098 ]
            elenst Elena Stepanova made changes -
            Fix Version/s 10.4 [ 22408 ]
            Assignee Jan Lindström [ jplindst ]
            julien.fritsch Julien Fritsch made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]
            jplindst Jan Lindström (Inactive) made changes -
            Assignee Jan Lindström [ jplindst ] Seppo Jaakola [ seppo ]
            valerii Valerii Kravchuk made changes -
            Priority Minor [ 4 ] Major [ 3 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            seppo Seppo Jaakola made changes -
            Status Confirmed [ 10101 ] In Progress [ 3 ]
            seppo Seppo Jaakola added a comment -

            It is true that MDEV-23080 changed the behavior: the node now switches to a desynced and paused state when mariabackup acquires the backup stage BLOCK_DDL lock. The reason for this was that DDL replication could interfere with backup processing in this backup stage, and by pausing the node it was guaranteed that no DDL was executed on the node. However, this blocked all DML on the node at the same time.

            It might be possible to optimize this behavior by one of the following alternatives:

            • refactor mariabackup to lift the requirement to block DDL
            • not pausing the node, but honoring the BLOCK_DDL locking. This would have the side effect that the cluster would freeze if any DDL happens during mariabackup
            • not pausing the node, but letting replication abort the ongoing backup process if it holds a conflicting BLOCK_DDL lock
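
            For orientation, a simplified sketch of the server-side stage sequence that mariabackup drives (shown here as one mysql call; the real tool holds these locks on its own connection while it copies files between the stages, and the details are not guaranteed to match every version). The desync and pause discussed in this issue happen around the DDL-blocking stage:

                mysql -u root -p"$PASSWORD" -e "
                    BACKUP STAGE START;        -- backup begins, redo log copying already runs
                    BACKUP STAGE FLUSH;        -- flush tables that are not in use
                    BACKUP STAGE BLOCK_DDL;    -- non-InnoDB files are copied here; DDL is blocked
                                               -- (with Galera, the node is desynced and paused here)
                    BACKUP STAGE BLOCK_COMMIT; -- short commit block while the final position is read
                    BACKUP STAGE END;          -- locks are released, the node resumes and resyncs
                "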
            seppo Seppo Jaakola made changes -
            Status In Progress [ 3 ] Needs Feedback [ 10501 ]
            valerii Valerii Kravchuk made changes -
            Status Needs Feedback [ 10501 ] Open [ 1 ]
            euglorg Eugene added a comment -

            IMHO, options 2 and 3 (which lead to a whole-cluster block or a backup failure as soon as a single conflicting DDL query runs) are a bad idea. For example, in recent days the "desync" phase during the backup has lasted several minutes. Having the cluster blocked for all this time can easily lead to client application downtime of a similar length. So the only way left is to adjust mariabackup itself.

            BTW, while performing the backup on a single-node cluster (but with the WSREP library loaded and active), the only node of the cluster is also put into the "desync" state during the backup, making client applications fail for the whole period the node remains desynced.

            seppo Seppo Jaakola made changes -
            Status Open [ 1 ] Needs Feedback [ 10501 ]
            valerii Valerii Kravchuk made changes -
            Status Needs Feedback [ 10501 ] Open [ 1 ]
            seppo Seppo Jaakola made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            jplindst Jan Lindström (Inactive) made changes -
            ralf.gebhardt Ralf Gebhardt made changes -
            Affects Version/s 10.4.21 [ 26030 ]
            Environment Linux 5.10.58-gentoo x86_64 AMD EPYC 7451
            Issue Type Bug [ 1 ] Task [ 3 ]
            ccalender Chris Calender (Inactive) made changes -
            Assignee Seppo Jaakola [ seppo ] Jan Lindström [ jplindst ]
            euglorg Eugene added a comment - - edited

            Looking at all this discussion, I'd now like to ask for advice.
            For technical reasons we have had to reduce the cluster to one node.
            We use wsrep_sst_method=mariabackup because, according to the documentation, it's the only method that allows SST without taking the donor node down, as it really was.
            However, with the MariaDB behavior we have now, it is impossible to add another node to the cluster without significant downtime; for instance, today it was 15 minutes:

            2022-08-18  6:28:39 379299 [Note] WSREP: Desyncing and pausing the provider
            2022-08-18  6:28:39 379299 [Note] WSREP: pause
            2022-08-18  6:28:39 379299 [Note] WSREP: Provider paused at 0ca58cd7-821e-11ea-aec3-23d843ed4dfb:8533940816 (1818998)
            2022-08-18  6:28:39 379299 [Note] WSREP: Provider paused at: 8533940816
            2022-08-18  6:45:53 379299 [Note] WSREP: Resuming and resyncing the provider
            2022-08-18  6:45:53 379299 [Note] WSREP: resume
            2022-08-18  6:45:53 379299 [Note] WSREP: resuming provider at 1818998
            2022-08-18  6:45:53 379299 [Note] WSREP: Provider resumed.
            2022-08-18  6:45:56 0 [Note] WSREP: SST sent: 0ca58cd7-821e-11ea-aec3-23d843ed4dfb:8532294770
            2022-08-18  6:45:56 0 [Note] WSREP: Server status change donor -> joined
            2022-08-18  6:45:56 0 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
            WSREP_SST: [INFO] Total time on donor: 0 seconds (20220818 06:45:56.210)
            WSREP_SST: [INFO] mariabackup SST completed on donor (20220818 06:45:56.231)
            WSREP_SST: [INFO] Cleaning up temporary directories (20220818 06:45:56.247)
            2022-08-18  6:45:56 0 [Note] WSREP: Donor monitor thread ended with total time 20994 sec
            

            This was the only node in the cluster, so the cluster and everything relying on SQL was simply offline for the whole 15 minutes from 6:28 till 6:45. So mariabackup is no longer a method that lets one join another node without stopping the donating one.
            The question is: is there any method to join a node to the cluster without downtime, or has there been no such method since the date this bug was reported?
            Again, before this bug appeared we performed SST many times without problems or related application downtime, until it got broken a year ago. Unfortunately, since that day every backup run is a problem.
            Can anyone advise a workaround if this can't be fixed?
            Maybe there's some better SST method that can be used without guaranteed downtime?

            ccalender Chris Calender (Inactive) made changes -
            Fix Version/s 10.4 [ 22408 ]
            stephanvos Stephan Vos added a comment - - edited

            @Seppo will this new flag also allow SST not to block the donor node, as was the case before 10.5.12?

            stephanvos Stephan Vos added a comment -

            How does the current situation affect the SST method = MariaBackup? Will it freeze the donor node?
            Any chance this feature will make it into the next release of 10.5/10.6?

            jplindst Jan Lindström (Inactive) made changes -
            Fix Version/s 10.4 [ 22408 ]
            violuke Luke Cousins added a comment -

            Yes, it will freeze the donor node; it's pretty serious! We've found that if you can restart the donor node before making the SST, you can then complete the SST without a problem. This often means needing a minute of downtime, as you find yourself all the way down to 1 node and a fresh bootstrap. Not good really, and we are very much looking forward to this being fixed.

            stephanvos Stephan Vos added a comment -

            Thanks for confirming, Luke.
            Yes, this is not good and will be a serious problem for us as well.

            stephanvos Stephan Vos added a comment - - edited

            I just did a test with a 2-node cluster, both nodes on 10.5.17.

            I stopped node 2, removed the datadir and then started it again, after which it proceeded to do an SST via MariaBackup.
            Node 1 (donor) showed wsrep_local_state_comment=Donor/Desynced, but I could insert into tables while it was busy being the donor, so it was actually not blocking like I expected it to!

            And the SST was successful.

            I also tested a backup and do see the mentioned:
            "WSREP: Desyncing and pausing the provider" at the end.
            However in our case this only lasted a couple of seconds as we don't have any non-InnoDB tables.
            I also created a table during this backup which was picked up by the DDL monitor thread:
            [00] 2022-09-27 19:35:03 DDL tracking : create 11830 "./opmon/svt.ibd"

            Furthermore I have always issued:
            SET GLOBAL wsrep_desync = ON
            before executing mariabackup as was recommended to us when we started using it in 2019.
            I have never seen this cause any issues and as per my test above it does allow DML to take place.

            euglorg Eugene added a comment -

            The impact really depends on how busy the cluster is.
            The problem is when you have a big dataset and intensive writes.
            In our case SST takes 3-5 hours. For all this time the node is desynced but serves requests, even creating new databases. The problem is the final phase. The longer SST takes, the longer the period between the "Desyncing and pausing the provider" and "resuming" events will be. In case there are several nodes in the cluster already, the donor will trigger flow control. But in case the cluster has just bootstrapped and the donating node is the only consistent one (say, the second one has its datadir wiped for some reason), for all the time between "pausing" and "resuming" the joining node is not yet ready to serve requests, and the donating one has the provider paused. As a result, all requests are stuck and in fact all applications relying on MariaDB are down. The longer the SST lasts, the longer the downtime is.
            Can't test the 10.5 branch at the moment, but the latest 10.4 has this behavior, so SST becomes something of a disaster.

            Also, I am really not sure whether it's safe to set "wsrep_desync = ON" on a cluster that has a node paused for 3-5 minutes and an amount of writes that would trigger node desync within a few seconds. If writes are not intensive, one can pause any component for any time safely. Will the donor perform IST once mariabackup is done?

            jplindst Jan Lindström (Inactive) added a comment - - edited

            ramesh Can you do testing for this fix:

            • branch: bb-10.6-MDEV-26391-galera
            • Using defaults, set load on both donor and joiner, start a backup, and verify that the donor desyncs.
            • Using defaults, set load on both donor and joiner, start DDL and then a backup, and verify that the donor desyncs and that the DDL does not crash.
            • Using SET GLOBAL wsrep_mode='BF_ABORT_MARIABACKUP', set load on both donor and joiner, start a backup, and verify that the donor does not desync.
            • Using SET GLOBAL wsrep_mode='BF_ABORT_MARIABACKUP', set load on both donor and joiner, start DDL (if possible) and a backup, and verify that the backup fails (hopefully with some clear error message).
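
            A minimal sketch of how a backup run with the new mode could look, assuming the variable name used above (note that SET GLOBAL wsrep_mode=... replaces the whole current wsrep_mode value; paths and credentials are placeholders):

                # enable the mode so replication BF-aborts the backup instead of pausing the provider
                mysql -u root -p"$PASSWORD" -e "SET GLOBAL wsrep_mode='BF_ABORT_MARIABACKUP';"

                # run the backup; the donor is expected not to desync, but the backup
                # is expected to fail if a conflicting DDL is replicated while it runs
                mariabackup --backup --user=root --password="$PASSWORD" \
                    --target-dir=/home/mariabackup/full

                # restore the default (empty) wsrep_mode afterwards
                mysql -u root -p"$PASSWORD" -e "SET GLOBAL wsrep_mode=DEFAULT;"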
            jplindst Jan Lindström (Inactive) made changes -
            Status In Progress [ 3 ] In Testing [ 10301 ]
            jplindst Jan Lindström (Inactive) made changes -
            Assignee Jan Lindström [ jplindst ] Ramesh Sivaraman [ JIRAUSER48189 ]
            stephanvos Stephan Vos added a comment -

            Eugene
            From my understanding, all "wsrep_desync = ON" does is disable flow control so that the remaining nodes are not held up by the backup.
            The DONOR/node performing the backup will still accept and apply writesets.
            However, I think that "WSREP: Desyncing and pausing the provider" at the end of the backup/SST does more than that and actually pauses replication, which will be a problem for client applications.

            As to IST, I would expect that the DONOR does not need to perform it, as it applies writesets during SST.
            At the end of the SST the DONOR node displays "WSREP: async IST sender served" (not sure why, as it performed an SST and there was no indication that an IST was performed after the SST).

            euglorg Eugene added a comment - - edited

            According to the documentation, flow control is triggered when the node's buffer is full and it can't accept more writesets. So with "wsrep_desync = ON" it will not trigger flow control (and thus will not pause replication in the cluster), but new writesets can't be put into the buffer, so they can't be accepted and applied; thus the node desyncs and falls out of the cluster, and the client application should check data consistency on such a node. So this would also be a problem, and at least an IST will be needed for the node to catch up. This is the case when you are performing a backup from the node.

            In case it's the only consistent node in the cluster and you are joining a second one, the flow control event doesn't happen, as in fact there's no other node that could create writesets. Instead, you will have client-handling threads stuck, so the number of connected clients and threads will increase up to the maximum number of permitted client connections, and then the node simply stops accepting new connections and requests.

            And yes, the "Desyncing and pausing the provider" event renders the node unusable for client applications until it resumes. Unfortunately, this has already caused a considerable amount of downtime.
            By the way, there was no problem and no downtime caused by SST before the "fix" in 10.4.21 that forced me to report this bug.

            stephanvos Stephan Vos added a comment -

            I don't completely agree that a desynced/donor node cannot apply writesets.
            That only seems to be the case when flow control kicks in or the node is paused (at the end of the backup, for example).
            Or have I misunderstood your comment with regard to desync?

            https://www.percona.com/blog/2016/11/16/all-you-need-to-know-about-gcache-galera-cache/
            What if one of the node is DESYNCED and PAUSED?
            If a node desyncs, it will continue to received write-sets and apply them, so there is no major change in gcache handling.
            If the node is desynced and paused, that means the node can’t apply write-sets and needs to keep caching them. This will, of course, affect the desynced/paused node and the node will continue to create on-demand page store. Since one of the cluster nodes can’t proceed, it will not emit a “last committed” message. In turn, other nodes in the cluster (that can purge the entry) will continue to retain the write-sets, even if these nodes are not desynced and paused.

            I did a test again now to confirm the above article:
            1. "SET GLOBAL wsrep_desync = ON" on node2 (wsrep_local_state_comment = Donor/Desynced)
            2. Update a record in a table on node1
            3. Select from table on node2 and confirmed that the update has been applied
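
            As a sketch, the same check scripted against two nodes (hostnames, credentials and the table test.t1 are placeholders):

                # 1. desync node2 and confirm its state
                mysql -h node2 -u admin -p"$PASSWORD" -e "SET GLOBAL wsrep_desync = ON;"
                mysql -h node2 -u admin -p"$PASSWORD" -N -e \
                    "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"

                # 2. update a record on node1
                mysql -h node1 -u admin -p"$PASSWORD" -e \
                    "UPDATE test.t1 SET val = val + 1 WHERE id = 1;"

                # 3. confirm on node2 that the writeset was applied despite the desync
                mysql -h node2 -u admin -p"$PASSWORD" -e \
                    "SELECT val FROM test.t1 WHERE id = 1;"

                # revert
                mysql -h node2 -u admin -p"$PASSWORD" -e "SET GLOBAL wsrep_desync = OFF;"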

            euglorg Eugene added a comment -

            If a node is just desynced, it does apply writesets and processes requests. The only thing you can't do on a desynced node is start mariabackup. In case the node is donating, running mariabackup will make the node report "WSREP not ready".
            The state is "desynced", but the node usually still has consistent data and processes requests.
            The problem happens when the node a) runs mariabackup (which normally doesn't trigger the "desynced" state) or performs SST (which always triggers the "desynced" state, but the node still processes requests and participates in replication) and b) accepts writesets faster than it can apply them, for example when a big number of small tables or databases is backed up or sent to the joining node. In this case the node will either trigger flow control (replication will be paused, causing writing threads to get stuck on the whole cluster) or, if for some reason the node is forced not to trigger the flow control event, it will fall behind the cluster with inconsistent data and might require an IST from a consistent one.
            In case the node is the only one in the cluster, it will process queries anyway during the SST or backup, regardless of the "donor/desynced" state, until it reaches the "Desyncing and pausing the provider" event. The node can be slow and writing threads may be stuck for periods, but they will still be processed until the "Desyncing and pausing the provider" event.
            But between the "Desyncing and pausing the provider" and "resuming" events no requests are processed, and for clients MariaDB is completely down. This is the problem I was initially talking about: the gap can be over 10 minutes, which is critical for client applications. And this event did not happen during SST or backup before 10.4.21.
            In fact, the question of whether to set "wsrep_desync = ON" is a completely different thing. Its name suggests "desyncing", but in fact it is applied to a node that is already desynced. Sorry for the miscommunication.
            So, the question now is: is there a way to avoid that "pausing" of the node, and thus the client application downtime, after all the changes made and discussed here?

            stephanvos Stephan Vos added a comment -

            OK, now we are on the same page.
            It seems then that perhaps I do not need to set "wsrep_desync = ON" during our backup process, as the write/read load is quite low at the time of the backup.

            I would expect that this new feature, SET GLOBAL wsrep_mode='BF_ABORT_MARIABACKUP', would bypass the "pause" at the end of the backup and instead just abort the backup if DDL (the original reason for implementing the pause) is performed on the donor/backup source during the backup process.
            This does mean that during SST you still would not want any DDL to happen on the cluster, as the SST will otherwise abort.

            @Jan
            My question then would be: will this wsrep_mode setting also take effect during SST? I would assume this to be the case.

            Also, would using an older version of mariabackup (pre 10.4.21/10.5.12) perhaps be a viable interim option? Not sure if this is possible.


            ramesh Ramesh Sivaraman added a comment -

            jplindst All test cases look good in the bug fix branch. The backup failed in the last test case with the following message; the backup failed while reading the LSN after copying all tables.

            [00] 2022-10-03 08:15:53 Finished backing up non-InnoDB tables and files
            [01] 2022-10-03 08:15:53 Copying ./aria_log_control to /home/vagrant/backup/aria_log_control
            [01] 2022-10-03 08:15:53         ...done
            [01] 2022-10-03 08:15:53 Copying ./aria_log.00000001 to /home/vagrant/backup/aria_log.00000001
            [01] 2022-10-03 08:15:53         ...done
            [00] FATAL ERROR: 2022-10-03 08:15:53 failed to execute query SELECT COUNT(*) FROM information_schema.plugins WHERE plugin_name='rocksdb': Server has gone away
            vagrant@node1:~$ 
            

            ramesh Ramesh Sivaraman made changes -
            Assignee Ramesh Sivaraman [ JIRAUSER48189 ] Jan Lindström [ jplindst ]
            Status In Testing [ 10301 ] Stalled [ 10000 ]
            ramesh Ramesh Sivaraman added a comment - - edited

            jplindst seppo When we run an RQG load (OLTP DDL load) and enable wsrep_mode='BF_ABORT_MARIABACKUP', the backup fails with the following error:

            [00] 2022-10-06 10:25:03 Copying ./test/oltp14#P#p0.ibd to /test/mtest/GAL_MD061022-mariadb-10.6.11-linux-x86_64-opt/backup/test/oltp14#P#p0.new
            [00] 2022-10-06 10:25:03         ...done
            [00] 2022-10-06 10:25:03 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
            [00] FATAL ERROR: 2022-10-06 10:25:03 failed to execute query FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS: Server has gone away
            

            test case

            perl gendata.pl --dsn=dbi:mysql:host=127.0.0.1:port=11160:socket=/test/mtest/GAL_MD061022-mariadb-10.6.11-linux-x86_64-opt/node1/node1_socket.sock:user=root:database=test --spec=conf/mariadb/oltp.zz
             
            perl gentest.pl --dsn=dbi:mysql:host=127.0.0.1:port=11160:socket=/test/mtest/GAL_MD061022-mariadb-10.6.11-linux-x86_64-opt/node1/node1_socket.sock:user=root:database=test --grammar=conf/mariadb/oltp_and_ddl.yy --threads=32 --duration=1000 --queries=100000000 &
             
            Initiate backup
             
            /test/mtest/GAL_MD061022-mariadb-10.6.11-linux-x86_64-opt/bin/mariabackup  --backup --user='root' --socket='/test/mtest/GAL_MD061022-mariadb-10.6.11-linux-x86_64-opt/node1/node1_socket.sock' --target-dir='/test/mtest/GAL_MD061022-mariadb-10.6.11-linux-x86_64-opt/backup' --datadir=/test/mtest/GAL_MD061022-mariadb-10.6.11-linux-x86_64-opt/node1
            

            Attaching error.log and the backtrace bt_all.txt from the running process.

            ramesh Ramesh Sivaraman made changes -
            Attachment bt_all.txt [ 65590 ]
            Attachment error.log [ 65591 ]
            ramesh Ramesh Sivaraman made changes -
            Assignee Jan Lindström [ jplindst ] Seppo Jaakola [ seppo ]
            seppo Seppo Jaakola made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            seppo Seppo Jaakola added a comment -

            When using the new mode, wsrep_mode='BF_ABORT_MARIABACKUP', mariabackup execution is supposed to be aborted if mariabackup and replication encounter a DDL-level conflict. And this now happens in the above RQG testing. What would be the desired result of this test, taking into account that mariabackup must yield one way or the other?

            seppo Seppo Jaakola added a comment -

            Looking more closely at the timestamps of the cluster pause duration in the original description:

            2021-08-17 2:54:34 18009831 [Note] WSREP: Desyncing and pausing the provider
            ...
            2021-08-17 2:57:56 18009831 [Note] WSREP: Resuming and resyncing the provider

            gives ~3.5 minutes, which IMO is too long; the DDL blocking should be short term. Currently mariabackup enters the BLOCK_DDL stage early, and this DDL-blocking state is not lifted until the backup is complete. mariabackup should be investigated to see whether it is possible to release the DDL blocking earlier, before the actual backup end stage.


            ramesh Ramesh Sivaraman added a comment -

            seppo In the above test case, I want to confirm that the backup failed correctly when wsrep_mode='BF_ABORT_MARIABACKUP' is set.

            ramesh Ramesh Sivaraman added a comment - - edited

            seppo I couldn't see an unusual cluster pause duration in 10.4 on a local box. Used RQG and sysbench for the DDL/OLTP load:

            2022-12-06 13:04:43 77 [Note] WSREP: Desyncing and pausing the provider
            2022-12-06 13:04:43 0 [Note] WSREP: Member 1.0 (galapq) desyncs itself from group
            2022-12-06 13:04:43 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 90431)
            2022-12-06 13:04:43 53 [Note] Detected table cache mutex contention at instance 1: 38% waits. Additional table cache instance activated. Number of instances after activation: 2.
            2022-12-06 13:04:43 77 [Note] WSREP: pause
            2022-12-06 13:04:45 77 [Note] WSREP: Provider paused at 9d84373d-7553-11ed-b8cf-db82095f3faf:90508 (12873)
            2022-12-06 13:04:45 77 [Note] WSREP: Provider paused at: 90508
            2022-12-06 13:04:47 77 [Note] WSREP: Resuming and resyncing the provider
            2022-12-06 13:04:47 77 [Note] WSREP: resume
            2022-12-06 13:04:47 77 [Note] WSREP: resuming provider at 12873
            2022-12-06 13:04:47 77 [Note] WSREP: Provider resumed.
            2022-12-06 13:04:47 0 [Note] WSREP: Member 1.0 (galapq) resyncs itself to group.
            2022-12-06 13:04:47 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 90730)
            2022-12-06 13:04:47 0 [Note] WSREP: Processing event queue:... 0.0% ( 0/186 events) complete.
            2022-12-06 13:04:48 69 [Note] Detected table cache mutex contention at instance 2: 66% waits. Additional table cache instance activated. Number of instances after activation: 3.
            2022-12-06 13:04:57 0 [Warning] WSREP: Failed to report last committed 9d84373d-7553-11ed-b8cf-db82095f3faf:91080, -110 (Connection timed out)

            seppo Seppo Jaakola made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            jplindst Jan Lindström (Inactive) made changes -
            Assignee Seppo Jaakola [ seppo ] Jan Lindström [ jplindst ]
            jplindst Jan Lindström (Inactive) made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            jplindst Jan Lindström (Inactive) made changes -
            issue.field.resolutiondate 2023-01-17 09:38:58.0 2023-01-17 09:38:58.972
            jplindst Jan Lindström (Inactive) made changes -
            Fix Version/s 10.6.12 [ 28513 ]
            Fix Version/s 10.7.8 [ 28515 ]
            Fix Version/s 10.9.5 [ 28519 ]
            Fix Version/s 10.10.3 [ 28521 ]
            Fix Version/s 10.4 [ 22408 ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]
            stephanvos Stephan Vos added a comment -

            @Eugene
            Have your backup problems been solved and are you using this new parameter?
            This parameter does not seem to be present in the 10.5 series, but we are still using wsrep_desync=ON as part of the backup and have not encountered any issues in our case.

            euglorg Eugene added a comment -

            We remain on the default settings, so the backup still causes the node to desync for ~5 minutes:

            2023-06-16  4:07:44 20156184 [Note] WSREP: Desyncing and pausing the provider
            2023-06-16  4:07:44 0 [Note] WSREP: Member 0.0 (host_c) desyncs itself from group
            2023-06-16  4:07:44 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 1056620529)
            2023-06-16  4:07:44 20156184 [Note] WSREP: pause
            2023-06-16  4:07:44 20156184 [Note] WSREP: Provider paused at eebbcc99-e2e2-1111-8484-ffdd99eb2a58:1056620529 (51043020)
            2023-06-16  4:07:44 20156184 [Note] WSREP: Provider paused at: 1056620529
            2023-06-16  4:12:50 20156184 [Note] WSREP: Resuming and resyncing the provider
            2023-06-16  4:12:50 20156184 [Note] WSREP: resume
            2023-06-16  4:12:50 20156184 [Note] WSREP: resuming provider at 51043020
            2023-06-16  4:12:50 20156184 [Note] WSREP: Provider resumed.
            2023-06-16  4:12:50 0 [Note] WSREP: Member 0.0 (host_c) resyncs itself to group.
            2023-06-16  4:12:50 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 1056652398)
            2023-06-16  4:12:50 0 [Note] WSREP: Processing event queue:...  0.0% (    0/31822 events) complete.
            2023-06-16  4:13:00 30 [Note] WSREP: Processing event queue:... 48.6% (16720/34391 events) complete.
            2023-06-16  4:13:05 0 [Note] WSREP: Member 0.0 (host_c) synced with group.
            2023-06-16  4:13:05 0 [Note] WSREP: Processing event queue:...100.0% (35590/35590 events) complete.
            2023-06-16  4:13:05 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 1056655266)
            2023-06-16  4:13:05 2 [Note] WSREP: Server host_c synced with group
            

            Fortunately, there has been no need to run the cluster on a single node for long periods in the last months, so the other nodes simply handle application requests while one is performing the backup.
            However, an option to stop the backup if only one node is running and a DDL query is issued might be good, so we will probably use it in the future.

            ehontz Eric Hontz added a comment -

            I tried out the new wsrep_mode = BF_ABORT_MARIABACKUP setting and found that, when using wsrep_sst_method = mariabackup, a node never returns from Donor/Desynced to Synced after serving as an SST donor.

            Please see my comment on the commit: https://github.com/MariaDB/server/commit/95de5248c7f59f96039f96f5442142c79da27b20#r121407370


            janlindstrom Jan Lindström added a comment -

            ehontz I looked at it and do not yet see a problem. Can you open a new bug and provide the error logs from all nodes, the node configuration and, if you can, the processlist from the donor?

            ehontz Eric Hontz added a comment -

            @janlindstrom,
            I will open a new bug and provide details.

            I'm able to reliably reproduce using a docker-compose environment.

            ehontz Eric Hontz added a comment -

            @janlindstrom: I opened MDEV-31737

            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 180160 115217 162633

            People

              Assignee: jplindst Jan Lindström (Inactive)
              Reporter: euglorg Eugene
              Votes: 8
              Watchers: 17
