MariaDB Server / MDEV-26391

mariabackup always triggers node desync

Details

    Description

      This is most probably a result of the fixes related to MDEV-23080, as this behavior appeared after updating to version 10.4.21.

      We run a Galera cluster of 3 nodes. Since the update, every backup performed with mariabackup produces "Member desyncs itself from group" during the final phase of the backup.

      Mariabackup is started with:

      mariabackup -u root -p PASSWORD --backup --galera-info --stream=xbstream --parallel 8 --use-memory=16G --socket=/var/run/mysqld/mysqld.sock --datadir=/var/lib/mysql 2>>/var/log/mariabackup_copy.log| /usr/bin/zstd --fast -T8 -q -o /home/mariabackup/backup.zst
      

      While creating the backup, the phase of streaming InnoDB data is followed by the phase of streaming non-InnoDB data. For example, from the mariabackup log:

      [00] 2021-08-17 02:54:31 Acquiring BACKUP LOCKS...
      [00] 2021-08-17 02:54:34 Starting to backup non-InnoDB tables and files
      ...
      [00] 2021-08-17 02:57:53 Finished backing up non-InnoDB tables and files
      [00] 2021-08-17 02:57:53 Waiting for log copy thread to read lsn 32481906458304
      ...
      [00] 2021-08-17 02:57:56 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
      [00] 2021-08-17 02:57:56 mariabackup: The latest check point (for incremental): '32481906568593'
      [00] 2021-08-17 02:57:56 Executing BACKUP STAGE END
      [00] 2021-08-17 02:57:56 All tables unlocked
      [00] 2021-08-17 02:57:56 Streaming ib_buffer_pool to <STDOUT>
      [00] 2021-08-17 02:57:56 Backup created in directory '/xtrabackup_backupfiles/'
      [00] 2021-08-17 02:57:56 MySQL binlog position: filename 'mariadb-bin.019684', position '421', GTID of the last change ''
      [00] 2021-08-17 02:57:56 Streaming backup-my.cnf
      [00] 2021-08-17 02:57:56 Streaming xtrabackup_info
      [00] 2021-08-17 02:57:56 Redo log (from LSN 32481655310193 to 32481906568602) was copied.
      [00] 2021-08-17 02:57:56 completed OK!
      

      Within this phase, .TRG, .PAR, .FRM and other metadata files are copied. Since the last update, the node starts to report self-desync as soon as the backup reaches this phase.
      The node remains desynced until the backup finishes and then synchronizes with the others.
      mysqld log:

      2021-08-17  2:54:34 18009831 [Note] WSREP: Desyncing and pausing the provider
      2021-08-17  2:54:34 0 [Note] WSREP: Member 0.0 (node2.localdomain) desyncs itself from group
      2021-08-17  2:54:34 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 3821784561)
      2021-08-17  2:54:34 18009831 [Note] WSREP: pause
      2021-08-17  2:54:34 18009831 [Note] WSREP: Provider paused at 0ca12340-****-****-****-******ed4dfb:3821784561 (30463042)
      2021-08-17  2:54:34 18009831 [Note] WSREP: Provider paused at: 3821784561
      2021-08-17  2:57:56 18009831 [Note] WSREP: Resuming and resyncing the provider
      2021-08-17  2:57:56 18009831 [Note] WSREP: resume
      2021-08-17  2:57:56 18009831 [Note] WSREP: resuming provider at 30463042
      2021-08-17  2:57:56 18009831 [Note] WSREP: Provider resumed.
      2021-08-17  2:57:56 0 [Note] WSREP: Member 0.0 (node2.localdomain) resyncs itself to group.
      2021-08-17  2:57:56 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 3821789518)
      2021-08-17  2:57:57 0 [Note] WSREP: Member 0.0 (node2.localdomain) synced with group.
      2021-08-17  2:57:57 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 3821789533)
      2021-08-17  2:57:57 24 [Note] WSREP: Server node2.localdomain synced with group
      

      Note the desync event right after the streaming of non-InnoDB files starts!

      As the dataset includes a significant number of databases (several thousand), the phase of streaming non-InnoDB data can take several minutes. The real length of this desynced phase depends on the number of tables the cluster handles and can be even longer than what we hit. For all this time the node remains desynced.

      There are two questions:

      • Is it really necessary to put the node into the Desynced state while streaming non-InnoDB data (considering that WSREP only replicates InnoDB transactions), and
      • is there any safe workaround that wouldn't turn the backup into an inconsistent set of data?

      We have been using mariabackup for a long time already, but there was no such behavior before.
      This issue neutralizes mariabackup's main advantage of being non-blocking and artificially decreases cluster availability. Can this be fixed without breaking the previous fixes for MDEV-23080?
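
      For reference, a minimal sketch for watching the state change from a second terminal while the backup runs (a root login and the default socket are assumed; the password is a placeholder). During the affected phase, wsrep_local_state_comment flips from Synced to Donor/Desynced and back:

          # poll the node state once per second while mariabackup is running
          while true; do
              mysql -u root -p"$PASSWORD" -N -e \
                  "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"
              sleep 1
          done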

      Attachments

        1. bt_all.txt
          78 kB
        2. error.log
          79 kB


          Activity

            euglorg Eugene created issue -
            euglorg Eugene made changes -
            stephanvos Stephan Vos added a comment -

            As far as I am aware, it's recommended to put the node on which the backup is taken into "desync/donor" mode before taking the backup.
            mysql -u admin --password=JUjmf5M6HNB8L4yy -e "SET GLOBAL wsrep_desync = ON;"

            That is what we do when we run our backup process.
            I was not aware that mariabackup would automatically do this?

            My question then is:
            Is it, or isn't it, recommended to desync the node before taking the backup?
            I would prefer it not to be required, because then we can stream and compress our backup all in one step instead of having to:
            1. desync
            2. backup
            3. resync
            4. compress and copy the backup

            We are doing compress and copy separately to minimize desync period.
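
            A minimal sketch of those four steps as a script (credentials, paths and thread counts are placeholders; the compression runs only after the node has resynced, which keeps the desync window short):

                # 1. desync the node so flow control does not hold back the rest of the cluster
                mysql -u admin -p"$PASSWORD" -e "SET GLOBAL wsrep_desync = ON;"

                # 2. back up to a local directory
                mariabackup --backup --user=root --password="$PASSWORD" --galera-info \
                    --target-dir=/home/mariabackup/full 2>>/var/log/mariabackup_copy.log

                # 3. resync as soon as the backup itself is done
                mysql -u admin -p"$PASSWORD" -e "SET GLOBAL wsrep_desync = OFF;"

                # 4. compress and copy the backup after the node is back in sync
                tar -C /home/mariabackup -cf - full | zstd --fast -T8 -q -o /home/mariabackup/backup.tar.zst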

            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 124361 ] MariaDB v4 [ 143098 ]
            elenst Elena Stepanova made changes -
            Fix Version/s 10.4 [ 22408 ]
            Assignee Jan Lindström [ jplindst ]
            julien.fritsch Julien Fritsch made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]
            jplindst Jan Lindström (Inactive) made changes -
            Assignee Jan Lindström [ jplindst ] Seppo Jaakola [ seppo ]
            valerii Valerii Kravchuk made changes -
            Priority Minor [ 4 ] Major [ 3 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            seppo Seppo Jaakola made changes -
            Status Confirmed [ 10101 ] In Progress [ 3 ]
            seppo Seppo Jaakola added a comment -

            It is true that MDEV-23080 changed the behavior: the node now switches to a desynced and paused state when mariabackup acquires the backup stage BLOCK_DDL lock. The reason for this was that DDL replication could interfere with backup processing in this backup stage, and by pausing the node it was guaranteed that no DDL was executed on the node. However, this blocked all DML on the node at the same time.

            It might be possible to optimize this behavior by one of the following alternatives:

            • refactor mariabackup to lift the requirement to block DDL
            • not pausing the node, but honoring the BLOCK_DDL locking. This would have the side effect that the cluster would freeze if any DDL happens during mariabackup
            • not pausing the node, but letting replication abort the ongoing backup process if it holds a conflicting BLOCK_DDL lock
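
            For orientation, a simplified sketch of the server-side stage sequence that mariabackup drives (shown here as one mysql call; the real tool holds these locks on its own connection while it copies files between the stages, and the details are not guaranteed to match every version). The desync and pause discussed in this issue happen around the DDL-blocking stage:

                mysql -u root -p"$PASSWORD" -e "
                    BACKUP STAGE START;        -- backup begins, redo log copying already runs
                    BACKUP STAGE FLUSH;        -- flush tables that are not in use
                    BACKUP STAGE BLOCK_DDL;    -- non-InnoDB files are copied here; DDL is blocked
                                               -- (with Galera, the node is desynced and paused here)
                    BACKUP STAGE BLOCK_COMMIT; -- short commit block while the final position is read
                    BACKUP STAGE END;          -- locks are released, the node resumes and resyncs
                "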
            seppo Seppo Jaakola made changes -
            Status In Progress [ 3 ] Needs Feedback [ 10501 ]
            valerii Valerii Kravchuk made changes -
            Status Needs Feedback [ 10501 ] Open [ 1 ]
            euglorg Eugene added a comment -

            IMHO, options 2 and 3 (which lead to a whole-cluster block or a backup failure as soon as a single conflicting DDL query runs) are a bad idea. For example, in recent days the "desync" phase during the backup has lasted several minutes. Having the cluster blocked for all this time can easily lead to client application downtime of a similar length. So the only way left is to adjust mariabackup itself.

            BTW, while performing the backup on a single-node cluster (but with the WSREP library loaded and active), the only node of the cluster is also put into the "desync" state during the backup, making client applications fail for the whole period the node remains desynced.

            seppo Seppo Jaakola made changes -
            Status Open [ 1 ] Needs Feedback [ 10501 ]
            valerii Valerii Kravchuk made changes -
            Status Needs Feedback [ 10501 ] Open [ 1 ]
            seppo Seppo Jaakola made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            jplindst Jan Lindström (Inactive) made changes -
            ralf.gebhardt Ralf Gebhardt made changes -
            Affects Version/s 10.4.21 [ 26030 ]
            Environment Linux 5.10.58-gentoo x86_64 AMD EPYC 7451
            Issue Type Bug [ 1 ] Task [ 3 ]
            ccalender Chris Calender (Inactive) made changes -
            Assignee Seppo Jaakola [ seppo ] Jan Lindström [ jplindst ]
            euglorg Eugene added a comment - - edited

            Looking at all this discussion, I'd now like to ask for advice.
            For technical reasons we have had to reduce the cluster to one node.
            We use wsrep_sst_method=mariabackup because, according to the documentation, it's the only method that allows SST without taking the donor node down, as it really was.
            However, with the MariaDB behavior we have now, it is impossible to add another node to the cluster without significant downtime; for instance, today it was 15 minutes:

            2022-08-18  6:28:39 379299 [Note] WSREP: Desyncing and pausing the provider
            2022-08-18  6:28:39 379299 [Note] WSREP: pause
            2022-08-18  6:28:39 379299 [Note] WSREP: Provider paused at 0ca58cd7-821e-11ea-aec3-23d843ed4dfb:8533940816 (1818998)
            2022-08-18  6:28:39 379299 [Note] WSREP: Provider paused at: 8533940816
            2022-08-18  6:45:53 379299 [Note] WSREP: Resuming and resyncing the provider
            2022-08-18  6:45:53 379299 [Note] WSREP: resume
            2022-08-18  6:45:53 379299 [Note] WSREP: resuming provider at 1818998
            2022-08-18  6:45:53 379299 [Note] WSREP: Provider resumed.
            2022-08-18  6:45:56 0 [Note] WSREP: SST sent: 0ca58cd7-821e-11ea-aec3-23d843ed4dfb:8532294770
            2022-08-18  6:45:56 0 [Note] WSREP: Server status change donor -> joined
            2022-08-18  6:45:56 0 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
            WSREP_SST: [INFO] Total time on donor: 0 seconds (20220818 06:45:56.210)
            WSREP_SST: [INFO] mariabackup SST completed on donor (20220818 06:45:56.231)
            WSREP_SST: [INFO] Cleaning up temporary directories (20220818 06:45:56.247)
            2022-08-18  6:45:56 0 [Note] WSREP: Donor monitor thread ended with total time 20994 sec
            

            This was the only node in the cluster, so the cluster and everything relying on SQL was simply offline for the whole 15 minutes from 6:28 till 6:45. So mariabackup is no longer a method that lets one join another node without stopping the donating one.
            The question is: is there any method to join a node to the cluster without downtime, or has there been no such method since the date this bug was reported?
            Again, before this bug appeared we performed SST many times without problems or related application downtime, until it got broken a year ago. Unfortunately, since that day every backup run is a problem.
            Can anyone advise a workaround if this can't be fixed?
            Maybe there's some better SST method that can be used without guaranteed downtime?

            ccalender Chris Calender (Inactive) made changes -
            Fix Version/s 10.4 [ 22408 ]
            stephanvos Stephan Vos added a comment - - edited

            @Seppo will this new flag also allow SST not to block the donor node, as was the case before 10.5.12?

            stephanvos Stephan Vos added a comment -

            How does the current situation affect the SST method = MariaBackup? Will it freeze the donor node?
            Any chance this feature will make it into the next release of 10.5/10.6?

            jplindst Jan Lindström (Inactive) made changes -
            Fix Version/s 10.4 [ 22408 ]
            violuke Luke Cousins added a comment -

            Yes, it will freeze the donor node; it's pretty serious! We've found that if you can restart the donor node before making the SST, you can then complete the SST without a problem. This often means needing a minute of downtime, as you find yourself all the way down to 1 node and a fresh bootstrap. Not good really, and we are very much looking forward to this being fixed.

            stephanvos Stephan Vos added a comment -

            Thanks for confirming, Luke.
            Yes, this is not good and will be a serious problem for us as well.

            stephanvos Stephan Vos added a comment - - edited

            I just did a test with a 2-node cluster, both nodes on 10.5.17.

            I stopped node 2, removed the datadir and then started it again, after which it proceeded to do an SST via MariaBackup.
            Node 1 (donor) showed wsrep_local_state_comment=Donor/Desynced, but I could insert into tables while it was busy being the donor, so it was actually not blocking like I expected it to!

            And the SST was successful.

            I also tested a backup and do see the mentioned:
            "WSREP: Desyncing and pausing the provider" at the end.
            However in our case this only lasted a couple of seconds as we don't have any non-InnoDB tables.
            I also created a table during this backup which was picked up by the DDL monitor thread:
            [00] 2022-09-27 19:35:03 DDL tracking : create 11830 "./opmon/svt.ibd"

            Furthermore I have always issued:
            SET GLOBAL wsrep_desync = ON
            before executing mariabackup as was recommended to us when we started using it in 2019.
            I have never seen this cause any issues and as per my test above it does allow DML to take place.

            euglorg Eugene added a comment -

            The impact really depends on how busy the cluster is.
            The problem is when you have a big dataset and intensive writes.
            In our case SST takes 3-5 hours. For all this time the node is desynced but serves requests, even creating new databases. The problem is the final phase. The longer SST takes, the longer the period between the "Desyncing and pausing the provider" and "resuming" events will be. In case there are several nodes in the cluster already, the donor will trigger flow control. But in case the cluster has just bootstrapped and the donating node is the only consistent one (say, the second one has its datadir wiped for some reason), for all the time between "pausing" and "resuming" the joining node is not yet ready to serve requests, and the donating one has the provider paused. As a result, all requests are stuck and in fact all applications relying on MariaDB are down. The longer the SST lasts, the longer the downtime is.
            Can't test the 10.5 branch at the moment, but the latest 10.4 has this behavior, so SST becomes something of a disaster.

            Also, I am really not sure whether it's safe to set "wsrep_desync = ON" on a cluster that has a node paused for 3-5 minutes and an amount of writes that would trigger node desync within a few seconds. If writes are not intensive, one can pause any component for any time safely. Will the donor perform IST once mariabackup is done?

            jplindst Jan Lindström (Inactive) added a comment - - edited

            ramesh Can you do testing for this fix:

            • branch: bb-10.6-MDEV-26391-galera
            • Using defaults, set load on both donor and joiner, start a backup, and verify that the donor desyncs.
            • Using defaults, set load on both donor and joiner, start DDL and then a backup, and verify that the donor desyncs and that the DDL does not crash.
            • Using SET GLOBAL wsrep_mode='BF_ABORT_MARIABACKUP', set load on both donor and joiner, start a backup, and verify that the donor does not desync.
            • Using SET GLOBAL wsrep_mode='BF_ABORT_MARIABACKUP', set load on both donor and joiner, start DDL (if possible) and a backup, and verify that the backup fails (hopefully with some clear error message).
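
            A minimal sketch of how a backup run with the new mode could look, assuming the variable name used above (note that SET GLOBAL wsrep_mode=... replaces the whole current wsrep_mode value; paths and credentials are placeholders):

                # enable the mode so replication BF-aborts the backup instead of pausing the provider
                mysql -u root -p"$PASSWORD" -e "SET GLOBAL wsrep_mode='BF_ABORT_MARIABACKUP';"

                # run the backup; the donor is expected not to desync, but the backup
                # is expected to fail if a conflicting DDL is replicated while it runs
                mariabackup --backup --user=root --password="$PASSWORD" \
                    --target-dir=/home/mariabackup/full

                # restore the default (empty) wsrep_mode afterwards
                mysql -u root -p"$PASSWORD" -e "SET GLOBAL wsrep_mode=DEFAULT;"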
            jplindst Jan Lindström (Inactive) made changes -
            Status In Progress [ 3 ] In Testing [ 10301 ]
            jplindst Jan Lindström (Inactive) made changes -
            Assignee Jan Lindström [ jplindst ] Ramesh Sivaraman [ JIRAUSER48189 ]
            stephanvos Stephan Vos added a comment -

            Eugene
            From my understanding, all "wsrep_desync = ON" does is disable flow control so that the remaining nodes are not held up by the backup.
            The DONOR/node performing the backup will still accept and apply writesets.
            However, I think that "WSREP: Desyncing and pausing the provider" at the end of the backup/SST does more than that and actually pauses replication, which will be a problem for client applications.

            As to IST, I would expect that the DONOR does not need to perform it, as it applies writesets during SST.
            At the end of the SST the DONOR node displays "WSREP: async IST sender served" (not sure why, as it performed an SST and there was no indication that an IST was performed after the SST).

            euglorg Eugene added a comment - - edited

            According to the documentation, flow control is triggered when the node's buffer is full and it can't accept more writesets. So with "wsrep_desync = ON" it will not trigger flow control (and thus will not pause replication in the cluster), but new writesets can't be put into the buffer, so they can't be accepted and applied; thus the node desyncs and falls out of the cluster, and the client application should check data consistency on such a node. So this would also be a problem, and at least an IST will be needed for the node to catch up. This is the case when you are performing a backup from the node.

            In case it's the only consistent node in the cluster and you are joining a second one, the flow control event doesn't happen, as in fact there's no other node that could create writesets. Instead, you will have client-handling threads stuck, so the number of connected clients and threads will increase up to the maximum number of permitted client connections, and then the node simply stops accepting new connections and requests.

            And yes, the "Desyncing and pausing the provider" event renders the node unusable for client applications until it resumes. Unfortunately, this has already caused a considerable amount of downtime.
            By the way, there was no problem and no downtime caused by SST before the "fix" in 10.4.21 that forced me to report this bug.

            stephanvos Stephan Vos added a comment -

            I don't completely agree that a desynced/donor node cannot apply writesets.
            That only seems to be the case when flow control kicks in or the node is paused (at the end of the backup, for example).
            Or have I misunderstood your comment with regard to desync?

            https://www.percona.com/blog/2016/11/16/all-you-need-to-know-about-gcache-galera-cache/
            What if one of the node is DESYNCED and PAUSED?
            If a node desyncs, it will continue to received write-sets and apply them, so there is no major change in gcache handling.
            If the node is desynced and paused, that means the node can’t apply write-sets and needs to keep caching them. This will, of course, affect the desynced/paused node and the node will continue to create on-demand page store. Since one of the cluster nodes can’t proceed, it will not emit a “last committed” message. In turn, other nodes in the cluster (that can purge the entry) will continue to retain the write-sets, even if these nodes are not desynced and paused.

            I did a test again now to confirm the above article:
            1. "SET GLOBAL wsrep_desync = ON" on node2 (wsrep_local_state_comment = Donor/Desynced)
            2. Update a record in a table on node1
            3. Select from table on node2 and confirmed that the update has been applied
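
            As a sketch, the same check scripted against two nodes (hostnames, credentials and the table test.t1 are placeholders):

                # 1. desync node2 and confirm its state
                mysql -h node2 -u admin -p"$PASSWORD" -e "SET GLOBAL wsrep_desync = ON;"
                mysql -h node2 -u admin -p"$PASSWORD" -N -e \
                    "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"

                # 2. update a record on node1
                mysql -h node1 -u admin -p"$PASSWORD" -e \
                    "UPDATE test.t1 SET val = val + 1 WHERE id = 1;"

                # 3. confirm on node2 that the writeset was applied despite the desync
                mysql -h node2 -u admin -p"$PASSWORD" -e \
                    "SELECT val FROM test.t1 WHERE id = 1;"

                # revert
                mysql -h node2 -u admin -p"$PASSWORD" -e "SET GLOBAL wsrep_desync = OFF;"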

            euglorg Eugene added a comment -

            If a node is just desynced, it does apply writesets and processes requests. The only thing you can't do on a desynced node is start mariabackup. In case the node is donating, running mariabackup will make the node report "WSREP not ready".
            The state is "desynced", but the node usually still has consistent data and processes requests.
            The problem happens when the node a) runs mariabackup (which normally doesn't trigger the "desynced" state) or performs SST (which always triggers the "desynced" state, but the node still processes requests and participates in replication) and b) accepts writesets faster than it can apply them, for example when a big number of small tables or databases is backed up or sent to the joining node. In this case the node will either trigger flow control (replication will be paused, causing writing threads to get stuck on the whole cluster) or, if for some reason the node is forced not to trigger the flow control event, it will fall behind the cluster with inconsistent data and might require an IST from a consistent one.
            In case the node is the only one in the cluster, it will process queries anyway during the SST or backup, regardless of the "donor/desynced" state, until it reaches the "Desyncing and pausing the provider" event. The node can be slow and writing threads may be stuck for periods, but they will still be processed until the "Desyncing and pausing the provider" event.
            But between the "Desyncing and pausing the provider" and "resuming" events no requests are processed, and for clients MariaDB is completely down. This is the problem I was initially talking about: the gap can be over 10 minutes, which is critical for client applications. And this event did not happen during SST or backup before 10.4.21.
            In fact, the question of whether to set "wsrep_desync = ON" is a completely different thing. Its name suggests "desyncing", but in fact it is applied to a node that is already desynced. Sorry for the miscommunication.
            So, the question now is: is there a way to avoid that "pausing" of the node, and thus the client application downtime, after all the changes made and discussed here?

            stephanvos Stephan Vos added a comment -

            OK, now we are on the same page.
            It seems then that perhaps I do not need to set "wsrep_desync = ON" during our backup process, as the write/read load is quite low at the time of the backup.

            I would expect that this new feature, SET GLOBAL wsrep_mode='BF_ABORT_MARIABACKUP', would bypass the "pause" at the end of the backup and instead just abort the backup if DDL (the original reason for implementing the pause) is performed on the donor/backup source during the backup process.
            This does mean that during SST you still would not want any DDL to happen on the cluster, as the SST will otherwise abort.

            @Jan
            My question then would be: will this wsrep_mode setting also take effect during SST? I would assume this to be the case.

            Also, would using an older version of mariabackup (pre 10.4.21/10.5.12) perhaps be a viable interim option? Not sure if this is possible.


            ramesh Ramesh Sivaraman added a comment -

            jplindst All test cases look good in the bug fix branch. The backup failed in the last test case with the following message; the backup failed while reading the LSN after copying all tables.

            [00] 2022-10-03 08:15:53 Finished backing up non-InnoDB tables and files
            [01] 2022-10-03 08:15:53 Copying ./aria_log_control to /home/vagrant/backup/aria_log_control
            [01] 2022-10-03 08:15:53         ...done
            [01] 2022-10-03 08:15:53 Copying ./aria_log.00000001 to /home/vagrant/backup/aria_log.00000001
            [01] 2022-10-03 08:15:53         ...done
            [00] FATAL ERROR: 2022-10-03 08:15:53 failed to execute query SELECT COUNT(*) FROM information_schema.plugins WHERE plugin_name='rocksdb': Server has gone away
            vagrant@node1:~$ 
            

            ramesh Ramesh Sivaraman made changes -
            Assignee Ramesh Sivaraman [ JIRAUSER48189 ] Jan Lindström [ jplindst ]
            Status In Testing [ 10301 ] Stalled [ 10000 ]
            ramesh Ramesh Sivaraman added a comment - - edited

            jplindst seppo When we run an RQG load (OLTP DDL load) and enable wsrep_mode='BF_ABORT_MARIABACKUP', the backup fails with the following error:

            [00] 2022-10-06 10:25:03 Copying ./test/oltp14#P#p0.ibd to /test/mtest/GAL_MD061022-mariadb-10.6.11-linux-x86_64-opt/backup/test/oltp14#P#p0.new
            [00] 2022-10-06 10:25:03         ...done
            [00] 2022-10-06 10:25:03 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
            [00] FATAL ERROR: 2022-10-06 10:25:03 failed to execute query FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS: Server has gone away
            

            test case

            perl gendata.pl --dsn=dbi:mysql:host=127.0.0.1:port=11160:socket=/test/mtest/GAL_MD061022-mariadb-10.6.11-linux-x86_64-opt/node1/node1_socket.sock:user=root:database=test --spec=conf/mariadb/oltp.zz
             
            perl gentest.pl --dsn=dbi:mysql:host=127.0.0.1:port=11160:socket=/test/mtest/GAL_MD061022-mariadb-10.6.11-linux-x86_64-opt/node1/node1_socket.sock:user=root:database=test --grammar=conf/mariadb/oltp_and_ddl.yy --threads=32 --duration=1000 --queries=100000000 &
             
            Initiate backup
             
            /test/mtest/GAL_MD061022-mariadb-10.6.11-linux-x86_64-opt/bin/mariabackup  --backup --user='root' --socket='/test/mtest/GAL_MD061022-mariadb-10.6.11-linux-x86_64-opt/node1/node1_socket.sock' --target-dir='/test/mtest/GAL_MD061022-mariadb-10.6.11-linux-x86_64-opt/backup' --datadir=/test/mtest/GAL_MD061022-mariadb-10.6.11-linux-x86_64-opt/node1
            

            Attaching error.log and the backtrace bt_all.txt from the running process.

            ramesh Ramesh Sivaraman made changes -
            Attachment bt_all.txt [ 65590 ]
            Attachment error.log [ 65591 ]
            ramesh Ramesh Sivaraman made changes -
            Assignee Jan Lindström [ jplindst ] Seppo Jaakola [ seppo ]
            seppo Seppo Jaakola made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            seppo Seppo Jaakola added a comment -

            When using the new mode, wsrep_mode='BF_ABORT_MARIABACKUP', mariabackup execution is supposed to be aborted if mariabackup and replication encounter a DDL-level conflict. And this now happens in the above RQG testing. What would be the desired result of this test, taking into account that mariabackup must yield one way or the other?

            seppo Seppo Jaakola added a comment -

            Looking more closely at the timestamps of the cluster pause duration in the original description:

            2021-08-17 2:54:34 18009831 [Note] WSREP: Desyncing and pausing the provider
            ...
            2021-08-17 2:57:56 18009831 [Note] WSREP: Resuming and resyncing the provider

            gives ~3.5 minutes, which IMO is too long; the DDL blocking should be short term. Currently mariabackup enters the BLOCK_DDL stage early, and this DDL-blocking state is not lifted until the backup is complete. mariabackup should be investigated to see whether it is possible to release the DDL blocking earlier, before the actual backup end stage.


            ramesh Ramesh Sivaraman added a comment -

            seppo In the above test case, I want to confirm that the backup failed correctly when wsrep_mode='BF_ABORT_MARIABACKUP' is set.

            ramesh Ramesh Sivaraman added a comment - - edited

            seppo I couldn't see an unusual cluster pause duration in 10.4 on a local box. Used RQG and sysbench for the DDL/OLTP load:

            2022-12-06 13:04:43 77 [Note] WSREP: Desyncing and pausing the provider
            2022-12-06 13:04:43 0 [Note] WSREP: Member 1.0 (galapq) desyncs itself from group
            2022-12-06 13:04:43 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 90431)
            2022-12-06 13:04:43 53 [Note] Detected table cache mutex contention at instance 1: 38% waits. Additional table cache instance activated. Number of instances after activation: 2.
            2022-12-06 13:04:43 77 [Note] WSREP: pause
            2022-12-06 13:04:45 77 [Note] WSREP: Provider paused at 9d84373d-7553-11ed-b8cf-db82095f3faf:90508 (12873)
            2022-12-06 13:04:45 77 [Note] WSREP: Provider paused at: 90508
            2022-12-06 13:04:47 77 [Note] WSREP: Resuming and resyncing the provider
            2022-12-06 13:04:47 77 [Note] WSREP: resume
            2022-12-06 13:04:47 77 [Note] WSREP: resuming provider at 12873
            2022-12-06 13:04:47 77 [Note] WSREP: Provider resumed.
            2022-12-06 13:04:47 0 [Note] WSREP: Member 1.0 (galapq) resyncs itself to group.
            2022-12-06 13:04:47 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 90730)
            2022-12-06 13:04:47 0 [Note] WSREP: Processing event queue:... 0.0% ( 0/186 events) complete.
            2022-12-06 13:04:48 69 [Note] Detected table cache mutex contention at instance 2: 66% waits. Additional table cache instance activated. Number of instances after activation: 3.
            2022-12-06 13:04:57 0 [Warning] WSREP: Failed to report last committed 9d84373d-7553-11ed-b8cf-db82095f3faf:91080, -110 (Connection timed out)

            seppo Seppo Jaakola made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            jplindst Jan Lindström (Inactive) made changes -
            Assignee Seppo Jaakola [ seppo ] Jan Lindström [ jplindst ]
            jplindst Jan Lindström (Inactive) made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            jplindst Jan Lindström (Inactive) made changes -
            issue.field.resolutiondate 2023-01-17 09:38:58.0 2023-01-17 09:38:58.972
            jplindst Jan Lindström (Inactive) made changes -
            Fix Version/s 10.6.12 [ 28513 ]
            Fix Version/s 10.7.8 [ 28515 ]
            Fix Version/s 10.9.5 [ 28519 ]
            Fix Version/s 10.10.3 [ 28521 ]
            Fix Version/s 10.4 [ 22408 ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]
            stephanvos Stephan Vos added a comment -

            @Eugene
            Have your backup problems been solved and are you using this new parameter?
            This parameter does not seem to be present in the 10.5 series, but we are still using wsrep_desync=ON as part of the backup and have not encountered any issues in our case.

            euglorg Eugene added a comment -

            We remain on the default settings, so the backup still causes the node to desync for ~5 minutes:

            2023-06-16  4:07:44 20156184 [Note] WSREP: Desyncing and pausing the provider
            2023-06-16  4:07:44 0 [Note] WSREP: Member 0.0 (host_c) desyncs itself from group
            2023-06-16  4:07:44 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 1056620529)
            2023-06-16  4:07:44 20156184 [Note] WSREP: pause
            2023-06-16  4:07:44 20156184 [Note] WSREP: Provider paused at eebbcc99-e2e2-1111-8484-ffdd99eb2a58:1056620529 (51043020)
            2023-06-16  4:07:44 20156184 [Note] WSREP: Provider paused at: 1056620529
            2023-06-16  4:12:50 20156184 [Note] WSREP: Resuming and resyncing the provider
            2023-06-16  4:12:50 20156184 [Note] WSREP: resume
            2023-06-16  4:12:50 20156184 [Note] WSREP: resuming provider at 51043020
            2023-06-16  4:12:50 20156184 [Note] WSREP: Provider resumed.
            2023-06-16  4:12:50 0 [Note] WSREP: Member 0.0 (host_c) resyncs itself to group.
            2023-06-16  4:12:50 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 1056652398)
            2023-06-16  4:12:50 0 [Note] WSREP: Processing event queue:...  0.0% (    0/31822 events) complete.
            2023-06-16  4:13:00 30 [Note] WSREP: Processing event queue:... 48.6% (16720/34391 events) complete.
            2023-06-16  4:13:05 0 [Note] WSREP: Member 0.0 (host_c) synced with group.
            2023-06-16  4:13:05 0 [Note] WSREP: Processing event queue:...100.0% (35590/35590 events) complete.
            2023-06-16  4:13:05 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 1056655266)
            2023-06-16  4:13:05 2 [Note] WSREP: Server host_c synced with group
            

            Fortunately, there has been no need to run the cluster on a single node for long periods in the last months, so the other nodes simply handle application requests while one is performing the backup.
            However, an option to stop the backup if only one node is running and a DDL query is issued might be good, so we will probably use it in the future.

            ehontz Eric Hontz added a comment -

            I tried out the new wsrep_mode = BF_ABORT_MARIABACKUP setting and found that, when using wsrep_sst_method = mariabackup, a node never returns from Donor/Desynced to Synced after serving as an SST donor.

            Please see my comment on the commit: https://github.com/MariaDB/server/commit/95de5248c7f59f96039f96f5442142c79da27b20#r121407370


            janlindstrom Jan Lindström added a comment -

            ehontz I looked at it and do not yet see a problem. Can you open a new bug and provide the error logs from all nodes, the node configuration and, if you can, the processlist from the donor?

            ehontz Eric Hontz added a comment -

            @janlindstrom,
            I will open a new bug and provide details.

            I'm able to reliably reproduce using a docker-compose environment.

            ehontz Eric Hontz added a comment -

            @janlindstrom: I opened MDEV-31737

            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 180160 115217 162633

            People

              Assignee: jplindst Jan Lindström (Inactive)
              Reporter: euglorg Eugene
              Votes: 8
              Watchers: 17
