MariaDB Server: MDEV-15442

xtrabackup-v2/mariabackup SST donor stuck in DONOR/DESYNCED state when joiner is killed

Details

    Description

      A user experienced an issue where a joiner node was performing an SST using xtrabackup-v2. The joiner node appeared to be killed while the SST was ongoing. This seemed to cause the donor node to get stuck in the donor state for over an hour.

      This seems to be similar to MDEV-9554.

      We can see that the SST started out normally on the donor:

      2018-02-28 16:55:17 139977323177728 [Note] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com'. Selected 2.1 (node000000499.domain.com)(SYNCED) as donor.
      2018-02-28 16:55:17 139977323177728 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 564418265)
      2018-02-28 16:55:17 139979096304384 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2018-02-28 16:55:17 139698538264320 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.1.1.77:4444/xtrabackup_sst//1' --socket '/mariadb/mysql.sock' --datadir '/mariadb/data/' --binlog '/mariadb/binlogs/mysql-bin' --gtid '718503a8-3c6d-11e6-843e-9e8fbaf5936d:564418265' --gtid-domain-id '0''
      2018-02-28 16:55:17 139979096304384 [Note] WSREP: sst_donor_thread signaled with 0
      WSREP_SST: [INFO] Streaming with xbstream (20180228 16:55:17.941)
      WSREP_SST: [INFO] Using socat as streamer (20180228 16:55:17.945)
      WSREP_SST: [INFO] Using openssl based encryption with socat: with key and crt (20180228 16:55:17.954)
      WSREP_SST: [INFO] Encrypting with cert=/mariadb/conf/mariadbSST.pem, key=/mariadb/conf/mariadbSST.pem, cafile=/mariadb/source/dbautils/templates//etc/ca.pem (20180228 16:55:17.958)
      WSREP_SST: [INFO] Using /tmp/tmp.74s1DWyGqh as innobackupex temporary directory (20180228 16:55:17.975)
      WSREP_SST: [INFO] Streaming GTID file before SST (20180228 16:55:17.979)
      WSREP_SST: [INFO] Evaluating xbstream -c ${INFO_FILE} | socat -u stdio openssl-connect:node000002512.domain.com:4444,cert=/mariadb/conf/mariadbSST.pem,key=/mariadb/conf/mariadbSST.pem,cafile=/mariadb/source/dbautils/templates//etc/ca.pem; RC=( ${PIPESTATUS[@]} ) (20180228 16:55:17.983)
      WSREP_SST: [INFO] Sleeping before data transfer for SST (20180228 16:55:18.081)
      2018-02-28 16:55:19 139977333667584 [Note] WSREP: (a4daa93d, 'tcp://0.0.0.0:4567') turning message relay requesting off
      WSREP_SST: [INFO] Streaming the backup to joiner at 10.1.1.77 4444 (20180228 16:55:28.087)
      WSREP_SST: [INFO] Evaluating innobackupex --no-version-check $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2>${DATA}/innobackup.backup.log | socat -u stdio openssl-connect:node000002512.domain.com:4444,cert=/mariadb/conf/mariadbSST.pem,key=/mariadb/conf/mariadbSST.pem,cafile=/mariadb/source/dbautils/templates//etc/ca.pem; RC=( ${PIPESTATUS[@]} ) (20180228 16:55:28.090)
      

      But then there was a connectivity issue with the joiner node:

      2018-02-28 17:00:02 139977333667584 [Note] WSREP: (a4daa93d, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.1.1.77:4567
      2018-02-28 17:00:03 139977333667584 [Note] WSREP: (a4daa93d, 'tcp://0.0.0.0:4567') reconnecting to 265867d1 (tcp://10.1.1.77:4567), attempt 0
      2018-02-28 17:00:07 139977333667584 [Note] WSREP: evs::proto(a4daa93d, GATHER, view_id(REG,265867d1,102)) suspecting node: 265867d1
      2018-02-28 17:00:07 139977333667584 [Note] WSREP: evs::proto(a4daa93d, GATHER, view_id(REG,265867d1,102)) suspected node without join message, declaring inactive
      2018-02-28 17:00:08 139977333667584 [Note] WSREP: declaring 8d81bba2 at tcp://10.1.1.66:4567 stable
      2018-02-28 17:00:08 139977333667584 [Note] WSREP: Node 8d81bba2 state prim
      2018-02-28 17:00:08 139977333667584 [Note] WSREP: view(view_id(PRIM,8d81bba2,103) memb {
      8d81bba2,0
      a4daa93d,1
      } joined {
      } left {
      } partitioned {
      265867d1,0
      })
      

      From the joiner's log, it looks like mysqld was killed:

      WSREP_SST: [INFO] Waiting for SST streaming to complete! (20180228 16:55:21.135)
      180228 17:00:02 mysqld_safe mysqld from pid file /mariadb/logs/mariadb.pid ended
      

      It is not clear who or what killed mysqld. The syslogs do not seem to point to the OOM killer.

      We can see that the donor node seemed to get stuck in the donor state due to this failure. So when the joiner was restarted and tried to request an SST again, the donor node would not allow it, because the node believed that it was already serving an SST. The donor's log contained a lot of messages like the following:

      2018-02-28 17:02:29 139977323177728 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
      2018-02-28 17:02:30 139977323177728 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
      2018-02-28 17:02:31 139977333667584 [Note] WSREP: (a4daa93d, 'tcp://0.0.0.0:4567') turning message relay requesting off
      2018-02-28 17:02:31 139977323177728 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
      

      It looks like the problem on the donor fixed itself after over an hour:

      2018-02-28 18:09:36 139977323177728 [Warning] WSREP: Could not find peer: 265867d1-1ca8-11e8-afe2-eaff2a8edd28
      2018-02-28 18:09:36 139977323177728 [Note] WSREP: 1.1 (node000000499.domain.com): State transfer to -1.-1 (left the group) complete.
      2018-02-28 18:09:36 139977323177728 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 564431662)
      WSREP_SST: [INFO] Cleaning up temporary directories (20180228 18:09:36.440)
      2018-02-28 18:09:36 139977323177728 [Note] WSREP: Member 1.1 (node000000499.domain.com) synced with group.
      2018-02-28 18:09:36 139977323177728 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 564431662)
      

      Activity

          Activity

            This is problematic, as the same donor could be serving several joiners. To fix this properly, there should be a watchdog with which the donor and joiner would register, and which would then regularly monitor whether both are alive and, if not, kill the other. Currently, there is no such watchdog. As the joiner is not part of the cluster before it has received the state from the donor, it will time out when so configured and stop. The donor is part of the cluster, but if that node is not reachable by the joiner, we can't add more nodes to the cluster. I do not know; perhaps we could call this "works as designed", i.e. after the timeout the joiner will give up and not come up.

            jplindst Jan Lindström (Inactive) added a comment

            This is problematic, as the same donor could be serving several joiners.

            As far as I know, a donor node can only serve an SST to one joiner node at a time with the current implementation of Galera.

            To fix this properly, there should be a watchdog with which the donor and joiner would register, and which would then regularly monitor whether both are alive and, if not, kill the other. Currently, there is no such watchdog.

            It sounds like this "watchdog" refers to a third server that is neither the donor node nor the joiner node. This seems like overkill.

            To fix this, it seems like we just need to implement something like the following:

            • Let's say that node X is acting as SST donor for node Y.
            • Node X starts up a wsrep_sst_mariabackup process to serve a backup to node Y.
            • Node X notices that node Y has left the cluster.
            • Node X notices that its own wsrep_sst_mariabackup process is still running.
            • Node X knows that its own wsrep_sst_mariabackup process can't do anything useful if node Y is no longer in the cluster.
            • Node X kills its own wsrep_sst_mariabackup process.

            Is there something wrong with this approach? It seems fairly straightforward.
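            The steps above boil down to a single check on the donor. As a minimal sketch (hypothetical helper name and parameters; this is not existing MariaDB code, just an illustration of the decision the donor would need to make):

```python
import os


def should_abort_sst(cluster_members, sst_joiner_id, sst_pid):
    """Return True when the joiner this SST serves has left the cluster
    view but the donor's SST script process is still running.

    Hypothetical helper: cluster_members is the set of node UUIDs in the
    current view, sst_joiner_id is the joiner's UUID recorded when the
    SST started, and sst_pid is the wsrep_sst script's process ID.
    """
    joiner_gone = sst_joiner_id not in cluster_members
    try:
        os.kill(sst_pid, 0)  # signal 0 only checks that the PID exists
        sst_alive = True
    except ProcessLookupError:
        sst_alive = False
    return joiner_gone and sst_alive
```

            With the membership from the donor's view above, once 265867d1 is partitioned out this check would come back true, and the donor could terminate the SST process instead of staying in DONOR/DESYNCED for over an hour.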

            GeoffMontee Geoff Montee (Inactive) added a comment

            Correct, a donor can serve only one joiner at a time. However, here is the problem: in your example, joiner Y is not part of the cluster before it has received the state. So node X will not see that Y has left the cluster, as Y has not yet joined. From node X's point of view, the only way to know is when a timeout happens.

            That's not true though. A joiner node joins the cluster before it even requests its SST. The joiner node just doesn't transition to the SYNCED state until after it has successfully received an SST.

            For example, here, we see a joiner node join the cluster:

            2018-12-03  4:29:52 140673198053408 [Note] WSREP: gcomm: connecting to group 'MariaDBCluster', peer 'node000002512.domain.com:,node000000499.domain.com:,node000002513.domain.com:'
            2018-12-03  4:29:52 140673198053408 [Note] WSREP: (13c0b219, 'tcp://0.0.0.0:4567') connection established to 13c0b219 tcp://10.1.1.66:4567
            2018-12-03  4:29:52 140673198053408 [Warning] WSREP: (13c0b219, 'tcp://0.0.0.0:4567') address 'tcp://10.1.1.66:4567' points to own listening address, blacklisting
            2018-12-03  4:29:52 140673198053408 [Note] WSREP: (13c0b219, 'tcp://0.0.0.0:4567') connection established to af7c90af tcp://10.1.1.77:4567
            2018-12-03  4:29:52 140673198053408 [Note] WSREP: (13c0b219, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
            2018-12-03  4:29:52 140673198053408 [Note] WSREP: (13c0b219, 'tcp://0.0.0.0:4567') connection established to 5b5327f9 tcp://10.1.1.21:4567
            2018-12-03  4:29:54 140673198053408 [Note] WSREP: declaring 5b5327f9 at tcp://10.1.1.21:4567 stable
            2018-12-03  4:29:54 140673198053408 [Note] WSREP: declaring af7c90af at tcp://10.1.1.77:4567 stable
            2018-12-03  4:29:54 140673198053408 [Note] WSREP: Node 5b5327f9 state prim
            2018-12-03  4:29:54 140673198053408 [Note] WSREP: view(view_id(PRIM,13c0b219,354) memb {
            	13c0b219,0
            	5b5327f9,1
            	af7c90af,0
            } joined {
            } left {
            } partitioned {
            })
            2018-12-03  4:29:54 140673198053408 [Note] WSREP: save pc into disk
            2018-12-03  4:29:54 140673198053408 [Note] WSREP: gcomm: connected
            2018-12-03  4:29:54 140673198053408 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
            2018-12-03  4:29:54 140673198053408 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
            2018-12-03  4:29:54 140673198053408 [Note] WSREP: Opened channel 'MariaDBCluster'
            

            And then the joiner node only requests its SST after it has already joined the cluster:

            2018-12-03  4:29:54 140671432148736 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 929354715)
            2018-12-03  4:29:54 140671411284736 [Note] WSREP: State transfer required: 
            	Group state: 718503a8-3c6d-11e6-843e-9e8fbaf5936d:929354715
            	Local state: 00000000-0000-0000-0000-000000000000:-1
            2018-12-03  4:29:54 140671411284736 [Note] WSREP: New cluster view: global state: 718503a8-3c6d-11e6-843e-9e8fbaf5936d:929354715, view# 115: Primary, number of nodes: 3, my index: 0, protocol version 3
            2018-12-03  4:29:54 140671411284736 [Warning] WSREP: Gap in state sequence. Need state transfer.
            2018-12-03  4:29:54 140671398704896 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '10.1.1.66' --datadir ' /mariadb_data /'   --parent '9645' --binlog '/mariadb_instance/binlogs/mysql-bin' '
            

            And then if the SST fails on the joiner node:

            WSREP_SST: [ERROR] Error while getting data from donor node:  exit codes: 130 1 (20181203 04:36:45.811)
            WSREP_SST: [ERROR] xtrabackup_checkpoints missing, failed innobackupex/SST on donor (20181203 04:36:45.814)
            WSREP_SST: [ERROR] Cleanup after exit with status:2 (20181203 04:36:45.817)
            2018-12-03  4:36:45 140671398704896 [ERROR] WSREP: Process completed with error: wsrep_sst_mariabackup --role 'joiner' --address '10.1.1.66' --datadir ' /mariadb_data /'   --parent '9645' --binlog '/mariadb_instance/binlogs/mysql-bin' : 2 (No such file or directory)
            2018-12-03  4:36:45 140671398704896 [ERROR] WSREP: Failed to read uuid:seqno and wsrep_gtid_domain_id from joiner script.
            2018-12-03  4:36:45 140673198053408 [ERROR] WSREP: SST failed: 2 (No such file or directory)
            2018-12-03  4:36:45 140673198053408 [ERROR] Aborting
            

            Then the donor's log shows the joiner node leaving the cluster:

            2018-12-03  4:36:46 140539697555200 [Note] WSREP: view(view_id(PRIM,5b5327f9,355) memb {
            	5b5327f9,1
            	af7c90af,0
            } joined {
            } left {
            } partitioned {
            	13c0b219,0
            })
            2018-12-03  4:36:46 140539697555200 [Note] WSREP: save pc into disk
            2018-12-03  4:36:46 140539697555200 [Note] WSREP: forgetting 13c0b219 (tcp://10.1.1.66:4567)
            

            So clearly:

            • A joiner node is already part of the cluster before it even requests an SST.
            • A donor node sees the joiner node leave the cluster when its SST fails.

            From the wsrep_sst_mariabackup script's point of view, it can know on both nodes whether the mysqld process is there or not. But currently it does not know if the network between node X and node Y is unreachable.

            I'm not sure that I understand what you're trying to say.

            Regardless, I don't expect the wsrep_sst_mariabackup script itself to have enough information to know when a node has left the cluster. That information is known by mysqld. Therefore, this fix would probably have to be implemented in mysqld, not in wsrep_sst_mariabackup.

            For example:

            • If node X is acting as an SST donor for node Y, then mysqld on node X will know that it is an SST donor.
            • If node Y leaves the cluster, then mysqld on node X will also know that node Y has left, even if the process running the wsrep_sst_mariabackup script for the SST on node X is still trying to send data to node Y.
            • mysqld on node X should also know the process ID of the process running the wsrep_sst_mariabackup script for the SST.
            • Therefore, if mysqld on node X notices that node Y has left the cluster, and if mysqld on node X also notices that the process running the wsrep_sst_mariabackup script for the SST is still alive, then it should kill that process.

            Is it feasible to make mysqld kill a leftover SST process if the SST was associated with a node that is no longer part of the cluster?
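            The bookkeeping described above can be sketched as follows (hypothetical class and method names, not actual mysqld internals): mysqld records the joiner ID and the SST script's process handle when the transfer starts, and a cluster view-change callback terminates the script if that joiner is gone:

```python
import subprocess


class DonorSstMonitor:
    """Hypothetical donor-side bookkeeping sketch: tie the running SST
    script process to the joiner it serves, and clean it up when a new
    cluster view no longer contains that joiner."""

    def __init__(self):
        self.joiner_id = None  # node UUID the SST was started for
        self.sst_proc = None   # handle for the SST script process

    def sst_started(self, joiner_id, sst_proc):
        self.joiner_id = joiner_id
        self.sst_proc = sst_proc

    def on_view_change(self, members):
        """Called with the node UUIDs in the new cluster view.
        Returns True if a leftover SST process was terminated."""
        if (self.sst_proc is not None
                and self.joiner_id not in members
                and self.sst_proc.poll() is None):  # script still running
            self.sst_proc.terminate()
            self.sst_proc.wait()
            self.joiner_id = None
            self.sst_proc = None
            return True
        return False
```

            In the scenario from the description, the view change at 17:00:08 (where 265867d1 moved to the partitioned list) would trigger the cleanup, rather than the donor waiting until 18:09 for the transfer to time out on its own.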

            GeoffMontee Geoff Montee (Inactive) added a comment

            seppo, is there anything we could do here?

            jplindst Jan Lindström (Inactive) added a comment

            10.1 is EOL

            jplindst Jan Lindström (Inactive) added a comment

            People

              jplindst Jan Lindström (Inactive)
              GeoffMontee Geoff Montee (Inactive)
              Votes: 2
              Watchers: 9

