Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-15442

xtrabackup-v2/mariabackup SST donor stuck in DONOR/DESYNCED state when joiner is killed

    Details

      Description

      A user experienced an issue where a joiner node was performing an SST using xtrabackup-v2. The joiner node appeared to be killed while the SST was ongoing. This seemed to cause the donor node to get stuck in the donor state for over an hour.

      This seems to be similar to MDEV-9554.

      We can see that the SST started out normally on the donor:

      2018-02-28 16:55:17 139977323177728 [Note] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com'. Selected 2.1 (node000000499.domain.com)(SYNCED) as donor.
      2018-02-28 16:55:17 139977323177728 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 564418265)
      2018-02-28 16:55:17 139979096304384 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2018-02-28 16:55:17 139698538264320 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.1.1.77:4444/xtrabackup_sst//1' --socket '/mariadb/mysql.sock' --datadir '/mariadb/data/' --binlog '/mariadb/binlogs/mysql-bin' --gtid '718503a8-3c6d-11e6-843e-9e8fbaf5936d:564418265' --gtid-domain-id '0''
      2018-02-28 16:55:17 139979096304384 [Note] WSREP: sst_donor_thread signaled with 0
      WSREP_SST: [INFO] Streaming with xbstream (20180228 16:55:17.941)
      WSREP_SST: [INFO] Using socat as streamer (20180228 16:55:17.945)
      WSREP_SST: [INFO] Using openssl based encryption with socat: with key and crt (20180228 16:55:17.954)
      WSREP_SST: [INFO] Encrypting with cert=/mariadb/conf/mariadbSST.pem, key=/mariadb/conf/mariadbSST.pem, cafile=/mariadb/source/dbautils/templates//etc/ca.pem (20180228 16:55:17.958)
      WSREP_SST: [INFO] Using /tmp/tmp.74s1DWyGqh as innobackupex temporary directory (20180228 16:55:17.975)
      WSREP_SST: [INFO] Streaming GTID file before SST (20180228 16:55:17.979)
      WSREP_SST: [INFO] Evaluating xbstream -c ${INFO_FILE} | socat -u stdio openssl-connect:node000002512.domain.com:4444,cert=/mariadb/conf/mariadbSST.pem,key=/mariadb/conf/mariadbSST.pem,cafile=/mariadb/source/dbautils/templates//etc/ca.pem; RC=( ${PIPESTATUS[@]} ) (20180228 16:55:17.983)
      WSREP_SST: [INFO] Sleeping before data transfer for SST (20180228 16:55:18.081)
      2018-02-28 16:55:19 139977333667584 [Note] WSREP: (a4daa93d, 'tcp://0.0.0.0:4567') turning message relay requesting off
      WSREP_SST: [INFO] Streaming the backup to joiner at 10.1.1.77 4444 (20180228 16:55:28.087)
      WSREP_SST: [INFO] Evaluating innobackupex --no-version-check $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2>${DATA}/innobackup.backup.log | socat -u stdio openssl-connect:node000002512.domain.com:4444,cert=/mariadb/conf/mariadbSST.pem,key=/mariadb/conf/mariadbSST.pem,cafile=/mariadb/source/dbautils/templates//etc/ca.pem; RC=( ${PIPESTATUS[@]} ) (20180228 16:55:28.090)
      

      But then there was a connectivity issue with the joiner node:

      2018-02-28 17:00:02 139977333667584 [Note] WSREP: (a4daa93d, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.1.1.77:4567
      2018-02-28 17:00:03 139977333667584 [Note] WSREP: (a4daa93d, 'tcp://0.0.0.0:4567') reconnecting to 265867d1 (tcp://10.1.1.77:4567), attempt 0
      2018-02-28 17:00:07 139977333667584 [Note] WSREP: evs::proto(a4daa93d, GATHER, view_id(REG,265867d1,102)) suspecting node: 265867d1
      2018-02-28 17:00:07 139977333667584 [Note] WSREP: evs::proto(a4daa93d, GATHER, view_id(REG,265867d1,102)) suspected node without join message, declaring inactive
      2018-02-28 17:00:08 139977333667584 [Note] WSREP: declaring 8d81bba2 at tcp://10.1.1.66:4567 stable
      2018-02-28 17:00:08 139977333667584 [Note] WSREP: Node 8d81bba2 state prim
      2018-02-28 17:00:08 139977333667584 [Note] WSREP: view(view_id(PRIM,8d81bba2,103) memb {
      8d81bba2,0
      a4daa93d,1
      } joined {
      } left {
      } partitioned {
      265867d1,0
      })
      

      From the joiner's log, it looks like mysqld was killed:

      WSREP_SST: [INFO] Waiting for SST streaming to complete! (20180228 16:55:21.135)
      180228 17:00:02 mysqld_safe mysqld from pid file /mariadb/logs/mariadb.pid ended
      

      It is not clear who or what killed mysqld. The syslogs do not seem to point to OOM killer.

      We can see that the donor node seemed to get stuck in the donor state due to this failure. So when the joiner was restarted and tried to request an SST again, the donor node would not allow it, because the node believed that it was already serving an SST. The donor's log contained a lot of messages like the following:

      2018-02-28 17:02:29 139977323177728 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
      2018-02-28 17:02:30 139977323177728 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
      2018-02-28 17:02:31 139977333667584 [Note] WSREP: (a4daa93d, 'tcp://0.0.0.0:4567') turning message relay requesting off
      2018-02-28 17:02:31 139977323177728 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
      

      It looks like the problem on the donor fixed itself after over an hour:

      2018-02-28 18:09:36 139977323177728 [Warning] WSREP: Could not find peer: 265867d1-1ca8-11e8-afe2-eaff2a8edd28
      2018-02-28 18:09:36 139977323177728 [Note] WSREP: 1.1 (node000000499.domain.com): State transfer to -1.-1 (left the group) complete.
      2018-02-28 18:09:36 139977323177728 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 564431662)
      WSREP_SST: [INFO] Cleaning up temporary directories (20180228 18:09:36.440)
      2018-02-28 18:09:36 139977323177728 [Note] WSREP: Member 1.1 (node000000499.domain.com) synced with group.
      2018-02-28 18:09:36 139977323177728 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 564431662)
      

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                mkaruza Mario Karuza
                Reporter:
                GeoffMontee Geoff Montee
              • Votes:
                2 Vote for this issue
                Watchers:
                9 Start watching this issue

                Dates

                • Created:
                  Updated: