Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Won't Fix
-
10.1.28, 10.1.35
Description
A user experienced an issue where a joiner node was performing an SST using xtrabackup-v2. The joiner node appeared to be killed while the SST was ongoing. This seemed to cause the donor node to get stuck in the donor state for over an hour.
This seems to be similar to MDEV-9554.
We can see that the SST started out normally on the donor:
2018-02-28 16:55:17 139977323177728 [Note] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com'. Selected 2.1 (node000000499.domain.com)(SYNCED) as donor.
|
2018-02-28 16:55:17 139977323177728 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 564418265)
|
2018-02-28 16:55:17 139979096304384 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
|
2018-02-28 16:55:17 139698538264320 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.1.1.77:4444/xtrabackup_sst//1' --socket '/mariadb/mysql.sock' --datadir '/mariadb/data/' --binlog '/mariadb/binlogs/mysql-bin' --gtid '718503a8-3c6d-11e6-843e-9e8fbaf5936d:564418265' --gtid-domain-id '0''
|
2018-02-28 16:55:17 139979096304384 [Note] WSREP: sst_donor_thread signaled with 0
|
WSREP_SST: [INFO] Streaming with xbstream (20180228 16:55:17.941)
|
WSREP_SST: [INFO] Using socat as streamer (20180228 16:55:17.945)
|
WSREP_SST: [INFO] Using openssl based encryption with socat: with key and crt (20180228 16:55:17.954)
|
WSREP_SST: [INFO] Encrypting with cert=/mariadb/conf/mariadbSST.pem, key=/mariadb/conf/mariadbSST.pem, cafile=/mariadb/source/dbautils/templates//etc/ca.pem (20180228 16:55:17.958)
|
WSREP_SST: [INFO] Using /tmp/tmp.74s1DWyGqh as innobackupex temporary directory (20180228 16:55:17.975)
|
WSREP_SST: [INFO] Streaming GTID file before SST (20180228 16:55:17.979)
|
WSREP_SST: [INFO] Evaluating xbstream -c ${INFO_FILE} | socat -u stdio openssl-connect:node000002512.domain.com:4444,cert=/mariadb/conf/mariadbSST.pem,key=/mariadb/conf/mariadbSST.pem,cafile=/mariadb/source/dbautils/templates//etc/ca.pem; RC=( ${PIPESTATUS[@]} ) (20180228 16:55:17.983)
|
WSREP_SST: [INFO] Sleeping before data transfer for SST (20180228 16:55:18.081)
|
2018-02-28 16:55:19 139977333667584 [Note] WSREP: (a4daa93d, 'tcp://0.0.0.0:4567') turning message relay requesting off
|
WSREP_SST: [INFO] Streaming the backup to joiner at 10.1.1.77 4444 (20180228 16:55:28.087)
|
WSREP_SST: [INFO] Evaluating innobackupex --no-version-check $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2>${DATA}/innobackup.backup.log | socat -u stdio openssl-connect:node000002512.domain.com:4444,cert=/mariadb/conf/mariadbSST.pem,key=/mariadb/conf/mariadbSST.pem,cafile=/mariadb/source/dbautils/templates//etc/ca.pem; RC=( ${PIPESTATUS[@]} ) (20180228 16:55:28.090)
|
But then there was a connectivity issue with the joiner node:
2018-02-28 17:00:02 139977333667584 [Note] WSREP: (a4daa93d, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.1.1.77:4567
|
2018-02-28 17:00:03 139977333667584 [Note] WSREP: (a4daa93d, 'tcp://0.0.0.0:4567') reconnecting to 265867d1 (tcp://10.1.1.77:4567), attempt 0
|
2018-02-28 17:00:07 139977333667584 [Note] WSREP: evs::proto(a4daa93d, GATHER, view_id(REG,265867d1,102)) suspecting node: 265867d1
|
2018-02-28 17:00:07 139977333667584 [Note] WSREP: evs::proto(a4daa93d, GATHER, view_id(REG,265867d1,102)) suspected node without join message, declaring inactive
|
2018-02-28 17:00:08 139977333667584 [Note] WSREP: declaring 8d81bba2 at tcp://10.1.1.66:4567 stable
|
2018-02-28 17:00:08 139977333667584 [Note] WSREP: Node 8d81bba2 state prim
|
2018-02-28 17:00:08 139977333667584 [Note] WSREP: view(view_id(PRIM,8d81bba2,103) memb {
|
8d81bba2,0
|
a4daa93d,1
|
} joined {
|
} left {
|
} partitioned {
|
265867d1,0
|
})
|
From the joiner's log, it looks like mysqld was killed:
WSREP_SST: [INFO] Waiting for SST streaming to complete! (20180228 16:55:21.135)
|
180228 17:00:02 mysqld_safe mysqld from pid file /mariadb/logs/mariadb.pid ended
|
It is not clear who or what killed mysqld. The syslogs do not seem to point to OOM killer.
We can see that the donor node seemed to get stuck in the donor state due to this failure. So when the joiner was restarted and tried to request an SST again, the donor node would not allow it, because the node believed that it was already serving an SST. The donor's log contained a lot of messages like the following:
2018-02-28 17:02:29 139977323177728 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
|
2018-02-28 17:02:30 139977323177728 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
|
2018-02-28 17:02:31 139977333667584 [Note] WSREP: (a4daa93d, 'tcp://0.0.0.0:4567') turning message relay requesting off
|
2018-02-28 17:02:31 139977323177728 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
|
It looks like the problem on the donor fixed itself after over an hour:
2018-02-28 18:09:36 139977323177728 [Warning] WSREP: Could not find peer: 265867d1-1ca8-11e8-afe2-eaff2a8edd28
|
2018-02-28 18:09:36 139977323177728 [Note] WSREP: 1.1 (node000000499.domain.com): State transfer to -1.-1 (left the group) complete.
|
2018-02-28 18:09:36 139977323177728 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 564431662)
|
WSREP_SST: [INFO] Cleaning up temporary directories (20180228 18:09:36.440)
|
2018-02-28 18:09:36 139977323177728 [Note] WSREP: Member 1.1 (node000000499.domain.com) synced with group.
|
2018-02-28 18:09:36 139977323177728 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 564431662)
|
Attachments
Issue Links
- relates to
-
MDEV-9554 Rsync SST donor stuck in DONOR/DESYNCED state when joiner crashes
- Closed
- links to