[MDEV-15442] xtrabackup-v2/mariabackup SST donor stuck in DONOR/DESYNCED state when joiner is killed Created: 2018-02-28  Updated: 2021-09-21  Resolved: 2021-09-21

Status: Closed
Project: MariaDB Server
Component/s: Galera, Galera SST, wsrep
Affects Version/s: 10.1.28, 10.1.35
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Geoff Montee (Inactive) Assignee: Jan Lindström (Inactive)
Resolution: Won't Fix Votes: 2
Labels: galera, sst, wsrep

Attachments: File two-socat-situation.log    
Issue Links:
Relates
relates to MDEV-9554 Rsync SST donor stuck in DONOR/DESYNC... Closed

 Description   

A user experienced an issue where a joiner node was performing an SST using xtrabackup-v2. The joiner node appeared to be killed while the SST was ongoing. This seemed to cause the donor node to get stuck in the donor state for over an hour.

This seems to be similar to MDEV-9554.

We can see that the SST started out normally on the donor:

2018-02-28 16:55:17 139977323177728 [Note] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com'. Selected 2.1 (node000000499.domain.com)(SYNCED) as donor.
2018-02-28 16:55:17 139977323177728 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 564418265)
2018-02-28 16:55:17 139979096304384 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-02-28 16:55:17 139698538264320 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.1.1.77:4444/xtrabackup_sst//1' --socket '/mariadb/mysql.sock' --datadir '/mariadb/data/' --binlog '/mariadb/binlogs/mysql-bin' --gtid '718503a8-3c6d-11e6-843e-9e8fbaf5936d:564418265' --gtid-domain-id '0''
2018-02-28 16:55:17 139979096304384 [Note] WSREP: sst_donor_thread signaled with 0
WSREP_SST: [INFO] Streaming with xbstream (20180228 16:55:17.941)
WSREP_SST: [INFO] Using socat as streamer (20180228 16:55:17.945)
WSREP_SST: [INFO] Using openssl based encryption with socat: with key and crt (20180228 16:55:17.954)
WSREP_SST: [INFO] Encrypting with cert=/mariadb/conf/mariadbSST.pem, key=/mariadb/conf/mariadbSST.pem, cafile=/mariadb/source/dbautils/templates//etc/ca.pem (20180228 16:55:17.958)
WSREP_SST: [INFO] Using /tmp/tmp.74s1DWyGqh as innobackupex temporary directory (20180228 16:55:17.975)
WSREP_SST: [INFO] Streaming GTID file before SST (20180228 16:55:17.979)
WSREP_SST: [INFO] Evaluating xbstream -c ${INFO_FILE} | socat -u stdio openssl-connect:node000002512.domain.com:4444,cert=/mariadb/conf/mariadbSST.pem,key=/mariadb/conf/mariadbSST.pem,cafile=/mariadb/source/dbautils/templates//etc/ca.pem; RC=( ${PIPESTATUS[@]} ) (20180228 16:55:17.983)
WSREP_SST: [INFO] Sleeping before data transfer for SST (20180228 16:55:18.081)
2018-02-28 16:55:19 139977333667584 [Note] WSREP: (a4daa93d, 'tcp://0.0.0.0:4567') turning message relay requesting off
WSREP_SST: [INFO] Streaming the backup to joiner at 10.1.1.77 4444 (20180228 16:55:28.087)
WSREP_SST: [INFO] Evaluating innobackupex --no-version-check $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2>${DATA}/innobackup.backup.log | socat -u stdio openssl-connect:node000002512.domain.com:4444,cert=/mariadb/conf/mariadbSST.pem,key=/mariadb/conf/mariadbSST.pem,cafile=/mariadb/source/dbautils/templates//etc/ca.pem; RC=( ${PIPESTATUS[@]} ) (20180228 16:55:28.090)

But then there was a connectivity issue with the joiner node:

2018-02-28 17:00:02 139977333667584 [Note] WSREP: (a4daa93d, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.1.1.77:4567
2018-02-28 17:00:03 139977333667584 [Note] WSREP: (a4daa93d, 'tcp://0.0.0.0:4567') reconnecting to 265867d1 (tcp://10.1.1.77:4567), attempt 0
2018-02-28 17:00:07 139977333667584 [Note] WSREP: evs::proto(a4daa93d, GATHER, view_id(REG,265867d1,102)) suspecting node: 265867d1
2018-02-28 17:00:07 139977333667584 [Note] WSREP: evs::proto(a4daa93d, GATHER, view_id(REG,265867d1,102)) suspected node without join message, declaring inactive
2018-02-28 17:00:08 139977333667584 [Note] WSREP: declaring 8d81bba2 at tcp://10.1.1.66:4567 stable
2018-02-28 17:00:08 139977333667584 [Note] WSREP: Node 8d81bba2 state prim
2018-02-28 17:00:08 139977333667584 [Note] WSREP: view(view_id(PRIM,8d81bba2,103) memb {
8d81bba2,0
a4daa93d,1
} joined {
} left {
} partitioned {
265867d1,0
})

From the joiner's log, it looks like mysqld was killed:

WSREP_SST: [INFO] Waiting for SST streaming to complete! (20180228 16:55:21.135)
180228 17:00:02 mysqld_safe mysqld from pid file /mariadb/logs/mariadb.pid ended

It is not clear who or what killed mysqld. The syslogs do not seem to point to the OOM killer.

We can see that the donor node seemed to get stuck in the donor state due to this failure. So when the joiner was restarted and tried to request an SST again, the donor node would not allow it, because the node believed that it was already serving an SST. The donor's log contained a lot of messages like the following:

2018-02-28 17:02:29 139977323177728 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2018-02-28 17:02:30 139977323177728 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2018-02-28 17:02:31 139977333667584 [Note] WSREP: (a4daa93d, 'tcp://0.0.0.0:4567') turning message relay requesting off
2018-02-28 17:02:31 139977323177728 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable

It looks like the problem on the donor fixed itself after over an hour:

2018-02-28 18:09:36 139977323177728 [Warning] WSREP: Could not find peer: 265867d1-1ca8-11e8-afe2-eaff2a8edd28
2018-02-28 18:09:36 139977323177728 [Note] WSREP: 1.1 (node000000499.domain.com): State transfer to -1.-1 (left the group) complete.
2018-02-28 18:09:36 139977323177728 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 564431662)
WSREP_SST: [INFO] Cleaning up temporary directories (20180228 18:09:36.440)
2018-02-28 18:09:36 139977323177728 [Note] WSREP: Member 1.1 (node000000499.domain.com) synced with group.
2018-02-28 18:09:36 139977323177728 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 564431662)



 Comments   
Comment by Sachin Setiya (Inactive) [ 2018-04-30 ]

Hi GeoffMontee

Can I get the cnf files for the nodes?

Comment by Jan Lindström (Inactive) [ 2018-05-08 ]

seppo, I will assign this issue to you, as I do not know what is going on here after reading everything we have.

Comment by Seppo Jaakola [ 2018-06-08 ]

Is this installation using systemd? The scenario could be that systemd killed an inactive mysqld on the joiner node, but this did not stop the receiving socat stream. So from the donor's (SST script's) perspective, the joiner was fine. Otherwise socat would have ended with a broken pipe error.

Comment by Geoff Montee (Inactive) [ 2018-06-08 ]

This installation was not using systemd.

Comment by Seppo Jaakola [ 2018-06-12 ]

So it remains a mystery why mysqld died on the joiner node in the first place.

Because socat could keep on streaming on the donor node, I assume that the SST joiner process was still executing on the joiner node, even though mysqld had shut down. Such a scenario can be detected and prevented; I will create a pull request with a proposed fix.

Comment by Geoff Montee (Inactive) [ 2018-06-12 ]

Because socat could keep on streaming on the donor node, I assume that the SST joiner process was still executing on the joiner node, even though mysqld had shut down.

I had the same suspicion, but it doesn't look like that was the case because the joiner was able to start another wsrep_sst_xtrabackup-v2 process immediately after restarting:

2018-02-28 17:02:29 139852185578240 [Warning] WSREP: Gap in state sequence. Need state transfer.
2018-02-28 17:02:29 139850384668416 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.40.114.77' --datadir '/mariadb/data/'   --parent '12252' --binlog '/mariadb/binlogs/mysql-bin' '
WSREP_SST: [INFO] Streaming with xbstream (20180228 17:02:29.386)
WSREP_SST: [INFO] Using socat as streamer (20180228 17:02:29.389)
WSREP_SST: [INFO] Using openssl based encryption with socat: with key and crt (20180228 17:02:29.394)
WSREP_SST: [INFO] Decrypting with cert=/mariadb/conf/mariadbSST.pem, key=/mariadb/conf/mariadbSST.pem, cafile=/mariadb/source/dbautils/templates//etc/ca.pem (20180228 17:02:29.396)
WSREP_SST: [INFO] Stale sst_in_progress file: /mariadb/data//sst_in_progress (20180228 17:02:29.400)
WSREP_SST: [INFO] Evaluating timeout -s9 100 socat -u openssl-listen:4444,reuseaddr,cert=/mariadb/conf/mariadbSST.pem,key=/mariadb/conf/mariadbSST.pem,cafile=/mariadb/source/dbautils/templates//etc/ca.pem stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20180228 17:02:29.436)
2018-02-28 17:02:29 139852185578240 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.40.114.77:4444/xtrabackup_sst//1
2018-02-28 17:02:29 139852185578240 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-02-28 17:02:29 139852185578240 [Note] WSREP: REPL Protocols: 7 (3, 2)
2018-02-28 17:02:29 139850485020416 [Note] WSREP: Service thread queue flushed.
2018-02-28 17:02:29 139852185578240 [Note] WSREP: Assign initial position for certification: 564420536, protocol version: 3
2018-02-28 17:02:29 139850485020416 [Note] WSREP: Service thread queue flushed.
2018-02-28 17:02:29 139852185578240 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (718503a8-3c6d-11e6-843e-9e8fbaf5936d): 1 (Operation not permitted)
	 at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2018-02-28 17:02:29 139850416121600 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable

If the old SST process were still running, then I would have expected this to fail with an "address already in use" error.

You can see that it picked the same donor node too, but that node was unable to be a donor because it was still supposedly streaming a backup to this node. The log filled up with these "Resource temporarily unavailable" error messages.

2018-02-28 17:02:29 139852185578240 [Note] WSREP: Requesting state transfer failed: -11(Resource temporarily unavailable). Will keep retrying every 1 second(s)
2018-02-28 17:02:30 139850416121600 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2018-02-28 17:02:31 139850416121600 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2018-02-28 17:02:32 139850426611456 [Note] WSREP: (265867d1, 'tcp://0.0.0.0:4567') turning message relay requesting off
2018-02-28 17:02:32 139850416121600 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2018-02-28 17:02:33 139850416121600 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2018-02-28 17:02:34 139850416121600 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2018-02-28 17:02:35 139850416121600 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2018-02-28 17:02:36 139850416121600 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2018-02-28 17:02:37 139850416121600 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2018-02-28 17:02:38 139850416121600 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2018-02-28 17:02:39 139850416121600 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2018-02-28 17:02:40 139850416121600 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2018-02-28 17:02:41 139850416121600 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2018-02-28 17:02:42 139850416121600 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2018-02-28 17:02:43 139850416121600 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2018-02-28 17:02:44 139850416121600 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2018-02-28 17:02:45 139850416121600 [Warning] WSREP: Member 0.0 (node000002512.domain.com) requested state transfer from 'node000000499.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable

Comment by Seppo Jaakola [ 2018-06-12 ]

OK, this joiner-side log complicates the scenario. The joiner was able to establish a socat listen socket, so we can assume that the earlier socat connection had ended at this point. Have you checked the innobackup log files on the donor and joiner sides?

Comment by Geoff Montee (Inactive) [ 2018-06-12 ]

innobackup.backup.log shows the donor supposedly streaming the backup with no errors until 18:09. This corresponds to the time that the donor's state changed back to SYNCED. I don't know how it would have been streaming the backup if mysqld and the wsrep_sst_xtrabackup-v2 process on the joiner were constantly stopping and restarting. Yurchenko previously speculated that it seemed like the donor was streaming to nowhere, and that maybe the socat call made by the donor in the SST script was missing some timeout option.

I believe the joiner's innobackup logs would have been overwritten by the subsequent SST requests.

Comment by Mario Karuza (Inactive) [ 2018-06-19 ]

I have analysed this bug and tried to reproduce it locally. It turns out that it is possible to run two socat processes.

After SIGKILL or SIGUSR1 is sent to mysqld, wsrep_sst_xtrabackup-v2 will continue executing. The initial socat will still receive data from the donor side.

After the mysqld restart, we can open another socat and wait for a connection.

The attached document shows this situation. The question is how the joiner was killed in the first place.

Comment by Geoff Montee (Inactive) [ 2018-06-20 ]

We do not know why mysqld on the joiner was killed. This bug can be fixed without knowing the specific reason that mysqld was killed, right?

Comment by Mario Karuza (Inactive) [ 2018-06-21 ]

The pull request with the fix handles the situation where the mysqld process dies during the SST. This should fix this issue.
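The approach can be sketched roughly as follows. This is a hypothetical illustration, not the actual patch: a helper in the SST script polls the parent mysqld PID (the script already receives it via the --parent argument) and kills the streaming pipeline if mysqld disappears, so the donor's socat ends with a broken pipe instead of streaming to nowhere. The function and variable names here are invented for the sketch.

```shell
#!/bin/bash
# Hypothetical sketch (not the actual MDEV-15442 patch): watch the parent
# mysqld PID and kill the streaming pipeline if mysqld goes away.
monitor_parent() {
    local parent_pid=$1   # mysqld PID, as passed to the script via --parent
    local stream_pid=$2   # PID of the socat/xbstream pipeline
    # kill -0 sends no signal; it only checks that the process still exists.
    while kill -0 "$parent_pid" 2>/dev/null; do
        sleep 1
    done
    # Parent mysqld has died: abort the transfer.
    kill -9 "$stream_pid" 2>/dev/null
}
```

Run in the background alongside the streaming pipeline, this would make a killed mysqld on the joiner visibly break the donor's socat connection.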

Comment by Geoff Montee (Inactive) [ 2018-06-21 ]

Thanks, mkaruza. It looks like jplindst is assigned to the PR, so I've assigned this issue to him as well.

Comment by Geoff Montee (Inactive) [ 2018-12-04 ]

The same user who originally saw this issue with xtrabackup-v2 SSTs on MariaDB 10.1.28 just saw the same issue with mariabackup SSTs on MariaDB 10.1.35.

The SST started normally on the joiner:

2018-12-03 21:20:38 140149631478528 [Warning] WSREP: Gap in state sequence. Need state transfer.
2018-12-03 21:20:38 140149618898688 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '10.1.1.66' --datadir '/mariadb_data/'   --parent '3429' --binlog '/mariadb_instance/binlogs/mysql-bin' '
WSREP_SST: [INFO] Streaming with xbstream (20181203 21:20:39.486)
WSREP_SST: [INFO] Using socat as streamer (20181203 21:20:39.489)
WSREP_SST: [INFO] Using openssl based encryption with socat: with key and crt (20181203 21:20:39.497)
WSREP_SST: [INFO] Decrypting with cert=/mariadb_instance/conf/mariadbSST.pem, key=/mariadb_instance/conf/mariadbSST.pem, cafile=/mariadb_instance/source/dbautils/templates//etc/ca.pem (20181203 21:20:39.501)
WSREP_SST: [INFO] Evaluating timeout -s9 100 socat -u openssl-listen:4444,reuseaddr,cert=/mariadb_instance/conf/mariadbSST.pem,key=/mariadb_instance/conf/mariadbSST.pem,cafile=/mariadb_instance/source/dbautils/templates//etc/ca.pem stdio | mbstream -x; RC=( ${PIPESTATUS[@]} ) (20181203 21:20:39.549)
2018-12-03 21:20:39 140149631478528 [Note] WSREP: Prepared SST request: mariabackup|10.1.1.66:4444/xtrabackup_sst//1
2018-12-03 21:20:39 140149631478528 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-12-03 21:20:39 140149631478528 [Note] WSREP: REPL Protocols: 8 (3, 2)
2018-12-03 21:20:39 140149631478528 [Note] WSREP: Assign initial position for certification: 930107461, protocol version: 3
2018-12-03 21:20:39 140149714421504 [Note] WSREP: Service thread queue flushed.
2018-12-03 21:20:39 140149631478528 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (718503a8-3c6d-11e6-843e-9e8fbaf5936d): 1 (Operation not permitted)
         at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2018-12-03 21:20:39 140149652342528 [Note] WSREP: Member 1.0 (node2.intranet.domain.com) requested state transfer from 'node1.intranet.domain.com'. Selected 0.1 (node1.intranet.domain.com)(SYNCED) as donor.
2018-12-03 21:20:39 140149652342528 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 930107462)
2018-12-03 21:20:39 140149631478528 [Note] WSREP: Requesting state transfer: success, donor: 0
2018-12-03 21:20:39 140149631478528 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> 718503a8-3c6d-11e6-843e-9e8fbaf5936d:930107461
WSREP_SST: [INFO] Proceeding with SST (20181203 21:20:40.635)
WSREP_SST: [INFO] Evaluating socat -u openssl-listen:4444,reuseaddr,cert=/mariadb_instance/conf/mariadbSST.pem,key=/mariadb_instance/conf/mariadbSST.pem,cafile=/mariadb_instance/source/dbautils/templates//etc/ca.pem stdio | mbstream -x; RC=( ${PIPESTATUS[@]} ) (20181203 21:20:40.635)
WSREP_SST: [INFO] Cleaning the existing datadir and innodb-data/log directories (20181203 21:20:40.637)

And the donor:

2018-12-03 21:20:39 140539684869888 [Note] WSREP: Member 1.0 (node2.intranet.domain.com) requested state transfer from 'node1.intranet.domain.com'. Selected 0.1 (node1.intranet.domain.com)(SYNCED) as donor.
2018-12-03 21:20:39 140539684869888 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 930107462)
2018-12-03 21:20:39 140541449579264 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-12-03 21:20:39 140499774064384 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'donor' --address '10.1.1.66:4444/xtrabackup_sst//1' --socket '/mariadb_instance/mysql.sock' --datadir '/mariadb_data/'    --binlog '/mariadb_instance/binlogs/mysql-bin' --gtid '718503a8-3c6d-11e6-843e-9e8fbaf5936d:930107462' --gtid-domain-id '0''
2018-12-03 21:20:39 140541449579264 [Note] WSREP: sst_donor_thread signaled with 0
WSREP_SST: [INFO] Streaming with xbstream (20181203 21:20:40.440)
WSREP_SST: [INFO] Using socat as streamer (20181203 21:20:40.443)
WSREP_SST: [INFO] Using openssl based encryption with socat: with key and crt (20181203 21:20:40.451)
WSREP_SST: [INFO] Encrypting with cert=/mariadb_instance/conf/mariadbSST.pem, key=/mariadb_instance/conf/mariadbSST.pem, cafile=/mariadb_instance/source/dbautils/templates//etc/ca.pem (20181203 21:20:40.454)
WSREP_SST: [INFO] Using /tmp/tmp.OUq1nlyatW as innobackupex temporary directory (20181203 21:20:40.475)
WSREP_SST: [INFO] Streaming GTID file before SST (20181203 21:20:40.478)
WSREP_SST: [INFO] Evaluating mbstream -c ${INFO_FILE} | socat -u stdio openssl-connect:10.1.1.66:4444,cert=/mariadb_instance/conf/mariadbSST.pem,key=/mariadb_instance/conf/mariadbSST.pem,cafile=/mariadb_instance/source/dbautils/templates//etc/ca.pem; RC=( ${PIPESTATUS[@]} ) (20181203 21:20:40.482)
WSREP_SST: [INFO] Sleeping before data transfer for SST (20181203 21:20:40.566)
2018-12-03 21:20:41 140539697555200 [Note] WSREP: (5b5327f9, 'tcp://0.0.0.0:4567') turning message relay requesting off
WSREP_SST: [INFO] Streaming the backup to joiner at 10.1.1.66 4444 (20181203 21:20:50.571)
WSREP_SST: [INFO] Evaluating mariabackup --innobackupex      $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2>${DATA}/innobackup.backup.log | socat -u stdio openssl-connect:10.1.1.66:4444,cert=/mariadb_instance/conf/mariadbSST.pem,key=/mariadb_instance/conf/mariadbSST.pem,cafile=/mariadb_instance/source/dbautils/templates//etc/ca.pem; RC=( ${PIPESTATUS[@]} ) (20181203 21:20:50.574)

But then the SST failed for mysterious reasons on the joiner:

WSREP_SST: [ERROR] Cleanup after exit with status:1 (20181203 21:20:43.759)

And the donor's log clearly shows the node leaving the cluster:

2018-12-03 21:48:16 140539697555200 [Note] WSREP: (5b5327f9, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.1.1.66:4567
2018-12-03 21:48:17 140539697555200 [Note] WSREP: (5b5327f9, 'tcp://0.0.0.0:4567') reconnecting to 857c15f7 (tcp://10.1.1.66:4567), attempt 0
2018-12-03 21:48:20 140539697555200 [Note] WSREP: evs::proto(5b5327f9, OPERATIONAL, view_id(REG,5b5327f9,363)) suspecting node: 857c15f7
2018-12-03 21:48:20 140539697555200 [Note] WSREP: evs::proto(5b5327f9, OPERATIONAL, view_id(REG,5b5327f9,363)) suspected node without join message, declaring inactive
2018-12-03 21:48:21 140539697555200 [Note] WSREP: declaring af7c90af at tcp://10.1.1.77:4567 stable
2018-12-03 21:48:21 140539697555200 [Note] WSREP: Node 5b5327f9 state prim
2018-12-03 21:48:21 140539697555200 [Note] WSREP: view(view_id(PRIM,5b5327f9,364) memb {
        5b5327f9,1
        af7c90af,0
} joined {
} left {
} partitioned {
        857c15f7,0
})

But the donor still remained in the donor state.

This means that when the joiner tried to rejoin and requested another SST, it failed:

2018-12-03 21:49:11 140359709915904 [Warning] WSREP: Gap in state sequence. Need state transfer.
2018-12-03 21:49:11 140357916423936 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '10.1.1.66' --datadir '/mariadb_data/'   --parent '24454' --binlog '/mariadb_instance/binlogs/mysql-bin' '
WSREP_SST: [INFO] Streaming with xbstream (20181203 21:49:11.909)
WSREP_SST: [INFO] Using socat as streamer (20181203 21:49:11.912)
WSREP_SST: [INFO] Using openssl based encryption with socat: with key and crt (20181203 21:49:11.920)
WSREP_SST: [INFO] Decrypting with cert=/mariadb_instance/conf/mariadbSST.pem, key=/mariadb_instance/conf/mariadbSST.pem, cafile=/mariadb_instance/source/dbautils/templates//etc/ca.pem (20181203 21:49:11.923)
WSREP_SST: [INFO] Stale sst_in_progress file: /mariadb_data//sst_in_progress (20181203 21:49:11.928)
WSREP_SST: [INFO] Evaluating timeout -s9 100 socat -u openssl-listen:4444,reuseaddr,cert=/mariadb_instance/conf/mariadbSST.pem,key=/mariadb_instance/conf/mariadbSST.pem,cafile=/mariadb_instance/source/dbautils/templates//etc/ca.pem stdio | mbstream -x; RC=( ${PIPESTATUS[@]} ) (20181203 21:49:11.970)
2018-12-03 21:49:12 140359709915904 [Note] WSREP: Prepared SST request: mariabackup|10.1.1.66:4444/xtrabackup_sst//1
2018-12-03 21:49:12 140359709915904 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-12-03 21:49:12 140359709915904 [Note] WSREP: REPL Protocols: 8 (3, 2)
2018-12-03 21:49:12 140359709915904 [Note] WSREP: Assign initial position for certification: 930110877, protocol version: 3
2018-12-03 21:49:12 140358007752448 [Note] WSREP: Service thread queue flushed.
2018-12-03 21:49:12 140359709915904 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (718503a8-3c6d-11e6-843e-9e8fbaf5936d): 1 (Operation not permitted)
         at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2018-12-03 21:49:12 140357945673472 [Warning] WSREP: Member 1.0 (node2.intranet.domain.com) requested state transfer from 'node1.intranet.domain.com', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2018-12-03 21:49:12 140359709915904 [Note] WSREP: Requesting state transfer failed: -11(Resource temporarily unavailable). Will keep retrying every 1 second(s)

innobackup.backup.log from the donor does show that it was working on the SST for almost 2.5 hours, even though the SST failed on the joiner in under a minute:

181203 21:20:50 innobackupex: Starting the backup operation
...
181203 23:44:39 completed OK!

The original fix for MDEV-15442 added code to the SST scripts that made the SST process on the joiner kill itself if it noticed that the parent mysqld process died. Shouldn't the donor side have a check that is similar? Like, if a node leaves the cluster, and one of the remaining live nodes was acting as an SST donor for that dead node, shouldn't the live node kill its SST process? I understand that usually these kinds of issues should be prevented when the donor node fails with broken pipe or timeout errors. I'm not sure why that isn't happening in this case.

Comment by Julien Fritsch [ 2018-12-07 ]

Thank you very much GeoffMontee

Comment by Jan Lindström (Inactive) [ 2019-01-28 ]

This is problematic, as the same donor could be serving several joiners. To fix this properly, there should be a watchdog with which the donor and joiner would register, and which would regularly monitor whether both are alive and, if not, kill the other. Currently, there is no such watchdog. As the joiner is not part of the cluster before it has received the state from the donor, it will time out when configured to and stop. The donor is part of the cluster, but if that node is not reachable by the joiner, we can't add more nodes to the cluster. I do not know; we could call this "works as designed", i.e. after the timeout the joiner will give up and not come up.

Comment by Geoff Montee (Inactive) [ 2019-01-29 ]

This is problematic, as the same donor could be serving several joiners.

As far as I know, a donor node can only serve an SST to one joiner node at a time with the current implementation of Galera.

To fix this properly, there should be a watchdog with which the donor and joiner would register, and which would regularly monitor whether both are alive and, if not, kill the other. Currently, there is no such watchdog.

It sounds like this "watchdog" refers to a third server that is neither the donor node nor the joiner node. This seems like overkill.

To fix this, it seems like we just need to implement something like the following:

  • Let's say that node X is acting as SST donor for node Y.
  • Node X starts up a wsrep_sst_mariabackup process to serve a backup to node Y.
  • Node X notices that node Y has left the cluster.
  • Node X notices that its own wsrep_sst_mariabackup process is still running.
  • Node X knows that its own wsrep_sst_mariabackup process can't do anything useful if node Y is no longer in the cluster.
  • Node X kills its own wsrep_sst_mariabackup process.

Is there something wrong with this approach? It seems fairly straightforward.
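As a hedged illustration of the steps above (this is not existing MariaDB code), the donor-side logic could look something like the following shell watchdog. joiner_in_cluster is a stand-in for a real membership check, for example one that inspects wsrep_incoming_addresses via the mysql client; all names here are invented.

```shell
#!/bin/bash
# Hypothetical donor-side watchdog (not existing MariaDB code): while the
# local SST process runs, keep checking whether the joiner is still in the
# cluster view; if it is gone, the stream can no longer be useful, so
# terminate the SST process.
donor_sst_watchdog() {
    local sst_pid=$1      # PID of the wsrep_sst_mariabackup process
    local joiner_addr=$2  # address of the joiner node being served
    while kill -0 "$sst_pid" 2>/dev/null; do
        if ! joiner_in_cluster "$joiner_addr"; then
            # Joiner left the cluster: kill the now-pointless SST process.
            kill -TERM "$sst_pid" 2>/dev/null
            return
        fi
        sleep 1
    done
}
```

The point of the sketch is only that the check needs two inputs the donor already has: the SST script's PID and the cluster view.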

Comment by Geoff Montee (Inactive) [ 2019-01-29 ]

Correct, the donor can serve only one joiner at a time. However, here is the problem: in your example, joiner Y is not part of the cluster before it has received the state. So node X will not see that Y has left the cluster, as Y has not yet joined. From node X's point of view, the only way to know is when the timeout happens.

That's not true though. A joiner node joins the cluster before it even requests its SST. The joiner node just doesn't transition to the SYNCED state until after it has successfully received an SST.

For example, here, we see a joiner node join the cluster:

2018-12-03  4:29:52 140673198053408 [Note] WSREP: gcomm: connecting to group 'MariaDBCluster', peer 'node000002512.domain.com:,node000000499.domain.com:,node000002513.domain.com:'
2018-12-03  4:29:52 140673198053408 [Note] WSREP: (13c0b219, 'tcp://0.0.0.0:4567') connection established to 13c0b219 tcp://10.1.1.66:4567
2018-12-03  4:29:52 140673198053408 [Warning] WSREP: (13c0b219, 'tcp://0.0.0.0:4567') address 'tcp://10.1.1.66:4567' points to own listening address, blacklisting
2018-12-03  4:29:52 140673198053408 [Note] WSREP: (13c0b219, 'tcp://0.0.0.0:4567') connection established to af7c90af tcp://10.1.1.77:4567
2018-12-03  4:29:52 140673198053408 [Note] WSREP: (13c0b219, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
2018-12-03  4:29:52 140673198053408 [Note] WSREP: (13c0b219, 'tcp://0.0.0.0:4567') connection established to 5b5327f9 tcp://10.1.1.21:4567
2018-12-03  4:29:54 140673198053408 [Note] WSREP: declaring 5b5327f9 at tcp://10.1.1.21:4567 stable
2018-12-03  4:29:54 140673198053408 [Note] WSREP: declaring af7c90af at tcp://10.1.1.77:4567 stable
2018-12-03  4:29:54 140673198053408 [Note] WSREP: Node 5b5327f9 state prim
2018-12-03  4:29:54 140673198053408 [Note] WSREP: view(view_id(PRIM,13c0b219,354) memb {
	13c0b219,0
	5b5327f9,1
	af7c90af,0
} joined {
} left {
} partitioned {
})
2018-12-03  4:29:54 140673198053408 [Note] WSREP: save pc into disk
2018-12-03  4:29:54 140673198053408 [Note] WSREP: gcomm: connected
2018-12-03  4:29:54 140673198053408 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2018-12-03  4:29:54 140673198053408 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2018-12-03  4:29:54 140673198053408 [Note] WSREP: Opened channel 'MariaDBCluster'

And then the joiner node only requests its SST after it has already joined the cluster:

2018-12-03  4:29:54 140671432148736 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 929354715)
2018-12-03  4:29:54 140671411284736 [Note] WSREP: State transfer required: 
	Group state: 718503a8-3c6d-11e6-843e-9e8fbaf5936d:929354715
	Local state: 00000000-0000-0000-0000-000000000000:-1
2018-12-03  4:29:54 140671411284736 [Note] WSREP: New cluster view: global state: 718503a8-3c6d-11e6-843e-9e8fbaf5936d:929354715, view# 115: Primary, number of nodes: 3, my index: 0, protocol version 3
2018-12-03  4:29:54 140671411284736 [Warning] WSREP: Gap in state sequence. Need state transfer.
2018-12-03  4:29:54 140671398704896 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '10.1.1.66' --datadir ' /mariadb_data /'   --parent '9645' --binlog '/mariadb_instance/binlogs/mysql-bin' '

And then if the SST fails on the joiner node:

WSREP_SST: [ERROR] Error while getting data from donor node:  exit codes: 130 1 (20181203 04:36:45.811)
WSREP_SST: [ERROR] xtrabackup_checkpoints missing, failed innobackupex/SST on donor (20181203 04:36:45.814)
WSREP_SST: [ERROR] Cleanup after exit with status:2 (20181203 04:36:45.817)
2018-12-03  4:36:45 140671398704896 [ERROR] WSREP: Process completed with error: wsrep_sst_mariabackup --role 'joiner' --address '10.1.1.66' --datadir ' /mariadb_data /'   --parent '9645' --binlog '/mariadb_instance/binlogs/mysql-bin' : 2 (No such file or directory)
2018-12-03  4:36:45 140671398704896 [ERROR] WSREP: Failed to read uuid:seqno and wsrep_gtid_domain_id from joiner script.
2018-12-03  4:36:45 140673198053408 [ERROR] WSREP: SST failed: 2 (No such file or directory)
2018-12-03  4:36:45 140673198053408 [ERROR] Aborting

Then the donor's log shows the joiner node leaving the cluster:

2018-12-03  4:36:46 140539697555200 [Note] WSREP: view(view_id(PRIM,5b5327f9,355) memb {
	5b5327f9,1
	af7c90af,0
} joined {
} left {
} partitioned {
	13c0b219,0
})
2018-12-03  4:36:46 140539697555200 [Note] WSREP: save pc into disk
2018-12-03  4:36:46 140539697555200 [Note] WSREP: forgetting 13c0b219 (tcp://10.1.1.66:4567)

So clearly:

  • A joiner node is already part of the cluster before it even requests an SST.
  • A donor node sees the joiner node leave the cluster when its SST fails.

From the wsrep_sst_mariabackup script's point of view, it can know on both nodes whether the mysqld process is there or not. But currently it does not know if the network between node X and node Y is unreachable.

I'm not sure that I understand what you're trying to say.

Regardless, I don't expect the wsrep_sst_mariabackup script itself to have enough information to know when a node has left the cluster. That information is known by mysqld. Therefore, this fix would probably have to be implemented in mysqld, not in wsrep_sst_mariabackup.

For example:

  • If node X is acting as an SST donor for node Y, then mysqld on node X will know that it is a SST donor.
  • If node Y leaves the cluster, then mysqld on node X will also know that node Y has left, even if the process running the wsrep_sst_mariabackup script for the SST on node X is still trying to send data to node Y.
  • mysqld on node X should also know the process ID of the process running the wsrep_sst_mariabackup script for the SST.
  • Therefore, if mysqld on node X notices that node Y has left the cluster, and if mysqld on node X also notices that the process running the wsrep_sst_mariabackup script for the SST is still alive, then it should kill that process.

Is it feasible to make mysqld kill a leftover SST process if the SST was associated with a node that is no longer part of the cluster?

Comment by Jan Lindström (Inactive) [ 2019-01-30 ]

seppo Is there anything we could do here?

Comment by Jan Lindström (Inactive) [ 2021-09-21 ]

10.1 is EOL

Generated at Thu Feb 08 08:21:20 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.