[MDEV-18621] wsrep_sst_mariabackup socat dead connection Created: 2019-02-18  Updated: 2023-10-10

Status: Stalled
Project: MariaDB Server
Component/s: Galera SST
Affects Version/s: 10.2.22
Fix Version/s: 10.2

Type: Bug Priority: Major
Reporter: Martin Vit Assignee: Julius Goryavsky
Resolution: Unresolved Votes: 1
Labels: None
Environment: debian 9

 Description   

During SST, once the donor completes the transfer, the socat receiver on the joiner hangs: its TCP connection stays in the ESTABLISHED state even though socat on the donor has already exited. The joiner only proceeds after waiting 7200 seconds (the default timeout for dead TCP connections) or after socat on the joiner is killed manually, at which point SST continues. My current workaround is the following configuration in my.cnf on the joiner:

[sst]
sockopt=,keepalive,keepidle=10,keepintvl=10,keepcnt=2

This closes the dead TCP connection. I suggest putting these keepalive options directly into /usr/bin/wsrep_sst_mariabackup.
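
For readers unfamiliar with these options, the following is a minimal Python sketch of what the socat address options keepalive, keepidle, keepintvl and keepcnt correspond to at the socket level on Linux. The helper name enable_keepalive is hypothetical and this is not code from wsrep_sst_mariabackup; it only illustrates the standard SO_KEEPALIVE, TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT socket options with the values from the workaround above.

```python
import socket


def enable_keepalive(sock, idle=10, interval=10, count=2):
    """Enable TCP keepalive on sock (Linux option names).

    Hypothetical helper: the defaults mirror the [sst] sockopt workaround
    (keepidle=10, keepintvl=10, keepcnt=2).
    """
    # Turn keepalive probing on for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idleness before the first probe is sent.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Seconds between successive probes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Number of unanswered probes before the connection is dropped.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print("keepidle:", s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))
    s.close()
```

With these options set, a blocked read on a connection whose peer has silently gone away fails with an error instead of waiting indefinitely, which is the behaviour the workaround relies on.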

I also suggest investigating why socat on the donor does not send a FIN, or otherwise signal EOF, over the network to the joiner.
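
As a point of comparison, here is a small Python sketch of the expected behaviour when a sender closes its TCP socket normally: the receiver's read returns an empty result (EOF) and the transfer ends cleanly. It uses a loopback connection and a hypothetical function name, stream_until_eof; it does not reproduce the bug and assumes nothing about how socat or the SST script behave internally.

```python
import socket


def stream_until_eof(payload):
    """Send payload over a loopback TCP connection and read it until EOF."""
    # Listen on an ephemeral loopback port (stands in for the joiner).
    listener = socket.create_server(("127.0.0.1", 0))
    # Connect to it (stands in for the donor).
    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()
    listener.close()

    sender.sendall(payload)
    # Closing the socket makes the kernel send a FIN to the peer.
    sender.close()

    received = bytearray()
    while True:
        chunk = receiver.recv(8192)
        if not chunk:
            # An empty read means the peer closed the connection (EOF).
            break
        received.extend(chunk)
    receiver.close()
    return bytes(received)


if __name__ == "__main__":
    data = stream_until_eof(b"x" * 1000)
    print(len(data))  # prints 1000
```

In the reported case the joiner never sees this EOF, so its read stays blocked until a timeout fires.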

Here are some logs:

on Donor:

Feb 18 04:04:50 s1 -innobackupex-backup: [00] 2019-02-18 04:04:50 completed OK!
Feb 18 04:04:50 s1 -wsrep-sst-donor: Total time on donor: 0 seconds
Feb 18 04:04:50 s1 -wsrep-sst-donor: Cleaning up temporary directories

on Joiner:

Feb 18 02:20:31 s3 -wsrep-sst-joiner: Waiting for SST streaming to complete!
Feb 18 04:08:04 s3 -wsrep-sst-joiner: 2019/02/18 04:08:04 socat[20811] E read(7, 0x55845e0c55b0, 8192): Connection timed out
Feb 18 04:08:04 s3 -wsrep-sst-joiner: [00] 2019-02-18 04:08:04 xb_stream_read_chunk(): my_read() failed.
Feb 18 04:08:04 s3 -wsrep-sst-joiner: Error while getting data from donor node: exit codes: 1 1
Feb 18 04:08:04 s3 -wsrep-sst-joiner: Preparing the backup at /data/mysql//.sst
Feb 18 04:08:04 s3 -wsrep-sst-joiner: Evaluating /usr//bin/mariabackup --innobackupex --apply-log $rebuildcmd ${DATA} 2>&1 | logger -p daemon.err -t -innobackupex-apply
Feb 18 04:08:04 s3 -innobackupex-apply: 190218 04:08:04 innobackupex: Starting the apply-log operation

Without the extra socket options (,keepalive,keepidle=10,keepintvl=10,keepcnt=2), the timeout occurs only after about 2 hours instead of this quickly.
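
The difference between the two timeouts can be estimated with a short calculation. The sketch below assumes the usual approximation that a dead peer is detected after the idle period plus one probe interval per unanswered probe, and it assumes the stock Linux sysctl defaults (tcp_keepalive_time=7200, tcp_keepalive_intvl=75, tcp_keepalive_probes=9) for the case without the extra options. The function name is hypothetical and the figures are estimates, not measurements from this report.

```python
def keepalive_detection_seconds(idle, interval, count):
    """Approximate seconds until an idle, dead connection is dropped."""
    # First probe after `idle` seconds, then `count` probes `interval` apart.
    return idle + interval * count


if __name__ == "__main__":
    # Stock Linux defaults (assumed): 7200 s idle, 75 s interval, 9 probes.
    print(keepalive_detection_seconds(7200, 75, 9))  # prints 7875
    # Values from the [sst] sockopt workaround.
    print(keepalive_detection_seconds(10, 10, 2))  # prints 30
```

Under these assumptions the workaround shortens detection from a little over two hours to roughly half a minute of idleness.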



 Comments   
Comment by Jan Lindström (Inactive) [ 2019-06-14 ]

There is a workaround for this problem, so this issue is not critical.

Generated at Thu Feb 08 08:45:29 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.