Details
- 
    
Bug
 - 
    Status: Closed (View Workflow)
 - 
    
Major
 - 
    Resolution: Duplicate
 - 
    10.1.31
 - 
    None
 - 
    Linux observed on both Ubuntu and CentOS
 
Description
upgrading to mariadb-server 10.1.31 break the Galera SST/IST transfer.
The joiner node script gets executed but the donor never starts:
Joiner side:
					2018-02-19 18:07:09 140541460941568 [Note] WSREP: State transfer required:
			 | 
		
						Group state: ba08f7ac-1589-11e8-8944-27e143bb408f:285676
			 | 
		
						Local state: ba08f7ac-1589-11e8-8944-27e143bb408f:273759
			 | 
		
					2018-02-19 18:07:09 140541460941568 [Note] WSREP: New cluster view: global state: ba08f7ac-1589-11e8-8944-27e143bb408f:285676, view# 33: Primary, number of nodes: 3, my index: 0, protocol version 3
			 | 
		
					2018-02-19 18:07:09 140541460941568 [Warning] WSREP: Gap in state sequence. Need state transfer.
			 | 
		
					2018-02-19 18:07:09 140541165565696 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.55.101' --datadir '/var/lib/mysql/'   --parent '4754'  '' '
			 | 
		
					WSREP_SST: [INFO] Streaming with xbstream (20180219 18:07:09.327)
			 | 
		
					WSREP_SST: [INFO] Using socat as streamer (20180219 18:07:09.329)
			 | 
		
					WSREP_SST: [INFO] Stale sst_in_progress file: /var/lib/mysql//sst_in_progress (20180219 18:07:09.332)
			 | 
		
					WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20180219 18:07:09.354)
			 | 
		
					2018-02-19 18:07:11 140541190731520 [Note] WSREP: (b2ffafb5, 'tcp://0.0.0.0:4567') connection to peer b2ffafb5 with addr tcp://192.168.55.101:4567 timed out, no messages seen in PT3S
			 | 
		
					2018-02-19 18:07:11 140541190731520 [Note] WSREP: (b2ffafb5, 'tcp://0.0.0.0:4567') turning message relay requesting off
			 | 
		
					WSREP_SST: [ERROR] Possible timeout in receving first data from donor in gtid stage (20180219 18:08:49.362)
			 | 
		
					WSREP_SST: [ERROR] Cleanup after exit with status:32 (20180219 18:08:49.364)
			 | 
		
					2018-02-19 18:08:49 140541165565696 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.55.101' --datadir '/var/lib/mysql/'   --parent '4754'  ''
			 | 
		
						Read: '(null)'
			 | 
		
					2018-02-19 18:08:49 140541165565696 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.55.101' --datadir '/var/lib/mysql/'   --parent '4754'  '' : 32 (Broken pipe)
			 | 
		
					2018-02-19 18:08:49 140541460941568 [ERROR] WSREP: Failed to prepare for 'xtrabackup-v2' SST. Unrecoverable.
			 | 
		
					2018-02-19 18:08:49 140541460941568 [ERROR] Aborting
			 | 
		
Donor side:
					2018-02-19 18:07:08 139902381455104 [Note] WSREP: (c9241678, ‘tcp://0.0.0.0:4567’) connection established to b2ffafb5 tcp://192.168.55.101:4567
			 | 
		
					2018-02-19 18:07:08 139902381455104 [Note] WSREP: (c9241678, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers:
			 | 
		
					2018-02-19 18:07:08 139902381455104 [Note] WSREP: declaring b2ffafb5 at tcp://192.168.55.101:4567 stable
			 | 
		
					2018-02-19 18:07:08 139902381455104 [Note] WSREP: declaring b91cf7da at tcp://192.168.55.102:4567 stable
			 | 
		
					2018-02-19 18:07:08 139902381455104 [Note] WSREP: Node b91cf7da state prim
			 | 
		
					2018-02-19 18:07:08 139902381455104 [Note] WSREP: view(view_id(PRIM,b2ffafb5,33) memb {
			 | 
		
						b2ffafb5,0
			 | 
		
						b91cf7da,0
			 | 
		
						c9241678,0
			 | 
		
					} joined {
			 | 
		
					} left {
			 | 
		
					} partitioned {
			 | 
		
					})
			 | 
		
					2018-02-19 18:07:08 139902381455104 [Note] WSREP: save pc into disk
			 | 
		
					2018-02-19 18:07:08 139902373062400 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
			 | 
		
					2018-02-19 18:07:08 139902373062400 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
			 | 
		
					2018-02-19 18:07:09 139902373062400 [Note] WSREP: STATE EXCHANGE: sent state msg: b3995870-159f-11e8-8bdd-c675484f2591
			 | 
		
					2018-02-19 18:07:09 139902373062400 [Note] WSREP: STATE EXCHANGE: got state msg: b3995870-159f-11e8-8bdd-c675484f2591 from 0 (galera2)
			 | 
		
					2018-02-19 18:07:09 139902373062400 [Note] WSREP: STATE EXCHANGE: got state msg: b3995870-159f-11e8-8bdd-c675484f2591 from 1 (galera3)
			 | 
		
					2018-02-19 18:07:09 139902373062400 [Note] WSREP: STATE EXCHANGE: got state msg: b3995870-159f-11e8-8bdd-c675484f2591 from 2 (galera1)
			 | 
		
					2018-02-19 18:07:09 139902373062400 [Note] WSREP: Quorum results:
			 | 
		
						version    = 4,
			 | 
		
						component  = PRIMARY,
			 | 
		
						conf_id    = 32,
			 | 
		
						members    = 2/3 (joined/total),
			 | 
		
						act_id     = 285676,
			 | 
		
						last_appl. = 0,
			 | 
		
						protocols  = 0/7/3 (gcs/repl/appl),
			 | 
		
						group UUID = ba08f7ac-1589-11e8-8944-27e143bb408f
			 | 
		
					2018-02-19 18:07:09 139902373062400 [Note] WSREP: Flow-control interval: [28, 28]
			 | 
		
					2018-02-19 18:07:09 139902373062400 [Note] WSREP: Trying to continue unpaused monitor
			 | 
		
					2018-02-19 18:07:09 139902653332224 [Note] WSREP: New cluster view: global state: ba08f7ac-1589-11e8-8944-27e143bb408f:285676, view# 33: Primary, number of nodes: 3, my index: 2, protocol version 3
			 | 
		
					2018-02-19 18:07:09 139902653332224 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
			 | 
		
					2018-02-19 18:07:09 139902653332224 [Note] WSREP: REPL Protocols: 7 (3, 2)
			 | 
		
					2018-02-19 18:07:09 139902653332224 [Note] WSREP: Assign initial position for certification: 285676, protocol version: 3
			 | 
		
					2018-02-19 18:07:09 139902431176448 [Note] WSREP: Service thread queue flushed.
			 | 
		
					2018-02-19 18:07:11 139902381455104 [Note] WSREP: (c9241678, ‘tcp://0.0.0.0:4567’) turning message relay requesting off
			 | 
		
					2018-02-19 18:08:51 139902381455104 [Note] WSREP: declaring b91cf7da at tcp://192.168.55.102:4567 stable
			 | 
		
					2018-02-19 18:08:51 139902381455104 [Note] WSREP: forgetting b2ffafb5 (tcp://192.168.55.101:4567)
			 | 
		
					2018-02-19 18:08:51 139902381455104 [Note] WSREP: Node b91cf7da state prim
			 | 
		
					2018-02-19 18:08:51 139902381455104 [Note] WSREP: view(view_id(PRIM,b91cf7da,34) memb {
			 | 
		
						b91cf7da,0
			 | 
		
						c9241678,0
			 | 
		
					} joined {
			 | 
		
					} left {
			 | 
		
					} partitioned {
			 | 
		
						b2ffafb5,0
			 | 
		
					})
			 | 
		
					2018-02-19 18:08:51 139902381455104 [Note] WSREP: save pc into disk
			 | 
		
					2018-02-19 18:08:51 139902373062400 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
			 | 
		
					2018-02-19 18:08:51 139902373062400 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
			 | 
		
					2018-02-19 18:08:51 139902381455104 [Note] WSREP: forgetting b2ffafb5 (tcp://192.168.55.101:4567)
			 | 
		
					2018-02-19 18:08:51 139902373062400 [Note] WSREP: STATE EXCHANGE: sent state msg: f080587a-159f-11e8-ad75-9fb8ddac3bd7
			 | 
		
					2018-02-19 18:08:51 139902373062400 [Note] WSREP: STATE EXCHANGE: got state msg: f080587a-159f-11e8-ad75-9fb8ddac3bd7 from 0 (galera3)
			 | 
		
					2018-02-19 18:08:51 139902373062400 [Note] WSREP: STATE EXCHANGE: got state msg: f080587a-159f-11e8-ad75-9fb8ddac3bd7 from 1 (galera1)
			 | 
		
					2018-02-19 18:08:51 139902373062400 [Note] WSREP: Quorum results:
			 | 
		
						version    = 4,
			 | 
		
						component  = PRIMARY,
			 | 
		
						conf_id    = 33,
			 | 
		
						members    = 2/2 (joined/total),
			 | 
		
						act_id     = 285676,
			 | 
		
						last_appl. = 0,
			 | 
		
						protocols  = 0/7/3 (gcs/repl/appl),
			 | 
		
						group UUID = ba08f7ac-1589-11e8-8944-27e143bb408f
			 | 
		
					2018-02-19 18:08:51 139902373062400 [Note] WSREP: Flow-control interval: [23, 23]
			 | 
		
					2018-02-19 18:08:51 139902373062400 [Note] WSREP: Trying to continue unpaused monitor
			 | 
		
					2018-02-19 18:08:51 139902653332224 [Note] WSREP: New cluster view: global state: ba08f7ac-1589-11e8-8944-27e143bb408f:285676, view# 34: Primary, number of nodes: 2, my index: 1, protocol version 3
			 | 
		
					2018-02-19 18:08:51 139902653332224 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
			 | 
		
					2018-02-19 18:08:51 139902653332224 [Note] WSREP: REPL Protocols: 7 (3, 2)
			 | 
		
					2018-02-19 18:08:51 139902653332224 [Note] WSREP: Assign initial position for certification: 285676, protocol version: 3
			 | 
		
					2018-02-19 18:08:51 139902431176448 [Note] WSREP: Service thread queue flushed.
			 | 
		
					2018-02-19 18:08:56 139902381455104 [Note] WSREP:  cleaning up b2ffafb5 (tcp://192.168.55.101:4567)
			 | 
		
I traced it back to the function wait_for_listen in the wsrep_sst_xtrabackup-v2 script
v10.1.30 that works is
					# waits ~1 minute for nc/socat to open the port and then reports ready
			 | 
		
					# (regardless of timeout)
			 | 
		
					wait_for_listen()
			 | 
		
					{
			 | 
		
					    local HOST=$1
			 | 
		
					    local PORT=$2
			 | 
		
					    local MODULE=$3
			 | 
		
| 
					 | 
		
					    for i in {1..300}
			 | 
		
					    do
			 | 
		
					        ss -p state listening "( sport = :$PORT )" | grep -qE 'socat|nc' && break
			 | 
		
					        sleep 0.2
			 | 
		
					    done
			 | 
		
| 
					 | 
		
					    echo "ready ${HOST}:${PORT}/${MODULE}//$sst_ver"
			 | 
		
					}
			 | 
		
v10.1.31 that is broken
					# waits ~1 minute for nc/socat to open the port and then reports ready
			 | 
		
					# (regardless of timeout)
			 | 
		
					wait_for_listen()
			 | 
		
					{
			 | 
		
					    local HOST=$1
			 | 
		
					    local PORT=$2
			 | 
		
					    local MODULE=$3
			 | 
		
					    local LSOF_OUT
			 | 
		
| 
					 | 
		
					    for i in {1..300}
			 | 
		
					    do
			 | 
		
					        LSOF_OUT=$(lsof -sTCP:LISTEN -i TCP:${PORT} -a -c nc -c socat -F c)
			 | 
		
					        [ -n "${LSOF_OUT}" ] && break
			 | 
		
					        sleep 0.2
			 | 
		
					    done
			 | 
		
| 
					 | 
		
					    echo "ready ${HOST}:${PORT}/${MODULE}//${WSREP_SST_OPT_SST_VER:-1}"
			 | 
		
					}
			 | 
		
Attachments
Issue Links
- is caused by
 - 
                    
MDEV-13789 mariabackup galera SST fail
-         
 - Closed
 
 -         
 
- relates to
 - 
                    
MDEV-15254 10.1.31 does not join an existing cluster with SST xtrabackup-v2
-         
 - Closed
 
 -