[MDEV-22908] After OOM event on one node, entire cluster becomes NON-PRIMARY Created: 2020-06-16  Updated: 2021-09-24  Resolved: 2021-03-28

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.4.13
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Rick Pizzi Assignee: Jan Lindström (Inactive)
Resolution: Incomplete Votes: 3
Labels: need_feedback

Issue Links:
Relates
relates to MDEV-22724 Galera cluster losing primary component Closed
relates to MDEV-22998 memory leak in Galera-4 Closed

 Description   

One customer of ours had an OOM event on one of its 3 nodes galera cluster.
When then node restarted, entire cluster became non-primary, causing a total outage; it needed a manual shutdown of entire cluster to recover.

I was able to reproduce the situation in my lab: 3 VMs running MariaDB 10.4.13, on one of the nodes I assigned a very small amount of memory and then ran some inserts and selects until the OOM killer fired and killed the mysqld process.

Process was restarted by systemd but when it came back, entire cluster went into non-primary. This is exactly what happened to the customer.
Said customer is also experiencing a memory leak in wsrep library (we are still trying to reproduce that) so that's a bug that triggered another bug it seems.

Needless to say, this is a very critical issue.
I am attaching logs from the 3 nodes, plus OOM event logs from node 3.

Environment: 3 x Centos7 VMs, 2G, 2G, 128M memory (to reproduce the issue; happens with any memory size when OOM fires).

PLEASE NOTE: the hostnames of the nodes mention docker but docker is not used in this test, I just recycled some VMs.

Node 1 log

2020-06-16  9:45:55 0 [Note] WSREP: (9a21602a-96c2, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.2.92:4567 
2020-06-16  9:45:56 0 [Note] WSREP: (9a21602a-96c2, 'tcp://0.0.0.0:4567') reconnecting to 34ea9378-9a6c (tcp://192.168.2.92:4567), attempt 0
2020-06-16  9:45:59 0 [Note] WSREP: evs::proto(9a21602a-96c2, OPERATIONAL, view_id(REG,2ef04c9f-96d5,153)) suspecting node: 34ea9378-9a6c
2020-06-16  9:45:59 0 [Note] WSREP: evs::proto(9a21602a-96c2, OPERATIONAL, view_id(REG,2ef04c9f-96d5,153)) suspected node without join message, declaring inactive
2020-06-16  9:46:07 0 [Warning] WSREP: evs::proto(9a21602a-96c2, GATHER, view_id(REG,2ef04c9f-96d5,153)) install timer expired
evs::proto(evs::proto(9a21602a-96c2, GATHER, view_id(REG,2ef04c9f-96d5,153)), GATHER) {
current_view=view(view_id(REG,2ef04c9f-96d5,153) memb {
	2ef04c9f-96d5,0
	34ea9378-9a6c,0
	9a21602a-96c2,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=2091,safe_seq=2091,node_index=node: {idx=0,range=[2098,2097],safe_seq=2091} node: {idx=1,range=[2092,2091],safe_seq=2091} node: {idx=2,range=[2098,2097],safe_seq=2091} },
fifo_seq=11372,
last_sent=2097,
known:
2ef04c9f-96d5 at tcp://192.168.2.91:4567
{o=1,s=0,i=0,fs=12032,jm=
{v=1,t=4,ut=255,o=1,s=2091,sr=-1,as=2091,f=4,src=2ef04c9f-96d5,srcvid=view_id(REG,2ef04c9f-96d5,153),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=12032,nl=(
	2ef04c9f-96d5, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,2ef04c9f-96d5,153),ss=2091,ir=[2098,2097],}
	34ea9378-9a6c, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,2ef04c9f-96d5,153),ss=2091,ir=[2092,2091],}
	9a21602a-96c2, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,2ef04c9f-96d5,153),ss=2091,ir=[2098,2097],}
)
},
}
34ea9378-9a6c at tcp://192.168.2.92:4567
{o=0,s=1,i=0,fs=4630,}
9a21602a-96c2 at 
{o=1,s=0,i=0,fs=-1,jm=
{v=1,t=4,ut=255,o=1,s=2091,sr=-1,as=2091,f=0,src=9a21602a-96c2,srcvid=view_id(REG,2ef04c9f-96d5,153),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=11372,nl=(
	2ef04c9f-96d5, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,2ef04c9f-96d5,153),ss=2091,ir=[2098,2097],}
	34ea9378-9a6c, {o=0,s=1,e=0,ls=-1,vid=view_id(REG,2ef04c9f-96d5,153),ss=2091,ir=[2092,2091],}
	9a21602a-96c2, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,2ef04c9f-96d5,153),ss=2091,ir=[2098,2097],}
)
},
}
 }
2020-06-16  9:46:07 0 [Note] WSREP: no install message received
2020-06-16  9:46:07 0 [Note] WSREP: view(view_id(NON_PRIM,2ef04c9f-96d5,153) memb {
	9a21602a-96c2,0
} joined {
} left {
} partitioned {
	2ef04c9f-96d5,0
	34ea9378-9a6c,0
})
2020-06-16  9:46:07 0 [Note] WSREP: view(view_id(NON_PRIM,9a21602a-96c2,154) memb {
	9a21602a-96c2,0
} joined {
} left {
} partitioned {
	2ef04c9f-96d5,0
	34ea9378-9a6c,0
})
2020-06-16  9:46:07 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2020-06-16  9:46:07 0 [Note] WSREP: Flow-control interval: [16, 16]
2020-06-16  9:46:07 0 [Note] WSREP: Received NON-PRIMARY.
2020-06-16  9:46:07 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 130)
2020-06-16  9:46:07 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2020-06-16  9:46:07 0 [Note] WSREP: Flow-control interval: [16, 16]
2020-06-16  9:46:07 0 [Note] WSREP: Received NON-PRIMARY.
2020-06-16  9:46:07 2 [Note] WSREP: ================================================
View:
  id: cf61cf68-aef7-11ea-88db-1bc466429584:-1
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(1):
	0: 9a21602a-afa0-11ea-96c2-5f57fad80d47, docker1
=================================================
2020-06-16  9:46:07 2 [Note] WSREP: Non-primary view
2020-06-16  9:46:07 2 [Note] WSREP: Server status change synced -> connected
2020-06-16  9:46:07 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-16  9:46:07 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-16  9:46:07 2 [Note] WSREP: ================================================
View:
  id: cf61cf68-aef7-11ea-88db-1bc466429584:-1
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(1):
	0: 9a21602a-afa0-11ea-96c2-5f57fad80d47, docker1
=================================================
2020-06-16  9:46:07 2 [Note] WSREP: Non-primary view
2020-06-16  9:46:07 2 [Note] WSREP: Server status change connected -> connected
2020-06-16  9:46:07 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-16  9:46:07 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-16  9:46:11 0 [Note] WSREP: declaring 2ef04c9f-96d5 at tcp://192.168.2.91:4567 stable
2020-06-16  9:46:11 0 [Note] WSREP: view(view_id(NON_PRIM,2ef04c9f-96d5,155) memb {
	2ef04c9f-96d5,0
	9a21602a-96c2,0
} joined {
} left {
} partitioned {
	34ea9378-9a6c,0
})
2020-06-16  9:46:11 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 1, memb_num = 2
2020-06-16  9:46:11 0 [Note] WSREP: Flow-control interval: [23, 23]
2020-06-16  9:46:11 0 [Note] WSREP: Received NON-PRIMARY.
2020-06-16  9:46:11 2 [Note] WSREP: ================================================
View:
  id: cf61cf68-aef7-11ea-88db-1bc466429584:-1
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 1
  members(2):
	0: 2ef04c9f-afa0-11ea-96d5-223dafaac5ab, unspecified
	1: 9a21602a-afa0-11ea-96c2-5f57fad80d47, docker1
=================================================
2020-06-16  9:46:11 2 [Note] WSREP: Non-primary view
2020-06-16  9:46:11 2 [Note] WSREP: Server status change connected -> connected
2020-06-16  9:46:11 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-16  9:46:11 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-16  9:46:30 0 [Note] WSREP: (9a21602a-96c2, 'tcp://0.0.0.0:4567') reconnecting to 34ea9378-9a6c (tcp://192.168.2.92:4567), attempt 30
2020-06-16  9:46:34 0 [Note] WSREP: (9a21602a-96c2, 'tcp://0.0.0.0:4567') connection established to 34ea9378-9a6d tcp://192.168.2.92:4567
2020-06-16  9:46:34 0 [Note] WSREP: remote endpoint tcp://192.168.2.92:4567 changed identity 34ea9378-afa4-11ea-9a6c-db5f63f7a91e -> 34ea9378-afa4-11ea-9a6d-db5f63f7a91e
2020-06-16  9:46:34 0 [Note] WSREP: declaring 2ef04c9f-96d5 at tcp://192.168.2.91:4567 stable
2020-06-16  9:46:34 0 [Note] WSREP: declaring 34ea9378-9a6d at tcp://192.168.2.92:4567 stable
2020-06-16  9:46:34 0 [Note] WSREP: view(view_id(NON_PRIM,2ef04c9f-96d5,156) memb {
	2ef04c9f-96d5,0
	34ea9378-9a6d,0
	9a21602a-96c2,0
} joined {
} left {
} partitioned {
	34ea9378-9a6c,0
})
2020-06-16  9:46:34 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 2, memb_num = 3
2020-06-16  9:46:34 0 [Note] WSREP: Flow-control interval: [28, 28]
2020-06-16  9:46:34 0 [Note] WSREP: Received NON-PRIMARY.
2020-06-16  9:46:34 2 [Note] WSREP: ================================================
View:
  id: cf61cf68-aef7-11ea-88db-1bc466429584:-1
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 2
  members(3):
	0: 2ef04c9f-afa0-11ea-96d5-223dafaac5ab, unspecified
	1: 34ea9378-afa4-11ea-9a6d-db5f63f7a91e, unspecified
	2: 9a21602a-afa0-11ea-96c2-5f57fad80d47, docker1
=================================================
2020-06-16  9:46:34 2 [Note] WSREP: Non-primary view
2020-06-16  9:46:34 2 [Note] WSREP: Server status change connected -> connected
2020-06-16  9:46:34 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-16  9:46:34 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-16  9:46:37 0 [Note] WSREP: (9a21602a-96c2, 'tcp://0.0.0.0:4567') turning message relay requesting off

Node 2 log

2020-06-16  9:45:55 0 [Note] WSREP: (2ef04c9f-96d5, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.2.92:4567 
2020-06-16  9:45:56 0 [Note] WSREP: (2ef04c9f-96d5, 'tcp://0.0.0.0:4567') reconnecting to 34ea9378-9a6c (tcp://192.168.2.92:4567), attempt 0
2020-06-16  9:46:09 0 [Note] WSREP: evs::proto(2ef04c9f-96d5, GATHER, view_id(REG,2ef04c9f-96d5,153)) suspecting node: 34ea9378-9a6c
2020-06-16  9:46:09 0 [Note] WSREP: evs::proto(2ef04c9f-96d5, GATHER, view_id(REG,2ef04c9f-96d5,153)) suspected node without join message, declaring inactive
2020-06-16  9:46:10 0 [Note] WSREP: view(view_id(NON_PRIM,2ef04c9f-96d5,153) memb {
	2ef04c9f-96d5,0
} joined {
} left {
} partitioned {
	34ea9378-9a6c,0
	9a21602a-96c2,0
})
2020-06-16  9:46:10 0 [Note] WSREP: view(view_id(NON_PRIM,2ef04c9f-96d5,154) memb {
	2ef04c9f-96d5,0
} joined {
} left {
} partitioned {
	34ea9378-9a6c,0
	9a21602a-96c2,0
})
2020-06-16  9:46:10 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2020-06-16  9:46:10 0 [Note] WSREP: Flow-control interval: [256, 256]
2020-06-16  9:46:10 0 [Note] WSREP: Received NON-PRIMARY.
2020-06-16  9:46:10 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 130)
2020-06-16  9:46:10 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2020-06-16  9:46:10 0 [Note] WSREP: Flow-control interval: [256, 256]
2020-06-16  9:46:10 0 [Note] WSREP: Received NON-PRIMARY.
2020-06-16  9:46:10 14 [Note] WSREP: ================================================
View:
  id: cf61cf68-aef7-11ea-88db-1bc466429584:-1
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(1):
	0: 2ef04c9f-afa0-11ea-96d5-223dafaac5ab, docker2
=================================================
2020-06-16  9:46:10 14 [Note] WSREP: Non-primary view
2020-06-16  9:46:10 14 [Note] WSREP: Server status change synced -> connected
2020-06-16  9:46:10 14 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-16  9:46:10 14 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-16  9:46:10 14 [Note] WSREP: ================================================
View:
  id: cf61cf68-aef7-11ea-88db-1bc466429584:-1
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(1):
	0: 2ef04c9f-afa0-11ea-96d5-223dafaac5ab, docker2
=================================================
2020-06-16  9:46:10 14 [Note] WSREP: Non-primary view
2020-06-16  9:46:10 14 [Note] WSREP: Server status change connected -> connected
2020-06-16  9:46:10 14 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-16  9:46:10 14 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-16  9:46:11 0 [Note] WSREP: declaring 9a21602a-96c2 at tcp://192.168.2.90:4567 stable
2020-06-16  9:46:11 0 [Note] WSREP: view(view_id(NON_PRIM,2ef04c9f-96d5,155) memb {
	2ef04c9f-96d5,0
	9a21602a-96c2,0
} joined {
} left {
} partitioned {
	34ea9378-9a6c,0
})
2020-06-16  9:46:11 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 2
2020-06-16  9:46:11 0 [Note] WSREP: Flow-control interval: [256, 256]
2020-06-16  9:46:11 0 [Note] WSREP: Received NON-PRIMARY.
2020-06-16  9:46:11 14 [Note] WSREP: ================================================
View:
  id: cf61cf68-aef7-11ea-88db-1bc466429584:-1
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(2):
	0: 2ef04c9f-afa0-11ea-96d5-223dafaac5ab, docker2
	1: 9a21602a-afa0-11ea-96c2-5f57fad80d47, unspecified
=================================================
2020-06-16  9:46:11 14 [Note] WSREP: Non-primary view
2020-06-16  9:46:11 14 [Note] WSREP: Server status change connected -> connected
2020-06-16  9:46:11 14 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-16  9:46:11 14 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-16  9:46:31 0 [Note] WSREP: (2ef04c9f-96d5, 'tcp://0.0.0.0:4567') reconnecting to 34ea9378-9a6c (tcp://192.168.2.92:4567), attempt 30
2020-06-16  9:46:34 0 [Note] WSREP: (2ef04c9f-96d5, 'tcp://0.0.0.0:4567') connection established to 34ea9378-9a6d tcp://192.168.2.92:4567
2020-06-16  9:46:34 0 [Note] WSREP: remote endpoint tcp://192.168.2.92:4567 changed identity 34ea9378-afa4-11ea-9a6c-db5f63f7a91e -> 34ea9378-afa4-11ea-9a6d-db5f63f7a91e
2020-06-16  9:46:34 0 [Note] WSREP: declaring 34ea9378-9a6d at tcp://192.168.2.92:4567 stable
2020-06-16  9:46:34 0 [Note] WSREP: declaring 9a21602a-96c2 at tcp://192.168.2.90:4567 stable
2020-06-16  9:46:34 0 [Note] WSREP: view(view_id(NON_PRIM,2ef04c9f-96d5,156) memb {
	2ef04c9f-96d5,0
	34ea9378-9a6d,0
	9a21602a-96c2,0
} joined {
} left {
} partitioned {
	34ea9378-9a6c,0
})
2020-06-16  9:46:34 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 3
2020-06-16  9:46:34 0 [Note] WSREP: Flow-control interval: [256, 256]
2020-06-16  9:46:34 0 [Note] WSREP: Received NON-PRIMARY.
2020-06-16  9:46:34 14 [Note] WSREP: ================================================
View:
  id: cf61cf68-aef7-11ea-88db-1bc466429584:-1
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(3):
	0: 2ef04c9f-afa0-11ea-96d5-223dafaac5ab, docker2
	1: 34ea9378-afa4-11ea-9a6d-db5f63f7a91e, unspecified
	2: 9a21602a-afa0-11ea-96c2-5f57fad80d47, unspecified
=================================================
2020-06-16  9:46:34 14 [Note] WSREP: Non-primary view
2020-06-16  9:46:34 14 [Note] WSREP: Server status change connected -> connected
2020-06-16  9:46:34 14 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-16  9:46:34 14 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-16  9:46:37 0 [Note] WSREP: (2ef04c9f-96d5, 'tcp://0.0.0.0:4567') turning message relay requesting off

Node 3 log ( OOM happened here)

wsrep loader: [INFO] wsrep_load(): loading provider library '/usr/lib64/galera-4/libgalera_smm.so'
wsrep loader: [INFO] wsrep_load(): Galera 26.4.4(r4599) by Codership Oy <info@codership.com> loaded successfully.
2020-06-16  9:46:33 0 [Note] WSREP: CRC-32C: using hardware acceleration.
2020-06-16  9:46:33 0 [Note] WSREP: Found saved state: cf61cf68-aef7-11ea-88db-1bc466429584:-1, safe_to_bootstrap: 0
2020-06-16  9:46:33 0 [Note] WSREP: GCache DEBUG: opened preamble:
Version: 2
UUID: cf61cf68-aef7-11ea-88db-1bc466429584
Seqno: -1 - -1
Offset: -1
Synced: 0
2020-06-16  9:46:33 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: cf61cf68-aef7-11ea-88db-1bc466429584, offset: -1
2020-06-16  9:46:33 0 [Note] WSREP: GCache::RingBuffer initial scan...  0.0% (        0/134217752 bytes) complete.
2020-06-16  9:46:33 0 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (134217752/134217752 bytes) complete.
2020-06-16  9:46:33 0 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 16-130
2020-06-16  9:46:34 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...  0.0% (       0/20643208 bytes) complete.
2020-06-16  9:46:34 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...100.0% (20643208/20643208 bytes) complete.
2020-06-16  9:46:34 0 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): found 2/117 locked buffers
2020-06-16  9:46:34 0 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): used space: 20643208/134217728
2020-06-16  9:46:34 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.2.92; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT30S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT15S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 256; gcs.fc_master_slave = yes; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.check...
2020-06-16  9:46:34 0 [Note] WSREP: Service thread queue flushed.
2020-06-16  9:46:34 0 [Note] WSREP: ####### Assign initial position for certification: cf61cf68-aef7-11ea-88db-1bc466429584:130, protocol version: -1
2020-06-16  9:46:34 0 [Note] WSREP: Start replication
2020-06-16  9:46:34 0 [Note] WSREP: Connecting with bootstrap option: 0
2020-06-16  9:46:34 0 [Note] WSREP: Setting GCS initial position to cf61cf68-aef7-11ea-88db-1bc466429584:130
2020-06-16  9:46:34 0 [Note] WSREP: protonet asio version 0
2020-06-16  9:46:34 0 [Note] WSREP: Using CRC-32C for message checksums.
2020-06-16  9:46:34 0 [Note] WSREP: backend: asio
2020-06-16  9:46:34 0 [Note] WSREP: gcomm thread scheduling priority set to other:0 
2020-06-16  9:46:34 0 [Note] WSREP: restore pc from disk successfully
2020-06-16  9:46:34 0 [Note] WSREP: GMCast version 0
2020-06-16  9:46:34 0 [Note] WSREP: (34ea9378-9a6d, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2020-06-16  9:46:34 0 [Note] WSREP: (34ea9378-9a6d, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2020-06-16  9:46:34 0 [Note] WSREP: EVS version 1
2020-06-16  9:46:34 0 [Note] WSREP: gcomm: connecting to group 'rick's lab', peer '192.168.2.90:'
2020-06-16  9:46:34 0 [Note] WSREP: (34ea9378-9a6d, 'tcp://0.0.0.0:4567') connection established to 9a21602a-96c2 tcp://192.168.2.90:4567
2020-06-16  9:46:34 0 [Note] WSREP: (34ea9378-9a6d, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.2.91:4567 
2020-06-16  9:46:34 0 [Note] WSREP: (34ea9378-9a6d, 'tcp://0.0.0.0:4567') connection established to 2ef04c9f-96d5 tcp://192.168.2.91:4567
2020-06-16  9:46:34 0 [Note] WSREP: EVS version upgrade 0 -> 1
2020-06-16  9:46:34 0 [Note] WSREP: declaring 2ef04c9f-96d5 at tcp://192.168.2.91:4567 stable
2020-06-16  9:46:34 0 [Note] WSREP: declaring 9a21602a-96c2 at tcp://192.168.2.90:4567 stable
2020-06-16  9:46:34 0 [Note] WSREP: PC protocol upgrade 0 -> 1
2020-06-16  9:46:34 0 [Note] WSREP: view(view_id(NON_PRIM,2ef04c9f-96d5,156) memb {
	2ef04c9f-96d5,0
	34ea9378-9a6d,0
	9a21602a-96c2,0
} joined {
} left {
} partitioned {
	34ea9378-9a6c,0
})
2020-06-16  9:46:34 0 [Warning] WSREP: node uuid: 2ef04c9f-96d5 last_prim(type: 3, uuid: 2ef04c9f-96d5) is inconsistent to restored view(type: V_NON_PRIM, uuid: 2ef04c9f-96d5
2020-06-16  9:46:35 0 [Note] WSREP: gcomm: connected
2020-06-16  9:46:35 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2020-06-16  9:46:35 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2020-06-16  9:46:35 0 [Note] WSREP: Opened channel 'rick's lab'
2020-06-16  9:46:35 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 1, memb_num = 3
2020-06-16  9:46:35 0 [Note] WSREP: Flow-control interval: [256, 256]
2020-06-16  9:46:35 0 [Note] WSREP: Received NON-PRIMARY.
2020-06-16  9:46:35 2 [Note] WSREP: Starting applier thread 2
2020-06-16  9:46:35 2 [Note] WSREP: ================================================
View:
  id: cf61cf68-aef7-11ea-88db-1bc466429584:-1
  status: non-primary
  protocol_version: -1
  capabilities: 
  final: no
  own_index: 1
  members(3):
	0: 2ef04c9f-afa0-11ea-96d5-223dafaac5ab, unspecified
	1: 34ea9378-afa4-11ea-9a6d-db5f63f7a91e, docker3
	2: 9a21602a-afa0-11ea-96c2-5f57fad80d47, unspecified
=================================================
2020-06-16  9:46:35 2 [Note] WSREP: Non-primary view
2020-06-16  9:46:35 2 [Note] WSREP: Server status change disconnected -> connected
2020-06-16  9:46:35 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-16  9:46:35 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-16  9:46:35 1 [Note] WSREP: Starting rollbacker thread 1
2020-06-16  9:46:37 0 [Note] WSREP: (34ea9378-9a6d, 'tcp://0.0.0.0:4567') turning message relay requesting off

/var/log/messages on Node 3 (OOM)

Jun 16 09:45:55 docker3 kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Jun 16 09:45:55 docker3 kernel: mysqld cpuset=/ mems_allowed=0
Jun 16 09:45:55 docker3 kernel: CPU: 0 PID: 1864 Comm: mysqld Tainted: G               ------------ T 3.10.0-862.el7.x86_64 #1
Jun 16 09:45:55 docker3 kernel: Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Jun 16 09:45:55 docker3 kernel: Call Trace:
Jun 16 09:45:55 docker3 kernel: [<ffffffff9970d768>] dump_stack+0x19/0x1b
Jun 16 09:45:55 docker3 kernel: [<ffffffff997090ea>] dump_header+0x90/0x229
Jun 16 09:45:55 docker3 kernel: [<ffffffff990f76e2>] ? ktime_get_ts64+0x52/0xf0
Jun 16 09:45:55 docker3 kernel: [<ffffffff9914e0bf>] ? delayacct_end+0x8f/0xb0
Jun 16 09:45:55 docker3 kernel: [<ffffffff99197904>] oom_kill_process+0x254/0x3d0
Jun 16 09:45:55 docker3 kernel: [<ffffffff991973ad>] ? oom_unkillable_task+0xcd/0x120
Jun 16 09:45:55 docker3 kernel: [<ffffffff99197456>] ? find_lock_task_mm+0x56/0xc0
Jun 16 09:45:55 docker3 kernel: [<ffffffff99198146>] out_of_memory+0x4b6/0x4f0
Jun 16 09:45:55 docker3 kernel: [<ffffffff9919ea64>] __alloc_pages_nodemask+0xaa4/0xbb0
Jun 16 09:45:55 docker3 kernel: [<ffffffff991e8528>] alloc_pages_current+0x98/0x110
Jun 16 09:45:55 docker3 kernel: [<ffffffff99193cf7>] __page_cache_alloc+0x97/0xb0
Jun 16 09:45:55 docker3 kernel: [<ffffffff991963f8>] filemap_fault+0x298/0x490
Jun 16 09:45:55 docker3 kernel: [<ffffffffc020385f>] xfs_filemap_fault+0x5f/0xe0 [xfs]
Jun 16 09:45:55 docker3 kernel: [<ffffffff991c05fa>] __do_fault.isra.58+0x8a/0x100
Jun 16 09:45:55 docker3 kernel: [<ffffffff991c0b5c>] do_read_fault.isra.60+0x4c/0x1a0
Jun 16 09:45:55 docker3 kernel: [<ffffffff991c52dc>] handle_pte_fault+0x2dc/0xc30
Jun 16 09:45:55 docker3 kernel: [<ffffffff991c747d>] handle_mm_fault+0x39d/0x9b0
Jun 16 09:45:55 docker3 kernel: [<ffffffff9971a597>] __do_page_fault+0x197/0x4f0
Jun 16 09:45:55 docker3 kernel: [<ffffffff9971a925>] do_page_fault+0x35/0x90
Jun 16 09:45:55 docker3 kernel: [<ffffffff99716ab6>] ? error_swapgs+0xa7/0xbd
Jun 16 09:45:55 docker3 kernel: [<ffffffff99716768>] page_fault+0x28/0x30
Jun 16 09:45:55 docker3 kernel: Mem-Info:
Jun 16 09:45:55 docker3 kernel: active_anon:7105 inactive_anon:7118 isolated_anon:0#012 active_file:1075 inactive_file:2070 isolated_file:0#012 unevictable:0 dirty:0 writeback:0 unstable:0#012 slab_reclaimable:2609 slab_unreclaimable:2799#012 mapped:206 shmem:1 pagetables:992 bounce:0#012 free:416 free_pcp:3 free_cma:0
Jun 16 09:45:55 docker3 kernel: Node 0 DMA free:548kB min:192kB low:240kB high:288kB active_anon:3048kB inactive_anon:3064kB active_file:960kB inactive_file:1760kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:212kB shmem:0kB slab_reclaimable:1128kB slab_unreclaimable:1732kB kernel_stack:1872kB pagetables:512kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:13975 all_unreclaimable? yes
Jun 16 09:45:55 docker3 kernel: lowmem_reserve[]: 0 90 90 90
Jun 16 09:45:55 docker3 kernel: Node 0 DMA32 free:1116kB min:1124kB low:1404kB high:1684kB active_anon:25372kB inactive_anon:25408kB active_file:3340kB inactive_file:6520kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:114624kB managed:96460kB mlocked:0kB dirty:0kB writeback:0kB mapped:612kB shmem:4kB slab_reclaimable:9308kB slab_unreclaimable:9464kB kernel_stack:2048kB pagetables:3456kB unstable:0kB bounce:0kB free_pcp:12kB local_pcp:12kB free_cma:0kB writeback_tmp:0kB pages_scanned:90373 all_unreclaimable? yes
Jun 16 09:45:55 docker3 kernel: lowmem_reserve[]: 0 0 0 0
Jun 16 09:45:55 docker3 kernel: Node 0 DMA: 21*4kB (UEM) 24*8kB (UM) 15*16kB (M) 1*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 548kB
Jun 16 09:45:55 docker3 kernel: Node 0 DMA32: 225*4kB (UEM) 27*8kB (UEM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1116kB
Jun 16 09:45:55 docker3 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 16 09:45:55 docker3 kernel: 9876 total pagecache pages
Jun 16 09:45:55 docker3 kernel: 6719 pages in swap cache
Jun 16 09:45:55 docker3 kernel: Swap cache stats: add 350160, delete 343441, find 113374/140473
Jun 16 09:45:55 docker3 kernel: Free swap  = 1695520kB
Jun 16 09:45:55 docker3 kernel: Total swap = 2052092kB
Jun 16 09:45:55 docker3 kernel: 32654 pages RAM
Jun 16 09:45:55 docker3 kernel: 0 pages HighMem/MovableOnly
Jun 16 09:45:55 docker3 kernel: 4562 pages reserved
Jun 16 09:45:55 docker3 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Jun 16 09:45:55 docker3 kernel: [  452]     0   452     8745        0      22       79             0 systemd-journal
Jun 16 09:45:55 docker3 kernel: [  468]     0   468    49631        0      29      103             0 lvmetad
Jun 16 09:45:55 docker3 kernel: [  479]     0   479    11084        1      23      144         -1000 systemd-udevd
Jun 16 09:45:55 docker3 kernel: [  580]     0   580    13877        0      27      120         -1000 auditd
Jun 16 09:45:55 docker3 kernel: [  604]    81   604    14549        0      31      162          -900 dbus-daemon
Jun 16 09:45:55 docker3 kernel: [  605]     0   605   119617       13      88      581             0 NetworkManager
Jun 16 09:45:55 docker3 kernel: [  607]     0   607     6594        0      18       84             0 systemd-logind
Jun 16 09:45:55 docker3 kernel: [  610]   999   610   134635        0      61      883             0 polkitd
Jun 16 09:45:55 docker3 kernel: [  618]     0   618    31571        0      20      155             0 crond
Jun 16 09:45:55 docker3 kernel: [  622]    38   622    11809        0      28      177             0 ntpd
Jun 16 09:45:55 docker3 kernel: [  626]     0   626    27522        1       9       32             0 agetty
Jun 16 09:45:55 docker3 kernel: [  873]     0   873   143453      119      97     2647             0 tuned
Jun 16 09:45:55 docker3 kernel: [  874]     0   874    28199        0      59      257         -1000 sshd
Jun 16 09:45:55 docker3 kernel: [  875]     0   875    53602        1      41      384             0 rsyslogd
Jun 16 09:45:55 docker3 kernel: [ 1163]     0  1163    40350       29      81      326             0 sshd
Jun 16 09:45:55 docker3 kernel: [ 1165]     0  1165    28859        1      14      104             0 bash
Jun 16 09:45:55 docker3 kernel: [ 1470]   996  1470   380637    11850     281    76501             0 mysqld
Jun 16 09:45:55 docker3 kernel: [ 1889]     0  1889    40409       50      35      306             0 mysql
Jun 16 09:45:55 docker3 kernel: Out of memory: Kill process 1470 (mysqld) score 163 or sacrifice child
Jun 16 09:45:55 docker3 kernel: Killed process 1470 (mysqld) total-vm:1522548kB, anon-rss:47400kB, file-rss:0kB, shmem-rss:0kB
Jun 16 09:45:55 docker3 systemd: mariadb.service: main process exited, code=killed, status=9/KILL
Jun 16 09:45:55 docker3 systemd: Unit mariadb.service entered failed state.
Jun 16 09:45:55 docker3 systemd: mariadb.service failed.
Jun 16 09:46:00 docker3 systemd: mariadb.service holdoff time over, scheduling restart.
Jun 16 09:46:00 docker3 systemd: Starting MariaDB 10.4.13 database server...
Jun 16 09:46:33 docker3 sh: WSREP: Recovered position cf61cf68-aef7-11ea-88db-1bc466429584:130
Jun 16 09:46:33 docker3 mysqld: 2020-06-16  9:46:33 0 [Note] /usr/sbin/mysqld (mysqld 10.4.13-MariaDB-log) starting as process 2248 ...



 Comments   
Comment by Rick Pizzi [ 2020-06-16 ]

Config file (same on all nodes except server-id and wsrep_cluster_address)

[mysql]
socket                         = /var/lib/mysql/mysql.sock
 
[mysqld]
user                           = mysql
basedir                        = /usr
datadir                        = /var/lib/mysql
socket                         = /var/lib/mysql/mysql.sock
server_id                      = 3
bind-address                   = 0.0.0.0
skip-name-resolve              = true
memlock                        = 0
sysdate-is-now                 = 1
max-connections                = 8000
thread-cache-size              = 512
query-cache-type               = 0
query-cache-size               = 0
lower-case-table-names         = 1
wait_timeout                   = 130
open_files_limit               = 65535
table_open_cache               = 20000
table_definition_cache         = 20000
key_buffer_size                = 24M
tmp_table_size                 = 64M
max_heap_table_size            = 64M
max-allowed-packet             = 512M
sql_mode=''
 
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_stats_on_metadata = 0
innodb_io_capacity = 1000
innodb_file_format = Barracuda
innodb_thread_concurrency = 16
innodb-buffer-pool-size=128M
innodb-flush-log-at-trx-commit=2
innodb-file-per-table=1
innodb-data-file-path = ibdata1:12M:autoextend
innodb-read-io-threads=4
innodb-write-io-threads=4
innodb-doublewrite=1
innodb-log-file-size=2000M
innodb-log-buffer-size=64M
innodb-log-files-in-group=2
innodb-file-format=barracuda
innodb-flush-method = O_DIRECT
innodb-locks-unsafe-for-binlog=1
innodb_open_files=4096
binlog_format=ROW
slow_query_log = 0  
long_query_time = 20
log_slow_verbosity=query_plan,innodb
log_warnings=2 
log_bin=/var/lib/mysql/mariadb-bin.log
log_error=/var/log/mysql/mysqld.err
log_slave_updates=1
expire_logs_days=3
 
gtid_strict_mode=1
 
wsrep_on                       = ON
wsrep_gtid_mode                = ON
wsrep_gtid_domain_id           = 1
wsrep_provider                 = /usr/lib64/galera-4/libgalera_smm.so
wsrep_cluster_address          = "gcomm://192.168.2.90"
#wsrep_cluster_address="gcomm://"
wsrep_sst_method               = mariabackup
wsrep_sst_auth                 = sst:sst
wsrep_cluster_name             = "rick's lab"
wsrep_slave_threads            = 8
wsrep_provider_options         = "gcache.size=128M;evs.inactive_timeout=PT30S;evs.suspect_timeout=PT15S;gcs.fc_master_slave=yes;gcs.fc_limit=256"
 
[xtrabackup]
parallel=4
 
[mysqldump]
max-allowed-packet = 512M

Comment by Julius Goryavsky [ 2020-07-21 ]

As far as I understand, the primary component is lost due to lack of memory on one of the nodes (Node 3). It is difficult to say for sure in advance, but perhaps this problem has a common cause with the MDEV-22998 issue. If a memory leak is confirmed in the MDEV-22998 test, then this may be an explanation for both problems.

Comment by Jan Lindström (Inactive) [ 2021-01-28 ]

rpizzi Can we have full error logs from all nodes and please could you use -mysqld=-wsrep-debug=1

Comment by Jan Lindström (Inactive) [ 2021-01-29 ]

I could not repeat the issue with 3-node cluster and limiting memory size on one node. That one node naturally crashed but all other nodes still are in Primary state. This with mariadb-10.4.18 and Galera library 26.4.7.

Comment by Doug Whitfield [ 2021-09-22 ]

Has anyone seen this behavior in 10.3?

Comment by Doug Whitfield [ 2021-09-24 ]

if this is the same issue as MDEV-22998, then this isn't going to appear in 10.3, because the code that fixed MDEV-22998 did not exist in 10.3.

Generated at Thu Feb 08 09:18:23 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.