[MDEV-22724] Galera cluster losing primary component Created: 2020-05-26  Updated: 2023-10-09  Resolved: 2023-10-09

Status: Closed
Project: MariaDB Server
Component/s: Galera
Fix Version/s: N/A

Type: Task Priority: Major
Reporter: Vincent Milum Jr Assignee: Jan Lindström
Resolution: Incomplete Votes: 0
Labels: galera

Issue Links:
Relates
relates to MDEV-22908 After OOM event on one node, entire c... Closed

 Description   

Unsure why this keeps happening. The cluster keeps shifting all nodes from PRIMARY to NON-PRIMARY at the exact same time. The logs are vaguely implying that updates are only being pushed to half the cluster, and not the other half, so it is splitting itself.

Currently the cluster is 6 nodes spread across 3 data centers with 2 nodes per data center.

When this occurs, the log message gets massively flooded. Overnight, each server generated anywhere in the neighborhood of ~50-100MB in the error log file, flooded with the same text over and over again showing the unstable cluster state. This makes it difficult to find where/how things are breaking.

2020-05-26 10:18:47 0 [Note] WSREP: (14cf7875, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.16.8.12:4567
2020-05-26 10:18:47 0 [Note] WSREP: (14cf7875, 'tcp://0.0.0.0:4567') connection established to f57bde06 tcp://172.16.8.12:4567
2020-05-26 10:18:47 0 [Note] WSREP: declaring 3cfcb3bd at tcp://172.16.12.61:4567 stable
2020-05-26 10:18:47 0 [Note] WSREP: declaring 454edd57 at tcp://172.16.20.61:4567 stable
2020-05-26 10:18:47 0 [Note] WSREP: declaring 54695c9f at tcp://172.16.12.62:4567 stable
2020-05-26 10:18:47 0 [Note] WSREP: declaring 74c795ad at tcp://172.16.8.11:4567 stable
2020-05-26 10:18:47 0 [Note] WSREP: declaring f57bde06 at tcp://172.16.8.12:4567 stable
2020-05-26 10:18:47 0 [Note] WSREP: Node 14cf7875 state prim
2020-05-26 10:18:47 0 [Note] WSREP: view(view_id(PRIM,14cf7875,7) memb {
    14cf7875,2
    3cfcb3bd,1
    454edd57,2
    54695c9f,1
    74c795ad,8
    f57bde06,8
} joined {
} left {
} partitioned {
})
2020-05-26 10:18:47 0 [Note] WSREP: save pc into disk
2020-05-26 10:18:47 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 6
2020-05-26 10:18:47 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: f5acf28b-9f74-11ea-a86a-d6a84667ccd7
2020-05-26 10:18:47 0 [Note] WSREP: STATE EXCHANGE: sent state msg: f5acf28b-9f74-11ea-a86a-d6a84667ccd7
2020-05-26 10:18:47 0 [Note] WSREP: STATE EXCHANGE: got state msg: f5acf28b-9f74-11ea-a86a-d6a84667ccd7 from 4 (puy-maria-01)
2020-05-26 10:18:47 0 [Note] WSREP: STATE EXCHANGE: got state msg: f5acf28b-9f74-11ea-a86a-d6a84667ccd7 from 0 (sea-maria-02)
2020-05-26 10:18:47 0 [Note] WSREP: STATE EXCHANGE: got state msg: f5acf28b-9f74-11ea-a86a-d6a84667ccd7 from 2 (sea-maria-01)
2020-05-26 10:18:47 0 [Note] WSREP: STATE EXCHANGE: got state msg: f5acf28b-9f74-11ea-a86a-d6a84667ccd7 from 3 (tac-maria-02)
2020-05-26 10:18:47 0 [Note] WSREP: STATE EXCHANGE: got state msg: f5acf28b-9f74-11ea-a86a-d6a84667ccd7 from 1 (tac-cospix-maria-01)
2020-05-26 10:18:47 0 [Note] WSREP: STATE EXCHANGE: got state msg: f5acf28b-9f74-11ea-a86a-d6a84667ccd7 from 5 (puy-maria-02)
2020-05-26 10:18:47 0 [Note] WSREP: Quorum results:
    version    = 5,
    component  = PRIMARY,
    conf_id    = 5,
    members    = 5/6 (joined/total),
    act_id     = 159998,
    last_appl. = 159942,
    protocols  = 1/10/4 (gcs/repl/appl),
    vote policy= 0,
    group UUID = 00c6c8ed-9281-11ea-b296-5b01975a9d80
2020-05-26 10:18:47 0 [Note] WSREP: Writing down CC checksum: c3bcc75b e521c376 e088241f 2abed3ee at offset 456
2020-05-26 10:18:47 0 [Note] WSREP: Flow-control interval: [39, 39]
2020-05-26 10:18:47 0 [Note] WSREP: Trying to continue unpaused monitor
2020-05-26 10:18:47 2 [Note] WSREP: ####### processing CC 159999, local, ordered
2020-05-26 10:18:47 2 [Note] WSREP: ####### drain monitors upto 159998
2020-05-26 10:18:47 2 [Note] WSREP: REPL Protocols: 10 (5, 3)
2020-05-26 10:18:47 2 [Note] WSREP: ####### My UUID: 14cf7875-9f6d-11ea-be91-3b28d841cd17
2020-05-26 10:18:47 2 [Note] WSREP: ####### ST not required
2020-05-26 10:18:47 2 [Note] WSREP: Skipping cert index reset
2020-05-26 10:18:47 2 [Note] WSREP: ####### Adjusting cert position: 159998 -> 159999
2020-05-26 10:18:47 0 [Note] WSREP: Service thread queue flushed.
2020-05-26 10:18:47 2 [Note] WSREP: ####### Setting monitor position to 159999
2020-05-26 10:18:47 2 [Note] WSREP: Lowest cert indnex boundary for CC from group: 159943
2020-05-26 10:18:47 2 [Note] WSREP: Min available from gcache for CC from group: 156743
2020-05-26 10:18:47 2 [Note] WSREP: ================================================
View:
  id: 00c6c8ed-9281-11ea-b296-5b01975a9d80:159999
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(6):
    0: 14cf7875-9f6d-11ea-be91-3b28d841cd17, sea-maria-02
    1: 3cfcb3bd-9f66-11ea-b50e-4306f593b013, tac-cospix-maria-01
    2: 454edd57-9f6b-11ea-aeea-2bf542f463e4, sea-maria-01
    3: 54695c9f-9f66-11ea-a4d8-7ba7a1c6775f, tac-maria-02
    4: 74c795ad-9f6e-11ea-989d-9313028d7e05, puy-maria-01
    5: f57bde06-9f74-11ea-968d-af0c6e831d4e, puy-maria-02
=================================================
2020-05-26 10:18:47 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-05-26 10:18:47 0 [Note] WSREP: Member 5.8 (puy-maria-02) requested state transfer from '*any*'. Selected 1.1 (tac-cospix-maria-01)(SYNCED) as donor.
2020-05-26 10:18:48 0 [Note] WSREP: 1.1 (tac-cospix-maria-01): State transfer to 5.8 (puy-maria-02) complete.
2020-05-26 10:18:48 0 [Note] WSREP: Member 1.1 (tac-cospix-maria-01) synced with group.
2020-05-26 10:18:50 0 [Note] WSREP: (14cf7875, 'tcp://0.0.0.0:4567') turning message relay requesting off
2020-05-26 10:33:52 0 [Note] WSREP: (14cf7875, 'tcp://0.0.0.0:4567') connection to peer 74c795ad with addr tcp://172.16.8.11:4567 timed out, no messages seen in PT3S
2020-05-26 10:33:52 0 [Note] WSREP: (14cf7875, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.16.8.11:4567
2020-05-26 10:33:53 0 [Note] WSREP: (14cf7875, 'tcp://0.0.0.0:4567') reconnecting to 74c795ad (tcp://172.16.8.11:4567), attempt 0
2020-05-26 10:33:56 0 [Note] WSREP: (14cf7875, 'tcp://0.0.0.0:4567') connection to peer 74c795ad with addr tcp://172.16.8.11:4567 timed out, no messages seen in PT3S
2020-05-26 10:33:56 0 [Note] WSREP: (14cf7875, 'tcp://0.0.0.0:4567') connection established to 74c795ad tcp://172.16.8.11:4567
2020-05-26 10:33:59 0 [Note] WSREP: (14cf7875, 'tcp://0.0.0.0:4567') turning message relay requesting off
2020-05-26 10:51:00 0 [Note] WSREP: view(view_id(NON_PRIM,14cf7875,7) memb {
    14cf7875,2
    3cfcb3bd,1
    54695c9f,1
} joined {
} left {
} partitioned {
    454edd57,2
    74c795ad,8
    f57bde06,8
})
2020-05-26 10:51:00 0 [Note] WSREP: declaring 3cfcb3bd at tcp://172.16.12.61:4567 stable
2020-05-26 10:51:00 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 3
2020-05-26 10:51:00 0 [Note] WSREP: declaring 54695c9f at tcp://172.16.12.62:4567 stable
2020-05-26 10:51:00 0 [Note] WSREP: Writing down CC checksum: d8cb29c2 cdf070a4 64dba1ce 6fc2471e at offset 264
2020-05-26 10:51:00 0 [Note] WSREP: Flow-control interval: [28, 28]
2020-05-26 10:51:00 0 [Note] WSREP: Trying to continue unpaused monitor
2020-05-26 10:51:00 0 [Note] WSREP: Received NON-PRIMARY.
2020-05-26 10:51:00 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 161343)
2020-05-26 10:51:00 2 [Note] WSREP: ####### processing CC -1, local, ordered
2020-05-26 10:51:00 2 [Note] WSREP: ####### drain monitors upto 161343
2020-05-26 10:51:00 2 [Note] WSREP: ####### My UUID: 14cf7875-9f6d-11ea-be91-3b28d841cd17
2020-05-26 10:51:00 2 [Note] WSREP: ####### ST not required
2020-05-26 10:51:00 2 [Note] WSREP: ================================================
View:
  id: 00c6c8ed-9281-11ea-b296-5b01975a9d80:-1
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(3):
    0: 14cf7875-9f6d-11ea-be91-3b28d841cd17, sea-maria-02
    1: 3cfcb3bd-9f66-11ea-b50e-4306f593b013, tac-cospix-maria-01
    2: 54695c9f-9f66-11ea-a4d8-7ba7a1c6775f, tac-maria-02
=================================================
2020-05-26 10:51:00 2 [Note] WSREP: Non-primary view
2020-05-26 10:51:00 2 [Note] WSREP: Server status change synced -> connected
2020-05-26 10:51:00 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-05-26 10:51:00 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-05-26 10:51:00 0 [Note] WSREP: view(view_id(NON_PRIM,14cf7875,8) memb {
    14cf7875,2
    3cfcb3bd,1
    54695c9f,1
} joined {
} left {
} partitioned {
    454edd57,2
    74c795ad,8
    f57bde06,8
})
2020-05-26 10:51:00 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 3
2020-05-26 10:51:00 0 [Note] WSREP: Writing down CC checksum: d8cb29c2 cdf070a4 64dba1ce 6fc2471e at offset 264
2020-05-26 10:51:00 0 [Note] WSREP: Flow-control interval: [28, 28]
2020-05-26 10:51:00 0 [Note] WSREP: Trying to continue unpaused monitor
2020-05-26 10:51:00 0 [Note] WSREP: Received NON-PRIMARY.
2020-05-26 10:51:00 2 [Note] WSREP: ####### processing CC -1, local, ordered
2020-05-26 10:51:00 2 [Note] WSREP: ####### drain monitors upto 161343
2020-05-26 10:51:00 2 [Note] WSREP: ####### My UUID: 14cf7875-9f6d-11ea-be91-3b28d841cd17
2020-05-26 10:51:00 2 [Note] WSREP: ####### ST not required
2020-05-26 10:51:00 2 [Note] WSREP: ================================================
View:
  id: 00c6c8ed-9281-11ea-b296-5b01975a9d80:-1
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(3):
    0: 14cf7875-9f6d-11ea-be91-3b28d841cd17, sea-maria-02
    1: 3cfcb3bd-9f66-11ea-b50e-4306f593b013, tac-cospix-maria-01
    2: 54695c9f-9f66-11ea-a4d8-7ba7a1c6775f, tac-maria-02
=================================================
2020-05-26 10:51:00 2 [Note] WSREP: Non-primary view
2020-05-26 10:51:00 2 [Note] WSREP: Server status change connected -> connected
2020-05-26 10:51:00 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-05-26 10:51:00 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.



 Comments   
Comment by Vincent Milum Jr [ 2020-06-09 ]

The cluster has been perfectly stable at 4-nodes since the initial bug report a couple weeks ago. As soon as two more nodes are brought back in, all hell breaks loose within ~30 minutes. It generates ~2-3MB worth of logs of attempting quorum over and over and over again.

https://gist.github.com/darkain/c5cfca7d9338ae7e8508f008f1034b96

Comment by Doug Whitfield [ 2021-09-22 ]

@vincent, it what version was this happening?

Comment by Jan Lindström [ 2023-06-06 ]

darkain Please try more recent version of MariaDB and Galera library. If problem reproduces, please provide full unedited error log, node configuration and show processlist output.

Generated at Thu Feb 08 09:16:59 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.