[MDEV-17935] Loss of connection every 180 seconds under load

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Not a Bug
Affects Version/s: 10.2.17
Fix Version/s: 10.2.19
Component/s: Galera
Labels:
- galera

Description

Client Community Choice Financial is running into a problem where, during periods of high traffic, one or two of the nodes drop out of the cluster. After rejoining, they report a loss of connection to other members of the cluster exactly every 180 seconds. Here is an example error snippet:

2018-12-08 1:36:04 140465462552320 [Note] WSREP: (2619e469, 'ssl://0.0.0.0:4567') connection to peer 4ec5a287 with addr ssl://10.225.17.115:4567 timed out, no messages seen in PT3S
2018-12-08 1:36:04 140465462552320 [Note] WSREP: (2619e469, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers: ssl://10.225.17.115:4567
2018-12-08 1:36:05 140465462552320 [Note] WSREP: (2619e469, 'ssl://0.0.0.0:4567') reconnecting to 4ec5a287 (ssl://10.225.17.115:4567), attempt 0
2018-12-08 1:36:06 140465462552320 [Note] WSREP: evs::proto(2619e469, GATHER, view_id(REG,0fb31d1c,854)) suspecting node: 4ec5a287
2018-12-08 1:36:06 140465462552320 [Note] WSREP: evs::proto(2619e469, GATHER, view_id(REG,0fb31d1c,854)) suspected node without join message, declaring inactive
2018-12-08 1:36:07 140465462552320 [Note] WSREP: declaring 0fb31d1c at ssl://10.225.16.156:4567 stable
2018-12-08 1:36:07 140465462552320 [Note] WSREP: declaring 1804a1ab at ssl://10.225.18.13:4567 stable
2018-12-08 1:36:07 140465462552320 [Note] WSREP: declaring c6aa9036 at ssl://10.225.17.83:4567 stable
2018-12-08 1:36:07 140465462552320 [Note] WSREP: Node 0fb31d1c state prim
2018-12-08 1:36:07 140465462552320 [Note] WSREP: view(view_id(PRIM,0fb31d1c,855) memb {
    0fb31d1c,0
    1804a1ab,0
    2619e469,0
    c6aa9036,0
} joined {
} left {
} partitioned {
    4ec5a287,0
})
2018-12-08 1:36:07 140465462552320 [Note] WSREP: save pc into disk
2018-12-08 1:36:07 140465462552320 [Note] WSREP: forgetting 4ec5a287 (ssl://10.225.17.115:4567)
2018-12-08 1:36:07 140465462552320 [Note] WSREP: deleting entry ssl://10.225.17.115:4567
2018-12-08 1:36:07 140465454159616 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 4
2018-12-08 1:36:07 140465454159616 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2018-12-08 1:36:07 140465462552320 [Note] WSREP: (2619e469, 'ssl://0.0.0.0:4567') turning message relay requesting off
2018-12-08 1:36:07 140465454159616 [Note] WSREP: STATE EXCHANGE: sent state msg: a23b93b6-fa89-11e8-afc7-fedac3aae8d8
2018-12-08 1:36:07 140465454159616 [Note] WSREP: STATE EXCHANGE: got state msg: a23b93b6-fa89-11e8-afc7-fedac3aae8d8 from 0 (ecash-db-d)
2018-12-08 1:36:07 140465454159616 [Note] WSREP: STATE EXCHANGE: got state msg: a23b93b6-fa89-11e8-afc7-fedac3aae8d8 from 1 (ecash-db-c)
2018-12-08 1:36:07 140465454159616 [Note] WSREP: STATE EXCHANGE: got state msg: a23b93b6-fa89-11e8-afc7-fedac3aae8d8 from 2 (ecash-db-a)
2018-12-08 1:36:07 140465454159616 [Note] WSREP: STATE EXCHANGE: got state msg: a23b93b6-fa89-11e8-afc7-fedac3aae8d8 from 3 (ecash-db-e)
2018-12-08 1:36:07 140465454159616 [Note] WSREP: Quorum results:
version = 4,
component = PRIMARY,
conf_id = 701,
members = 4/4 (joined/total),
act_id = 311915421,
last_appl. = 311915330,
protocols = 0/8/3 (gcs/repl/appl),
group UUID = 5388b583-0c4f-11e8-8644-9f4517a87e4a
2018-12-08 1:36:07 140465454159616 [Note] WSREP: Flow-control interval: [16, 16]
2018-12-08 1:36:07 140465454159616 [Note] WSREP: Trying to continue unpaused monitor
2018-12-08 1:36:07 140465431856896 [Note] WSREP: New cluster view: global state: 5388b583-0c4f-11e8-8644-9f4517a87e4a:311915421, view# 702: Primary, number of nodes: 4, my index: 2, protocol version 3
2018-12-08 1:36:07 140465431856896 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-12-08 1:36:07 140465431856896 [Note] WSREP: REPL Protocols: 8 (3, 2)
2018-12-08 1:36:07 140465431856896 [Note] WSREP: Assign initial position for certification: 311915421, protocol version: 3
2018-12-08 1:36:07 140465496123136 [Note] WSREP: Service thread queue flushed.
2018-12-08 1:36:09 140465462552320 [Note] WSREP: SSL handshake successful, remote endpoint ssl://10.225.17.115:56884 local endpoint ssl://10.225.16.150:4567 cipher: AES128-SHA compression: none
2018-12-08 1:36:09 140465462552320 [Note] WSREP: (2619e469, 'ssl://0.0.0.0:4567') connection established to 4ec5a287 ssl://10.225.17.115:4567
2018-12-08 1:36:09 140465462552320 [Warning] WSREP: discarding established (time wait) 4ec5a287 (ssl://10.225.17.115:4567)
2018-12-08 1:36:10 140465462552320 [Note] WSREP: cleaning up 4ec5a287 (ssl://10.225.17.115:4567)
2018-12-08 1:36:12 140465462552320 [Note] WSREP: (2619e469, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers: ssl://10.225.17.115:4567
2018-12-08 1:36:12 140465462552320 [Note] WSREP: SSL handshake successful, remote endpoint ssl://10.225.17.115:4567 local endpoint ssl://10.225.16.150:43646 cipher: AES128-SHA compression: none

This sequence repeats for an indeterminate amount of time, until the host becomes stable again.
The client has had AWS examine the underlying hardware and systems; AWS found no limits being hit and no network issues.
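One way to check the "every 180 seconds exactly" observation against a full error log is to pull out the "timed out" notices and measure the gap between consecutive ones for each peer. The following is only a minimal sketch: the function name `timeout_gaps` is hypothetical, and the pattern assumes every relevant line has the same shape as the first line of the snippet above (timestamp first, then `connection to peer <id> ... timed out`).

```python
import re
from datetime import datetime

# Assumed line shape, taken from the "timed out" notice in the snippet:
#   <date> <time> <thread> [Note] WSREP: (...) connection to peer <id> ... timed out, ...
TIMEOUT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}\s+\d{1,2}:\d{2}:\d{2}).*connection to peer (\w+) .*timed out"
)


def timeout_gaps(lines):
    """Return {peer_id: [seconds between consecutive timeout notices]}."""
    last_seen = {}  # peer id -> timestamp of its previous timeout notice
    gaps = {}       # peer id -> list of gaps in seconds
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if not match:
            continue  # not a peer-timeout notice
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        peer = match.group(2)
        if peer in last_seen:
            delta = (stamp - last_seen[peer]).total_seconds()
            gaps.setdefault(peer, []).append(delta)
        last_seen[peer] = stamp
    return gaps
```

Passing an open log file (any iterable of lines) to `timeout_gaps` returns the per-peer gaps; a list dominated by values of 180.0 for the affected peer would be consistent with the behaviour described here, while irregular gaps would suggest the period is not as exact as it appears.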

Activity

Isaac Venn (Inactive) added a comment - 2018-12-21 17:01

This issue appears to have been related to UDP traffic being blocked from the Galera node to the MaxScale host.

People

Assignee: Jan Lindström (Inactive)
Reporter: Isaac Venn (Inactive)
Votes: 1
Watchers: 2

Dates

Created: 2018-12-08 02:07
Updated: 2024-07-07 23:24
Resolved: 2018-12-21 17:01

MariaDB Server