Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Version: 10.5.12
Environment: redhat 7 on VMware
Description
Hi,
Our DB setup is 2 data nodes + 1 arbitrator. In one incident, the VM of one data node was rebooted. We found that the remaining good node took a long time to be declared stable.
After checking, the difference comes from the evs.suspect_timeout parameter.
With evs.suspect_timeout=PT5S (the default), the remaining good node is declared stable within a few seconds.
With evs.suspect_timeout=PT30S, declaring the remaining good node stable takes more than 20 seconds.
What causes this long delay before the remaining good node is declared stable?
Can we keep a large evs.suspect_timeout and still get a short declare-stable time?
Good Case:
49a4a26a-b4f0 is the node that got rebooted. 3974f500-ba41 is declared stable just a few seconds later.
gmcast.peer_timeout=PT15S;
evs.inactive_check_period=PT2.5S;
evs.keepalive_period=PT1S;
evs.suspect_timeout=PT5S;
evs.inactive_timeout=PT1M;
evs.install_timeout=PT1M;
2021-11-15 23:05:58 0 [Note] WSREP: evs::proto(49a4a26a-b4f0, GATHER, view_id(REG,3974f500-ba41,175)) suspecting node: 5256b203-ba1b
2021-11-15 23:05:58 0 [Note] WSREP: evs::proto(49a4a26a-b4f0, GATHER, view_id(REG,3974f500-ba41,175)) suspected node without join message, declaring inactive
2021-11-15 23:05:59 0 [Note] WSREP: declaring 3974f500-ba41 at ssl://172.25.100.205:18301 stable
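The timeout values above are ISO 8601 durations. To compare the two configurations in plain seconds, a minimal Python sketch could look like the following. It assumes only the simple PT..H/M/S forms that appear in this report, and parse_duration and parse_options are hypothetical helper names, not part of Galera or MariaDB.

```python
import re
from datetime import timedelta

# Handles only the simple forms seen in this report (PT5S, PT2.5S, PT1M, ...).
# This is an assumption; it is not a full ISO 8601 duration parser.
_DURATION = re.compile(
    r"^PT(?:(\d+(?:\.\d+)?)H)?(?:(\d+(?:\.\d+)?)M)?(?:(\d+(?:\.\d+)?)S)?$"
)


def parse_duration(text):
    """Convert a duration such as 'PT2.5S' into a timedelta."""
    match = _DURATION.match(text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (float(g) if g else 0.0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def parse_options(options):
    """Split 'key=value;' pairs and parse every value as a duration."""
    result = {}
    for item in options.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        result[key.strip()] = parse_duration(value)
    return result


# Options copied from the Good Case above.
GOOD_CASE = """
gmcast.peer_timeout=PT15S;
evs.inactive_check_period=PT2.5S;
evs.keepalive_period=PT1S;
evs.suspect_timeout=PT5S;
evs.inactive_timeout=PT1M;
evs.install_timeout=PT1M;
"""

good = parse_options(GOOD_CASE)
for name, value in good.items():
    print(f"{name}: {value.total_seconds():g} s")
```

Running the same function on the second option set makes it easy to confirm that evs.suspect_timeout is the only value that differs between the two cases.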
Long Declare Stable Case:
8f049c9f-9b55 is the node that got rebooted. 8144286d-abff is declared stable 19 seconds later.
gmcast.peer_timeout=PT15S;
evs.inactive_check_period=PT2.5S;
evs.keepalive_period=PT1S;
evs.suspect_timeout=PT30S;
evs.inactive_timeout=PT1M;
evs.install_timeout=PT1M;
2021-11-15 23:14:25 0 [Note] WSREP: (8980a003-82ed, 'ssl://172.25.100.204:18301') connection to peer 8f049c9f-9b55 with addr ssl://172.25.100.203:18301 timed out, no messages seen in PT15S, socket stats: rtt: 7793 rttvar: 11996 rto: 13312000 lost: 1 last_data_recv: 15310 cwnd: 1 last_queued_since: 315444649 last_delivered_since: 15310482757 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2021-11-15 23:14:25 0 [Note] WSREP: (8980a003-82ed, 'ssl://172.25.100.204:18301') turning message relay requesting on, nonlive peers: ssl://172.25.100.203:18301
2021-11-15 23:14:26 0 [Note] WSREP: (8980a003-82ed, 'ssl://172.25.100.204:18301') reconnecting to 8f049c9f-9b55 (ssl://172.25.100.203:18301), attempt 0
2021-11-15 23:14:39 0 [Note] WSREP: (8980a003-82ed, 'ssl://172.25.100.204:18301') connection established to 8f049c9f-9b56 ssl://172.25.100.203:18301
2021-11-15 23:14:39 0 [Note] WSREP: remote endpoint ssl://172.25.100.203:18301 changed identity 8f049c9f-4626-11ec-9b55-632fa20da5bf -> 8f049c9f-4626-11ec-9b56-632fa20da5bf
2021-11-15 23:14:41 0 [Note] WSREP: evs::proto(8980a003-82ed, OPERATIONAL, view_id(REG,8144286d-abff,183)) suspecting node: 8f049c9f-9b55
2021-11-15 23:14:41 0 [Note] WSREP: evs::proto(8980a003-82ed, OPERATIONAL, view_id(REG,8144286d-abff,183)) suspected node without join message, declaring inactive
2021-11-15 23:14:44 0 [Note] WSREP: declaring 8144286d-abff at ssl://172.25.100.205:18301 stable
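For reference, the delays quoted above can be measured directly from the log timestamps. The Python sketch below is only an illustration: the log messages are shortened copies of the lines above (timestamps unchanged), seconds_until_stable is a hypothetical helper name, and the delay is counted from the first log line shown, not from the moment the VM actually went down.

```python
from datetime import datetime

# Shortened copies of the log lines above; only the timestamps matter here.
GOOD_CASE_LOG = [
    "2021-11-15 23:05:58 0 [Note] WSREP: suspecting node: 5256b203-ba1b",
    "2021-11-15 23:05:58 0 [Note] WSREP: suspected node without join message, declaring inactive",
    "2021-11-15 23:05:59 0 [Note] WSREP: declaring 3974f500-ba41 at ssl://172.25.100.205:18301 stable",
]

LONG_CASE_LOG = [
    "2021-11-15 23:14:25 0 [Note] WSREP: connection to peer 8f049c9f-9b55 timed out, no messages seen in PT15S",
    "2021-11-15 23:14:41 0 [Note] WSREP: suspecting node: 8f049c9f-9b55",
    "2021-11-15 23:14:41 0 [Note] WSREP: suspected node without join message, declaring inactive",
    "2021-11-15 23:14:44 0 [Note] WSREP: declaring 8144286d-abff at ssl://172.25.100.205:18301 stable",
]


def timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' of a log line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def seconds_until_stable(lines):
    """Seconds between the first log line and the 'declaring ... stable' line."""
    start = timestamp(lines[0])
    for line in lines:
        text = line.rstrip()
        # Skip 'declaring inactive'; only the '... stable' line ends the wait.
        if " declaring " in text and text.endswith(" stable"):
            return (timestamp(text) - start).total_seconds()
    raise ValueError("no 'declaring ... stable' line found")


print("good case:", seconds_until_stable(GOOD_CASE_LOG), "s")
print("long case:", seconds_until_stable(LONG_CASE_LOG), "s")
```

With the excerpts above this gives 1 second for the good case and 19 seconds for the long case. In the long case 16 of those 19 seconds pass between the PT15S peer timeout message and the "suspecting node" message.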