[MDEV-27051] large evs.suspect_timeout causing long declare stable on remaining good node Created: 2021-11-15  Updated: 2022-11-01

Status: Open
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.5.12
Fix Version/s: 10.5

Type: Bug
Priority: Major
Reporter: William Wong
Assignee: Teemu Ollakka
Resolution: Unresolved
Votes: 0
Labels: None
Environment: Red Hat 7 on VMware



 Description   

Hi,

Our DB setup is 2 data nodes + 1 arbitrator. In one incident, the VM of one data node was rebooted. We found that it took a long time for the remaining good node to be declared stable.
After checking, we traced the difference to the evs.suspect_timeout parameter.
If we set evs.suspect_timeout=PT5S (default), the remaining good node is declared stable within a few seconds.
If we set evs.suspect_timeout=PT30S, it takes more than 20 seconds for the remaining good node to be declared stable.

What causes this long declare-stable delay on the remaining good node?

Can we keep a large evs.suspect_timeout and still have a short declare-stable time?

Good Case:

49a4a26a-b4f0 is the node that got rebooted. 3974f500-ba41 was declared stable just a few seconds later.

gmcast.peer_timeout=PT15S;
evs.inactive_check_period=PT2.5S;
evs.keepalive_period=PT1S;
evs.suspect_timeout=PT5S;
evs.inactive_timeout=PT1M;
evs.install_timeout=PT1M;
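
For readers who want to compare these values numerically, here is a minimal Python sketch that converts the ISO 8601 durations above into seconds. The helper names (duration_seconds, parse_options) are hypothetical and not part of Galera, and the parser only handles the hour/minute/second forms that appear in this report.

```python
import re

# Handles only the PT..H..M..S subset of ISO 8601 durations seen in this report.
_DURATION = re.compile(
    r"^PT(?:(\d+(?:\.\d+)?)H)?(?:(\d+(?:\.\d+)?)M)?(?:(\d+(?:\.\d+)?)S)?$"
)

def duration_seconds(text):
    """Convert a duration such as 'PT2.5S' or 'PT1M' to seconds."""
    match = _DURATION.match(text)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (float(g) if g else 0.0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds

def parse_options(options):
    """Parse 'key=value;' pairs into a dict mapping option name to seconds."""
    result = {}
    for item in options.replace("\n", "").split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the trailing ';'
        key, _, value = item.partition("=")
        result[key.strip()] = duration_seconds(value.strip())
    return result

# Settings copied from the good case above.
GOOD_CASE = """
gmcast.peer_timeout=PT15S;
evs.inactive_check_period=PT2.5S;
evs.keepalive_period=PT1S;
evs.suspect_timeout=PT5S;
evs.inactive_timeout=PT1M;
evs.install_timeout=PT1M;
"""

timeouts = parse_options(GOOD_CASE)
for name, secs in timeouts.items():
    print(f"{name}: {secs} s")
```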

2021-11-15 23:05:58 0 [Note] WSREP: evs::proto(49a4a26a-b4f0, GATHER, view_id(REG,3974f500-ba41,175)) suspecting node: 5256b203-ba1b
2021-11-15 23:05:58 0 [Note] WSREP: evs::proto(49a4a26a-b4f0, GATHER, view_id(REG,3974f500-ba41,175)) suspected node without join message, declaring inactive
2021-11-15 23:05:59 0 [Note] WSREP: declaring 3974f500-ba41 at ssl://172.25.100.205:18301 stable

Long Declare Stable Case:

8f049c9f-9b55 is the node that got rebooted. 8144286d-abff was declared stable 19 seconds later.

gmcast.peer_timeout=PT15S;
evs.inactive_check_period=PT2.5S;
evs.keepalive_period=PT1S;
evs.suspect_timeout=PT30S;
evs.inactive_timeout=PT1M;
evs.install_timeout=PT1M;
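
To double-check that the two test runs differ only in evs.suspect_timeout, the two option lists can be compared mechanically. This is a small Python sketch under that assumption; option_map and changed_options are hypothetical helper names, and the values are compared as raw strings.

```python
def option_map(options):
    """Split a 'key=value;key=value;' string into a dict of raw strings."""
    pairs = (item.strip().partition("=") for item in options.split(";"))
    return {key.strip(): value.strip() for key, _, value in pairs if key.strip()}

def changed_options(before, after):
    """Return {option: (old, new)} for every option whose value differs."""
    a, b = option_map(before), option_map(after)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }

# Settings copied from the good case and the long declare stable case.
GOOD = (
    "gmcast.peer_timeout=PT15S;evs.inactive_check_period=PT2.5S;"
    "evs.keepalive_period=PT1S;evs.suspect_timeout=PT5S;"
    "evs.inactive_timeout=PT1M;evs.install_timeout=PT1M;"
)
SLOW = (
    "gmcast.peer_timeout=PT15S;evs.inactive_check_period=PT2.5S;"
    "evs.keepalive_period=PT1S;evs.suspect_timeout=PT30S;"
    "evs.inactive_timeout=PT1M;evs.install_timeout=PT1M;"
)

diff = changed_options(GOOD, SLOW)
print(diff)
```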

2021-11-15 23:14:25 0 [Note] WSREP: (8980a003-82ed, 'ssl://172.25.100.204:18301') connection to peer 8f049c9f-9b55 with addr ssl://172.25.100.203:18301 timed out, no messages seen in PT15S, socket stats: rtt: 7793 rttvar: 11996 rto: 13312000 lost: 1 last_data_recv: 15310 cwnd: 1 last_queued_since: 315444649 last_delivered_since: 15310482757 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2021-11-15 23:14:25 0 [Note] WSREP: (8980a003-82ed, 'ssl://172.25.100.204:18301') turning message relay requesting on, nonlive peers: ssl://172.25.100.203:18301
2021-11-15 23:14:26 0 [Note] WSREP: (8980a003-82ed, 'ssl://172.25.100.204:18301') reconnecting to 8f049c9f-9b55 (ssl://172.25.100.203:18301), attempt 0
2021-11-15 23:14:39 0 [Note] WSREP: (8980a003-82ed, 'ssl://172.25.100.204:18301') connection established to 8f049c9f-9b56 ssl://172.25.100.203:18301
2021-11-15 23:14:39 0 [Note] WSREP: remote endpoint ssl://172.25.100.203:18301 changed identity 8f049c9f-4626-11ec-9b55-632fa20da5bf -> 8f049c9f-4626-11ec-9b56-632fa20da5bf
2021-11-15 23:14:41 0 [Note] WSREP: evs::proto(8980a003-82ed, OPERATIONAL, view_id(REG,8144286d-abff,183)) suspecting node: 8f049c9f-9b55
2021-11-15 23:14:41 0 [Note] WSREP: evs::proto(8980a003-82ed, OPERATIONAL, view_id(REG,8144286d-abff,183)) suspected node without join message, declaring inactive
2021-11-15 23:14:44 0 [Note] WSREP: declaring 8144286d-abff at ssl://172.25.100.205:18301 stable
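
The delays quoted above can be recomputed from the log timestamps. Below is a rough Python sketch, assuming every log line starts with a "YYYY-MM-DD HH:MM:SS" timestamp as in the excerpts; log_time and elapsed_seconds are hypothetical helpers, and the first long-case line is truncated here for brevity. It only measures the gaps and makes no claim about what causes them.

```python
from datetime import datetime

def log_time(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' timestamp of a log line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")

def elapsed_seconds(first_line, last_line):
    """Seconds between the timestamps of two log lines."""
    return (log_time(last_line) - log_time(first_line)).total_seconds()

# Good case: first "suspecting node" line and the "stable" line.
good_suspect = "2021-11-15 23:05:58 0 [Note] WSREP: evs::proto(49a4a26a-b4f0, GATHER, view_id(REG,3974f500-ba41,175)) suspecting node: 5256b203-ba1b"
good_stable = "2021-11-15 23:05:59 0 [Note] WSREP: declaring 3974f500-ba41 at ssl://172.25.100.205:18301 stable"

# Long case: peer timeout line (truncated), "suspecting node" line, "stable" line.
slow_timeout = "2021-11-15 23:14:25 0 [Note] WSREP: (8980a003-82ed, 'ssl://172.25.100.204:18301') connection to peer 8f049c9f-9b55"
slow_suspect = "2021-11-15 23:14:41 0 [Note] WSREP: evs::proto(8980a003-82ed, OPERATIONAL, view_id(REG,8144286d-abff,183)) suspecting node: 8f049c9f-9b55"
slow_stable = "2021-11-15 23:14:44 0 [Note] WSREP: declaring 8144286d-abff at ssl://172.25.100.205:18301 stable"

good_total = elapsed_seconds(good_suspect, good_stable)
slow_total = elapsed_seconds(slow_timeout, slow_stable)
slow_until_suspect = elapsed_seconds(slow_timeout, slow_suspect)
slow_after_suspect = elapsed_seconds(slow_suspect, slow_stable)

print(f"good case, suspect to stable: {good_total:.0f} s")
print(f"long case, peer timeout to stable: {slow_total:.0f} s")
print(f"long case, peer timeout to suspect: {slow_until_suspect:.0f} s")
print(f"long case, suspect to stable: {slow_after_suspect:.0f} s")
```

In the long case, most of the 19 seconds pass between the peer timeout message and the "suspecting node" message; the final step from suspect to stable takes about as long as in the good case.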

