MariaDB Server / MDEV-27051

Large evs.suspect_timeout causes a long delay before the remaining good node is declared stable

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.5.12
    • Fix Version/s: 10.5
    • Component/s: Galera
    • Labels: None
    • Environment: Red Hat 7 on VMware

    Description

      Hi,

      Our DB setup is 2 data nodes + 1 arbitrator. In one incident, the VM of one data node was rebooted, and we found that the remaining good node took a long time to be declared stable.
      After checking, the difference comes from the parameter evs.suspect_timeout.
      If we set evs.suspect_timeout=PT5S (default), the remaining good node is declared stable within a few seconds.
      If we set evs.suspect_timeout=PT30S, declaring the remaining good node stable takes more than 20 seconds.

      What causes this long delay before the remaining good node is declared stable?

      Can we keep a large evs.suspect_timeout and still have a short declare-stable time?

      Good Case:

      49a4a26a-b4f0 is the node that was rebooted. 3974f500-ba41 was declared stable just a few seconds later.

      gmcast.peer_timeout=PT15S;
      evs.inactive_check_period=PT2.5S;
      evs.keepalive_period=PT1S;
      evs.suspect_timeout=PT5S;
      evs.inactive_timeout=PT1M;
      evs.install_timeout=PT1M;

      2021-11-15 23:05:58 0 [Note] WSREP: evs::proto(49a4a26a-b4f0, GATHER, view_id(REG,3974f500-ba41,175)) suspecting node: 5256b203-ba1b
      2021-11-15 23:05:58 0 [Note] WSREP: evs::proto(49a4a26a-b4f0, GATHER, view_id(REG,3974f500-ba41,175)) suspected node without join message, declaring inactive
      2021-11-15 23:05:59 0 [Note] WSREP: declaring 3974f500-ba41 at ssl://172.25.100.205:18301 stable

      Long Declare Stable Case:

      8f049c9f-9b55 is the node that was rebooted. 8144286d-abff was declared stable 19 seconds later.

      gmcast.peer_timeout=PT15S;
      evs.inactive_check_period=PT2.5S;
      evs.keepalive_period=PT1S;
      evs.suspect_timeout=PT30S;
      evs.inactive_timeout=PT1M;
      evs.install_timeout=PT1M;
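
      The two option lists above differ in a single value. As a minimal sketch, not part of the reported setup, the following Python converts the time-only ISO 8601 periods (PT5S, PT2.5S, PT1M) to seconds and reports which keys differ. The helper names period_to_seconds, parse_options and diff_options are illustrative, and only the hour/minute/second forms are handled.

```python
import re

# Option lists copied from the two cases in this report.
GOOD_CASE = (
    "gmcast.peer_timeout=PT15S;evs.inactive_check_period=PT2.5S;"
    "evs.keepalive_period=PT1S;evs.suspect_timeout=PT5S;"
    "evs.inactive_timeout=PT1M;evs.install_timeout=PT1M;"
)
LONG_CASE = (
    "gmcast.peer_timeout=PT15S;evs.inactive_check_period=PT2.5S;"
    "evs.keepalive_period=PT1S;evs.suspect_timeout=PT30S;"
    "evs.inactive_timeout=PT1M;evs.install_timeout=PT1M;"
)

# Time-only periods: optional hours, minutes, seconds after the "PT" prefix.
_PERIOD = re.compile(
    r"^PT(?:(\d+(?:\.\d+)?)H)?(?:(\d+(?:\.\d+)?)M)?(?:(\d+(?:\.\d+)?)S)?$"
)


def period_to_seconds(value):
    """Convert a period such as PT2.5S or PT1M to a number of seconds."""
    match = _PERIOD.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported period: {value!r}")
    hours, minutes, seconds = (float(g) if g else 0.0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def parse_options(text):
    """Split 'key=value;key=value;' into a dict mapping key to seconds."""
    options = {}
    for item in text.split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition("=")
            options[key.strip()] = period_to_seconds(value.strip())
    return options


def diff_options(a, b):
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    print(diff_options(parse_options(GOOD_CASE), parse_options(LONG_CASE)))
    # prints {'evs.suspect_timeout': (5.0, 30.0)}
```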

      2021-11-15 23:14:25 0 [Note] WSREP: (8980a003-82ed, 'ssl://172.25.100.204:18301') connection to peer 8f049c9f-9b55 with addr ssl://172.25.100.203:18301 timed out, no messages seen in PT15S, socket stats: rtt: 7793 rttvar: 11996 rto: 13312000 lost: 1 last_data_recv: 15310 cwnd: 1 last_queued_since: 315444649 last_delivered_since: 15310482757 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
      2021-11-15 23:14:25 0 [Note] WSREP: (8980a003-82ed, 'ssl://172.25.100.204:18301') turning message relay requesting on, nonlive peers: ssl://172.25.100.203:18301
      2021-11-15 23:14:26 0 [Note] WSREP: (8980a003-82ed, 'ssl://172.25.100.204:18301') reconnecting to 8f049c9f-9b55 (ssl://172.25.100.203:18301), attempt 0
      2021-11-15 23:14:39 0 [Note] WSREP: (8980a003-82ed, 'ssl://172.25.100.204:18301') connection established to 8f049c9f-9b56 ssl://172.25.100.203:18301
      2021-11-15 23:14:39 0 [Note] WSREP: remote endpoint ssl://172.25.100.203:18301 changed identity 8f049c9f-4626-11ec-9b55-632fa20da5bf -> 8f049c9f-4626-11ec-9b56-632fa20da5bf
      2021-11-15 23:14:41 0 [Note] WSREP: evs::proto(8980a003-82ed, OPERATIONAL, view_id(REG,8144286d-abff,183)) suspecting node: 8f049c9f-9b55
      2021-11-15 23:14:41 0 [Note] WSREP: evs::proto(8980a003-82ed, OPERATIONAL, view_id(REG,8144286d-abff,183)) suspected node without join message, declaring inactive
      2021-11-15 23:14:44 0 [Note] WSREP: declaring 8144286d-abff at ssl://172.25.100.205:18301 stable
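
      To put a number on the delay in each case, the gap between a chosen starting line and the following "declaring ... stable" line can be computed from the log timestamps. The Python below is a rough sketch that assumes the log format shown above (a 19-character timestamp at the start of each line). The log lines are shortened copies of the ones in this report, and the function name seconds_until_stable is hypothetical.

```python
from datetime import datetime

# Shortened copies of the log lines quoted in this report.
GOOD_CASE_LOG = """\
2021-11-15 23:05:58 0 [Note] WSREP: evs::proto(49a4a26a-b4f0, GATHER, view_id(REG,3974f500-ba41,175)) suspecting node: 5256b203-ba1b
2021-11-15 23:05:59 0 [Note] WSREP: declaring 3974f500-ba41 at ssl://172.25.100.205:18301 stable
"""
LONG_CASE_LOG = """\
2021-11-15 23:14:25 0 [Note] WSREP: (8980a003-82ed, 'ssl://172.25.100.204:18301') connection to peer 8f049c9f-9b55 with addr ssl://172.25.100.203:18301 timed out, no messages seen in PT15S
2021-11-15 23:14:41 0 [Note] WSREP: evs::proto(8980a003-82ed, OPERATIONAL, view_id(REG,8144286d-abff,183)) suspecting node: 8f049c9f-9b55
2021-11-15 23:14:44 0 [Note] WSREP: declaring 8144286d-abff at ssl://172.25.100.205:18301 stable
"""


def timestamp(line):
    """Parse the 'YYYY-MM-DD HH:MM:SS' prefix of a log line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def seconds_until_stable(log_text, start_marker):
    """Seconds from the first line containing start_marker to the next
    'declaring ... stable' line, or None if either line is missing."""
    start = None
    for line in log_text.splitlines():
        if start is None and start_marker in line:
            start = timestamp(line)
        elif (
            start is not None
            and "declaring " in line
            and line.rstrip().endswith(" stable")
        ):
            return (timestamp(line) - start).total_seconds()
    return None


if __name__ == "__main__":
    print(seconds_until_stable(GOOD_CASE_LOG, "suspecting node"))  # prints 1.0
    print(seconds_until_stable(LONG_CASE_LOG, "timed out"))        # prints 19.0
```

      Measured from the "suspecting node" line instead, the long case gives 3 seconds, so most of the 19 seconds there elapses before the node is suspected at all.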

          People

            Assignee: Teemu Ollakka (teemu.ollakka)
            Reporter: William Wong (frelist)