Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-22724

Galera cluster losing primary component

    XMLWordPrintable

Details

    Description

      Unsure why this keeps happening. The cluster keeps shifting all nodes from PRIMARY to NON-PRIMARY at the exact same time. The logs are vaguely implying that updates are only being pushed to half the cluster, and not the other half, so it is splitting itself.

      Currently the cluster is 6 nodes spread across 3 data centers with 2 nodes per data center.

      When this occurs, the log message gets massively flooded. Overnight, each server generated anywhere in the neighborhood of ~50-100MB in the error log file, flooded with the same text over and over again showing the unstable cluster state. This makes it difficult to find where/how things are breaking.

      2020-05-26 10:18:47 0 [Note] WSREP: (14cf7875, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.16.8.12:4567
      2020-05-26 10:18:47 0 [Note] WSREP: (14cf7875, 'tcp://0.0.0.0:4567') connection established to f57bde06 tcp://172.16.8.12:4567
      2020-05-26 10:18:47 0 [Note] WSREP: declaring 3cfcb3bd at tcp://172.16.12.61:4567 stable
      2020-05-26 10:18:47 0 [Note] WSREP: declaring 454edd57 at tcp://172.16.20.61:4567 stable
      2020-05-26 10:18:47 0 [Note] WSREP: declaring 54695c9f at tcp://172.16.12.62:4567 stable
      2020-05-26 10:18:47 0 [Note] WSREP: declaring 74c795ad at tcp://172.16.8.11:4567 stable
      2020-05-26 10:18:47 0 [Note] WSREP: declaring f57bde06 at tcp://172.16.8.12:4567 stable
      2020-05-26 10:18:47 0 [Note] WSREP: Node 14cf7875 state prim
      2020-05-26 10:18:47 0 [Note] WSREP: view(view_id(PRIM,14cf7875,7) memb {
          14cf7875,2
          3cfcb3bd,1
          454edd57,2
          54695c9f,1
          74c795ad,8
          f57bde06,8
      } joined {
      } left {
      } partitioned {
      })
      2020-05-26 10:18:47 0 [Note] WSREP: save pc into disk
      2020-05-26 10:18:47 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 6
      2020-05-26 10:18:47 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: f5acf28b-9f74-11ea-a86a-d6a84667ccd7
      2020-05-26 10:18:47 0 [Note] WSREP: STATE EXCHANGE: sent state msg: f5acf28b-9f74-11ea-a86a-d6a84667ccd7
      2020-05-26 10:18:47 0 [Note] WSREP: STATE EXCHANGE: got state msg: f5acf28b-9f74-11ea-a86a-d6a84667ccd7 from 4 (puy-maria-01)
      2020-05-26 10:18:47 0 [Note] WSREP: STATE EXCHANGE: got state msg: f5acf28b-9f74-11ea-a86a-d6a84667ccd7 from 0 (sea-maria-02)
      2020-05-26 10:18:47 0 [Note] WSREP: STATE EXCHANGE: got state msg: f5acf28b-9f74-11ea-a86a-d6a84667ccd7 from 2 (sea-maria-01)
      2020-05-26 10:18:47 0 [Note] WSREP: STATE EXCHANGE: got state msg: f5acf28b-9f74-11ea-a86a-d6a84667ccd7 from 3 (tac-maria-02)
      2020-05-26 10:18:47 0 [Note] WSREP: STATE EXCHANGE: got state msg: f5acf28b-9f74-11ea-a86a-d6a84667ccd7 from 1 (tac-cospix-maria-01)
      2020-05-26 10:18:47 0 [Note] WSREP: STATE EXCHANGE: got state msg: f5acf28b-9f74-11ea-a86a-d6a84667ccd7 from 5 (puy-maria-02)
      2020-05-26 10:18:47 0 [Note] WSREP: Quorum results:
          version    = 5,
          component  = PRIMARY,
          conf_id    = 5,
          members    = 5/6 (joined/total),
          act_id     = 159998,
          last_appl. = 159942,
          protocols  = 1/10/4 (gcs/repl/appl),
          vote policy= 0,
          group UUID = 00c6c8ed-9281-11ea-b296-5b01975a9d80
      2020-05-26 10:18:47 0 [Note] WSREP: Writing down CC checksum: c3bcc75b e521c376 e088241f 2abed3ee at offset 456
      2020-05-26 10:18:47 0 [Note] WSREP: Flow-control interval: [39, 39]
      2020-05-26 10:18:47 0 [Note] WSREP: Trying to continue unpaused monitor
      2020-05-26 10:18:47 2 [Note] WSREP: ####### processing CC 159999, local, ordered
      2020-05-26 10:18:47 2 [Note] WSREP: ####### drain monitors upto 159998
      2020-05-26 10:18:47 2 [Note] WSREP: REPL Protocols: 10 (5, 3)
      2020-05-26 10:18:47 2 [Note] WSREP: ####### My UUID: 14cf7875-9f6d-11ea-be91-3b28d841cd17
      2020-05-26 10:18:47 2 [Note] WSREP: ####### ST not required
      2020-05-26 10:18:47 2 [Note] WSREP: Skipping cert index reset
      2020-05-26 10:18:47 2 [Note] WSREP: ####### Adjusting cert position: 159998 -> 159999
      2020-05-26 10:18:47 0 [Note] WSREP: Service thread queue flushed.
      2020-05-26 10:18:47 2 [Note] WSREP: ####### Setting monitor position to 159999
      2020-05-26 10:18:47 2 [Note] WSREP: Lowest cert indnex boundary for CC from group: 159943
      2020-05-26 10:18:47 2 [Note] WSREP: Min available from gcache for CC from group: 156743
      2020-05-26 10:18:47 2 [Note] WSREP: ================================================
      View:
        id: 00c6c8ed-9281-11ea-b296-5b01975a9d80:159999
        status: primary
        protocol_version: 4
        capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
        final: no
        own_index: 0
        members(6):
          0: 14cf7875-9f6d-11ea-be91-3b28d841cd17, sea-maria-02
          1: 3cfcb3bd-9f66-11ea-b50e-4306f593b013, tac-cospix-maria-01
          2: 454edd57-9f6b-11ea-aeea-2bf542f463e4, sea-maria-01
          3: 54695c9f-9f66-11ea-a4d8-7ba7a1c6775f, tac-maria-02
          4: 74c795ad-9f6e-11ea-989d-9313028d7e05, puy-maria-01
          5: f57bde06-9f74-11ea-968d-af0c6e831d4e, puy-maria-02
      =================================================
      2020-05-26 10:18:47 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2020-05-26 10:18:47 0 [Note] WSREP: Member 5.8 (puy-maria-02) requested state transfer from '*any*'. Selected 1.1 (tac-cospix-maria-01)(SYNCED) as donor.
      2020-05-26 10:18:48 0 [Note] WSREP: 1.1 (tac-cospix-maria-01): State transfer to 5.8 (puy-maria-02) complete.
      2020-05-26 10:18:48 0 [Note] WSREP: Member 1.1 (tac-cospix-maria-01) synced with group.
      2020-05-26 10:18:50 0 [Note] WSREP: (14cf7875, 'tcp://0.0.0.0:4567') turning message relay requesting off
      2020-05-26 10:33:52 0 [Note] WSREP: (14cf7875, 'tcp://0.0.0.0:4567') connection to peer 74c795ad with addr tcp://172.16.8.11:4567 timed out, no messages seen in PT3S
      2020-05-26 10:33:52 0 [Note] WSREP: (14cf7875, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.16.8.11:4567
      2020-05-26 10:33:53 0 [Note] WSREP: (14cf7875, 'tcp://0.0.0.0:4567') reconnecting to 74c795ad (tcp://172.16.8.11:4567), attempt 0
      2020-05-26 10:33:56 0 [Note] WSREP: (14cf7875, 'tcp://0.0.0.0:4567') connection to peer 74c795ad with addr tcp://172.16.8.11:4567 timed out, no messages seen in PT3S
      2020-05-26 10:33:56 0 [Note] WSREP: (14cf7875, 'tcp://0.0.0.0:4567') connection established to 74c795ad tcp://172.16.8.11:4567
      2020-05-26 10:33:59 0 [Note] WSREP: (14cf7875, 'tcp://0.0.0.0:4567') turning message relay requesting off
      2020-05-26 10:51:00 0 [Note] WSREP: view(view_id(NON_PRIM,14cf7875,7) memb {
          14cf7875,2
          3cfcb3bd,1
          54695c9f,1
      } joined {
      } left {
      } partitioned {
          454edd57,2
          74c795ad,8
          f57bde06,8
      })
      2020-05-26 10:51:00 0 [Note] WSREP: declaring 3cfcb3bd at tcp://172.16.12.61:4567 stable
      2020-05-26 10:51:00 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 3
      2020-05-26 10:51:00 0 [Note] WSREP: declaring 54695c9f at tcp://172.16.12.62:4567 stable
      2020-05-26 10:51:00 0 [Note] WSREP: Writing down CC checksum: d8cb29c2 cdf070a4 64dba1ce 6fc2471e at offset 264
      2020-05-26 10:51:00 0 [Note] WSREP: Flow-control interval: [28, 28]
      2020-05-26 10:51:00 0 [Note] WSREP: Trying to continue unpaused monitor
      2020-05-26 10:51:00 0 [Note] WSREP: Received NON-PRIMARY.
      2020-05-26 10:51:00 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 161343)
      2020-05-26 10:51:00 2 [Note] WSREP: ####### processing CC -1, local, ordered
      2020-05-26 10:51:00 2 [Note] WSREP: ####### drain monitors upto 161343
      2020-05-26 10:51:00 2 [Note] WSREP: ####### My UUID: 14cf7875-9f6d-11ea-be91-3b28d841cd17
      2020-05-26 10:51:00 2 [Note] WSREP: ####### ST not required
      2020-05-26 10:51:00 2 [Note] WSREP: ================================================
      View:
        id: 00c6c8ed-9281-11ea-b296-5b01975a9d80:-1
        status: non-primary
        protocol_version: 4
        capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
        final: no
        own_index: 0
        members(3):
          0: 14cf7875-9f6d-11ea-be91-3b28d841cd17, sea-maria-02
          1: 3cfcb3bd-9f66-11ea-b50e-4306f593b013, tac-cospix-maria-01
          2: 54695c9f-9f66-11ea-a4d8-7ba7a1c6775f, tac-maria-02
      =================================================
      2020-05-26 10:51:00 2 [Note] WSREP: Non-primary view
      2020-05-26 10:51:00 2 [Note] WSREP: Server status change synced -> connected
      2020-05-26 10:51:00 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2020-05-26 10:51:00 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2020-05-26 10:51:00 0 [Note] WSREP: view(view_id(NON_PRIM,14cf7875,8) memb {
          14cf7875,2
          3cfcb3bd,1
          54695c9f,1
      } joined {
      } left {
      } partitioned {
          454edd57,2
          74c795ad,8
          f57bde06,8
      })
      2020-05-26 10:51:00 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 3
      2020-05-26 10:51:00 0 [Note] WSREP: Writing down CC checksum: d8cb29c2 cdf070a4 64dba1ce 6fc2471e at offset 264
      2020-05-26 10:51:00 0 [Note] WSREP: Flow-control interval: [28, 28]
      2020-05-26 10:51:00 0 [Note] WSREP: Trying to continue unpaused monitor
      2020-05-26 10:51:00 0 [Note] WSREP: Received NON-PRIMARY.
      2020-05-26 10:51:00 2 [Note] WSREP: ####### processing CC -1, local, ordered
      2020-05-26 10:51:00 2 [Note] WSREP: ####### drain monitors upto 161343
      2020-05-26 10:51:00 2 [Note] WSREP: ####### My UUID: 14cf7875-9f6d-11ea-be91-3b28d841cd17
      2020-05-26 10:51:00 2 [Note] WSREP: ####### ST not required
      2020-05-26 10:51:00 2 [Note] WSREP: ================================================
      View:
        id: 00c6c8ed-9281-11ea-b296-5b01975a9d80:-1
        status: non-primary
        protocol_version: 4
        capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
        final: no
        own_index: 0
        members(3):
          0: 14cf7875-9f6d-11ea-be91-3b28d841cd17, sea-maria-02
          1: 3cfcb3bd-9f66-11ea-b50e-4306f593b013, tac-cospix-maria-01
          2: 54695c9f-9f66-11ea-a4d8-7ba7a1c6775f, tac-maria-02
      =================================================
      2020-05-26 10:51:00 2 [Note] WSREP: Non-primary view
      2020-05-26 10:51:00 2 [Note] WSREP: Server status change connected -> connected
      2020-05-26 10:51:00 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2020-05-26 10:51:00 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      

      Attachments

        Issue Links

          Activity

            People

              janlindstrom Jan Lindström
              darkain Vincent Milum Jr
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.