MariaDB Server / MDEV-33495

Graceful node shutdown can crash Garbd and Cluster can go non-Primary when SSL is used


Details

    Description

      Server: 10.6.14-9-MariaDB-enterprise-log
      Galera: 26.4.14-1 (ra480e94)

      Tested with a Galera cluster of 2-4 nodes + 1 garbd, using SSL for Galera replication.

      This could only be reproduced when SSL is used.

      The reproduction is fairly straightforward; the steps are (a configuration sketch follows the list):

      • Set up a Galera cluster with at least 2 nodes using SSL for wsrep replication
      • Add garbd using SSL as well
      • Stop one node
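
      For reference, a rough sketch of the kind of configuration involved; the cluster name, certificate paths and provider path below are placeholders, not the exact values from my setup:

      ============
      # one node's my.cnf, wsrep section (placeholder paths; ports as in the key further down)
      [mysqld]
      wsrep_on               = ON
      wsrep_cluster_name     = ssl_cluster
      wsrep_provider         = /usr/lib/galera/libgalera_smm.so
      wsrep_cluster_address  = gcomm://192.168.178.24:11614,192.168.178.24:21614,192.168.178.24:31614,192.168.178.24:41614
      wsrep_provider_options = "gmcast.listen_addr=ssl://0.0.0.0:11614;socket.ssl=yes;socket.ssl_key=/path/to/server-key.pem;socket.ssl_cert=/path/to/server-cert.pem;socket.ssl_ca=/path/to/ca.pem"

      # garbd.conf (same CA/cert/key so garbd also replicates over SSL)
      group   = ssl_cluster
      address = gcomm://192.168.178.24:11614,192.168.178.24:21614,192.168.178.24:31614,192.168.178.24:41614
      options = gmcast.listen_addr=ssl://0.0.0.0:22222;socket.ssl=yes;socket.ssl_key=/path/to/server-key.pem;socket.ssl_cert=/path/to/server-cert.pem;socket.ssl_ca=/path/to/ca.pem
      ============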

      Note: the node that causes the issue tends to be the same one; in a two-node cluster with Node1 and Node2 it may consistently be Node1, but unfortunately not always.

      If you repeat this enough times you should see garbd crash and the cluster go non-Primary.

      When one of the Galera nodes is gracefully stopped, garbd can stop abruptly.
      When garbd stops abruptly, the outcome most of the time is that the cluster goes into a non-Primary state.

      I also tested with 4 Galera nodes + 1 garbd, supposing that the garbd crash happened so suddenly that the quorum had not yet been recalculated after the graceful shutdown of the node and the cluster momentarily found itself without quorum; but even with 4+1, stopping one node plus the garbd crash could cause quorum to be lost while in theory three nodes were still up.
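
      As I understand Galera's quorum calculation (assuming default pc.weight = 1 everywhere), a graceful shutdown first removes the leaving node from the group, so the remaining 4 members (3 data nodes + garbd) form the new group; when garbd then crashes, 3 of those 4 members are still alive, i.e. 75% > 50%, so the survivors should keep quorum and stay Primary. That is what makes case [C] below surprising.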

      It does not always happen like that; what happens falls into one of three categories (example checks for classifying a run are shown after the list).

      A node is gracefully stopped and:

      [A] Garbd does NOT crash, Cluster is Primary
      [B] Garbd crashes, Cluster stays Primary
      [C] Garbd crashes, Cluster goes non-Primary
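
      A quick way to classify a run after stopping the node (host/ports as in my test setup; adjust as needed):

      ============
      # is garbd still running?
      pgrep -a garbd

      # cluster state as seen from a surviving node, e.g. 30614
      mariadb -h 127.0.0.1 -P 30614 -e "SHOW STATUS LIKE 'wsrep_cluster_status'; SHOW STATUS LIKE 'wsrep_cluster_size'"
      ============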

      Some findings from extensive testing:

      • It only happens when SSL is enabled
      • It is not necessarily the node that was indicated as garbd's donor when garbd joined
      • It tends to always be the same node
      • It is not the node with the lowest cluster index
      • There does not seem to be a relationship between garbd's index and the 'special' node's index (at least the pattern is not easy to spot)
      • Stopping that node does not always cause the issue; garbd may stay connected and quorum may not be lost
      • When the 'suspect' node is stopped and the issue does NOT show up, I could then stop all the other nodes and leave only garbd running without any issue, so the "speciality" did not move to any other node; the issue simply did not occur in that run.

      My tests use the same host (so the same IP) and different ports, but the customer hit this with nodes on different hosts.

      For interpreting the logs you may find this key useful:

      IP: 192.168.178.24
      DB port / server id / node name: 10614, 20614, 30614, 40614
      Galera wsrep ports: 11614, 21614, 31614, 41614; garbd: 22222
      

      For case [B] we can see in Node 3's (31614) log that when 10614 is stopped, 22222 (garbd) also disappears:

      ============
      ...
      ...
      ...
      2024-01-10 16:27:25 0 [Note] WSREP: (4ac7cd7d-ab72, 'ssl://192.168.178.24:31614') turning message relay requesting off
       
      STOP NODE_10614-a58a4469-a8d8
       
      2024-01-10 16:27:29 0 [Note] WSREP: (4ac7cd7d-ab72, 'ssl://192.168.178.24:31614') turning message relay requesting on, nonlive peers: ssl://192.168.178.24:11614 
      2024-01-10 16:27:30 0 [Note] WSREP: (4ac7cd7d-ab72, 'ssl://192.168.178.24:31614') reconnecting to a58a4469-a8d8 (ssl://192.168.178.24:11614), attempt 0
      2024-01-10 16:27:30 0 [Note] WSREP: (4ac7cd7d-ab72, 'ssl://192.168.178.24:31614') reconnecting to 02ceb1f8-bfaf (ssl://192.168.178.24:22222), attempt 0    <--- Why?
      ============
      

      Nevertheless, as perhaps expected, after some message exchange the cluster reconfigures itself as a 3-node cluster in Primary state:

      2024-01-10 16:27:34 0 [Note] WSREP: evs::proto(4ac7cd7d-ab72, INSTALL, view_id(REG,02ceb1f8-bfaf,44)) suspecting node: 02ceb1f8-bfaf
      2024-01-10 16:27:34 0 [Note] WSREP: evs::proto(4ac7cd7d-ab72, INSTALL, view_id(REG,02ceb1f8-bfaf,44)) suspecting node: a58a4469-a8d8
      ....
      ....
      ....
      2024-01-10 16:27:37 0 [Note] WSREP: view(view_id(PRIM,427856fb-9778,45) memb {
      	427856fb-9778,0
      	4ac7cd7d-ab72,0
      	bfa02389-938f,0
      } joined {
      } left {
      } partitioned {
      	02ceb1f8-bfaf,0
      	a58a4469-a8d8,0
      })
      2024-01-10 16:27:37 0 [Note] WSREP: save pc into disk
      2024-01-10 16:27:37 0 [Note] WSREP: forgetting 02ceb1f8-bfaf (ssl://192.168.178.24:22222)
      2024-01-10 16:27:37 0 [Note] WSREP: forgetting a58a4469-a8d8 (ssl://192.168.178.24:11614)
      2024-01-10 16:27:37 0 [Note] WSREP: (4ac7cd7d-ab72, 'ssl://192.168.178.24:31614') turning message relay requesting off
      2024-01-10 16:27:37 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
      ...
      ...
      ...
      2024-01-10 16:27:37 0 [Note] WSREP: forgetting 02ceb1f8-bfaf (ssl://192.168.178.24:22222)
      2024-01-10 16:27:37 0 [Note] WSREP: forgetting a58a4469-a8d8 (ssl://192.168.178.24:11614)
      

      Note that, from the top of this log snippet, only a58a4469-a8d8 [node 11614] and 02ceb1f8-bfaf [garbd] are lost.

      The corresponding garbd log section is:

       
      2024-01-10 16:27:23.326  INFO: view(view_id(PRIM,02ceb1f8-bfaf,44) memb {
      	02ceb1f8-bfaf,0
      	427856fb-9778,0
      	4ac7cd7d-ab72,0
      	a58a4469-a8d8,0
      	bfa02389-938f,0
      } joined {
      } left {
      } partitioned {
      })
      2024-01-10 16:27:23.326  INFO: save pc into disk
      2024-01-10 16:27:23.329  INFO: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 5
      2024-01-10 16:27:23.329  INFO: STATE_EXCHANGE: sent state UUID: c0868bd1-afcc-11ee-828b-7a2e32139412
      2024-01-10 16:27:23.331  INFO: STATE EXCHANGE: sent state msg: c0868bd1-afcc-11ee-828b-7a2e32139412
      2024-01-10 16:27:23.331  INFO: STATE EXCHANGE: got state msg: c0868bd1-afcc-11ee-828b-7a2e32139412 from 0 (garb)
      2024-01-10 16:27:23.331  INFO: STATE EXCHANGE: got state msg: c0868bd1-afcc-11ee-828b-7a2e32139412 from 2 (NODE_30614)
      2024-01-10 16:27:23.332  INFO: STATE EXCHANGE: got state msg: c0868bd1-afcc-11ee-828b-7a2e32139412 from 3 (NODE_10614)
      2024-01-10 16:27:23.332  INFO: STATE EXCHANGE: got state msg: c0868bd1-afcc-11ee-828b-7a2e32139412 from 1 (NODE_20614)
      2024-01-10 16:27:23.821  INFO: STATE EXCHANGE: got state msg: c0868bd1-afcc-11ee-828b-7a2e32139412 from 4 (NODE_40614)
      2024-01-10 16:27:23.821  INFO: Quorum results:
      	version    = 6,
      	component  = PRIMARY,
      	conf_id    = 41,
      	members    = 4/5 (joined/total),
      	act_id     = 57,
      	last_appl. = 0,
      	protocols  = 2/10/4 (gcs/repl/appl),
      	vote policy= 0,
      	group UUID = 9455f258-af00-11ee-9184-5ee699307cf2
      2024-01-10 16:27:23.821  INFO: Flow-control interval: [9999999, 9999999]
      2024-01-10 16:27:24.218  WARN: Protocol violation. JOIN message sender 2.0 (NODE_30614) is not in state transfer (SYNCED). Message ignored.
      2024-01-10 16:27:24.348  WARN: Rejecting JOIN message from 4.0 (NODE_40614): new State Transfer required.
      2024-01-10 16:27:24.348  WARN: SYNC message from non-JOINED 4.0 (NODE_40614, PRIMARY). Ignored.
      2024-01-10 16:27:25.255  INFO: (02ceb1f8-bfaf, 'ssl://192.168.178.24:22222') turning message relay requesting off
      <<< CRASH >>>
      mysql@fedora:/run/media/claudio/FedoraData/myharem/instances/garbd$ ../10614/bin/garbd --cfg ./garbd.conf 
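
      If it helps, the crash point can probably be captured by running garbd under gdb (a sketch, using the same paths as above):

      ============
      ulimit -c unlimited                       # allow core dumps as a fallback
      gdb --args ../10614/bin/garbd --cfg ./garbd.conf
      (gdb) run
      ... gracefully stop the 'suspect' node to reproduce ...
      (gdb) bt                                  # backtrace once garbd crashes
      ============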
      

      Now let's look at case [C], still with 5 nodes in total, from 21614's log, stopping 11614:

      2024-01-10 16:32:54 0 [Note] WSREP: (427856fb-9778, 'ssl://192.168.178.24:21614') turning message relay requesting off
      ....time passed....
      2024-01-10 17:18:05 0 [Note] WSREP: (427856fb-9778, 'ssl://192.168.178.24:21614') turning message relay requesting on, nonlive peers: ssl://192.168.178.24:11614 
      2024-01-10 17:18:05 0 [Note] WSREP: declaring 02ceb1f8-bfb1 at ssl://192.168.178.24:22222 stable
      2024-01-10 17:18:05 0 [Note] WSREP: declaring 4ac7cd7d-ab72 at ssl://192.168.178.24:31614 stable
      2024-01-10 17:18:05 0 [Note] WSREP: declaring 633be1f6-a9d0 at ssl://192.168.178.24:41614 stable
      2024-01-10 17:18:05 0 [Note] WSREP: forgetting 8417f130-b31c (ssl://192.168.178.24:11614)
      2024-01-10 17:18:05 0 [Note] WSREP: (427856fb-9778, 'ssl://192.168.178.24:21614') turning message relay requesting off
      2024-01-10 17:18:05 0 [Note] WSREP: (427856fb-9778, 'ssl://192.168.178.24:21614') turning message relay requesting on, nonlive peers: ssl://192.168.178.24:22222 
      2024-01-10 17:18:07 0 [Note] WSREP: (427856fb-9778, 'ssl://192.168.178.24:21614') reconnecting to 02ceb1f8-bfb1 (ssl://192.168.178.24:22222), attempt 0
      2024-01-10 17:18:11 0 [Note] WSREP:  cleaning up 8417f130-b31c (ssl://192.168.178.24:11614)
      2024-01-10 17:18:11 0 [Note] WSREP: evs::proto(427856fb-9778, GATHER, view_id(REG,02ceb1f8-bfb1,59)) suspecting node: 02ceb1f8-bfb1
      2024-01-10 17:18:11 0 [Note] WSREP: evs::proto(427856fb-9778, GATHER, view_id(REG,02ceb1f8-bfb1,59)) suspected node without join message, declaring inactive
      2024-01-10 17:18:11 0 [Note] WSREP: evs::proto(427856fb-9778, GATHER, view_id(REG,02ceb1f8-bfb1,59)) suspecting node: 4ac7cd7d-ab72
      2024-01-10 17:18:11 0 [Note] WSREP: evs::proto(427856fb-9778, GATHER, view_id(REG,02ceb1f8-bfb1,59)) suspected node without join message, declaring inactive
      2024-01-10 17:18:11 0 [Note] WSREP: view(view_id(NON_PRIM,02ceb1f8-bfb1,59) memb {
      	427856fb-9778,0
      	633be1f6-a9d0,0
      } joined {
      } left {
      } partitioned {
      	02ceb1f8-bfb1,0  [Garbd crashed]
      	4ac7cd7d-ab72,0  [Why 31614 ?]
      	8417f130-b31c,0   [11614 - We stopped it]
      })
      2024-01-10 17:18:11 0 [Note] WSREP: declaring 633be1f6-a9d0 at ssl://192.168.178.24:41614 stable
      2024-01-10 17:18:11 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 2
      

      It's confusing because the node first declares garbd/22222, 31614 and 41614 stable, and then reports 22222, 11614 and 31614 [!!!] as partitioned. Why is 31614 / 4ac7cd7d-ab72 lost at all?

      As you can see in the 'suspecting node' lines, instead of 11614 and 22222 we find the UUIDs of 22222 and 31614.

      Again, I could not reproduce this without SSL.


          People

            sysprg Julius Goryavsky
            janlindstrom Jan Lindström

