Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- 10.5
Description
:: 10.6.14-9-MariaDB-enterprise-log
:: Galera 26.4.14-1(ra480e94)
Tested with a Galera cluster with [2-4] Nodes + 1 Garbd using SSL for Galera replication.
This could only be reproduced when SSL is used.
Reproduction is pretty straightforward; the steps are:
- Setup a Galera cluster with at least 2 nodes using SSL for Wsrep replication
- Add garbd using SSL as well
- Stop one node
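In my.cnf/garbd.conf terms, the SSL setup for the steps above is along these lines (paths, cluster name and addresses are illustrative placeholders, not the exact test config):

```ini
# my.cnf on each node (hypothetical paths and cluster name)
[mysqld]
wsrep_on = ON
wsrep_cluster_name = testcluster
wsrep_cluster_address = gcomm://192.168.178.24:11614,192.168.178.24:21614
wsrep_provider_options = "socket.ssl=yes;socket.ssl_key=/etc/ssl/galera/server-key.pem;socket.ssl_cert=/etc/ssl/galera/server-cert.pem;socket.ssl_ca=/etc/ssl/galera/ca.pem"

# garbd.conf (hypothetical paths)
address = gcomm://192.168.178.24:11614?socket.ssl=yes
group = testcluster
options = socket.ssl_key=/etc/ssl/galera/server-key.pem;socket.ssl_cert=/etc/ssl/galera/server-cert.pem;socket.ssl_ca=/etc/ssl/galera/ca.pem
```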
Note: the node that triggers the issue tends to be the same one; in a two-node cluster with Node1 and Node2 it might, for example, always be Node1, although unfortunately not every time.
If you repeat this enough times you should see garbd crash and the cluster go non-Primary.
When one of the Galera nodes is gracefully stopped, garbd can stop abruptly.
When garbd stops abruptly, the outcome (most of the time) is that the cluster ends up in a non-Primary state.
I also tested with 4 Galera nodes + 1 garbd, supposing that the garbd crash happened so suddenly that quorum had not yet been recalculated after the graceful shutdown, leaving the cluster momentarily without quorum. However, even with 4+1, stopping one node plus the garbd crash can lose quorum, although in theory three nodes are still up.
It does not always play out the same way; what happens falls into one of three categories:
A node is gracefully stopped and
[A] Garbd does NOT crash, Cluster is Primary
[B] Garbd crashes, Cluster stays Primary
[C] Garbd crashes, Cluster goes non-Primary
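To clarify the quorum arithmetic behind cases [B] and [C]: a surviving component stays Primary only if it holds a strict majority of the previous Primary Component, where gracefully stopped nodes shrink the baseline while crashed/partitioned members still count against it. A minimal sketch, assuming equal member weights (ignoring pc.weight and the actual weighted implementation):

```python
# Simplified sketch of the Galera quorum rule (NOT the real weighted
# implementation): graceful leavers shrink the baseline, crashed or
# partitioned members do not.
def is_primary(prev_members: int, graceful_leaves: int, partitioned: int) -> bool:
    baseline = prev_members - graceful_leaves    # leavers are excluded from the count
    remaining = baseline - partitioned           # members still reachable
    return 2 * remaining > baseline              # strict majority of the baseline

# 5 members (4 nodes + garbd): one node stops gracefully, garbd crashes.
print(is_primary(5, 1, 1))  # True  -> case [B]: 3 of 4, stays Primary

# Case [C]: 31614 is additionally treated as partitioned.
print(is_primary(5, 1, 2))  # False -> 2 of 4, non-Primary
```

Under this rule the 4+1 cluster should survive one graceful stop plus the garbd crash (case [B]); case [C] only becomes non-Primary because 31614 is also counted as partitioned, which matches the logs below.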
Some findings from extensive testing:
- It only happens when SSL is enabled.
- It is not necessarily the node indicated as garbd's Donor when garbd joins.
- It tends to always be the same node.
- It is not the node with the lowest cluster index.
- There does not seem to be a relationship between garbd's index and the 'special' node's index (at least, the pattern is not easy to spot).
- Stopping the node does not always cause the issue; garbd may stay connected and quorum is not lost.
- When the 'suspect' node is stopped and the issue does NOT show up, I could then stop all the other nodes and leave only garbd running without any issue; the 'speciality' did not move to another node, it simply was not there.
My tests use the same host (so the same IP) with different ports, but the customer hit this with nodes on separate hosts.
For interpreting the logs you may find this mapping useful:
IP: 192.168.178.24
DB port / server id / node name: 10614, 20614, 30614, 40614
Galera wsrep ports: 11614, 21614, 31614, 41614; garbd: 22222
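As a convenience while reading the snippets, a throwaway helper (hypothetical, just mirroring the mapping above) to resolve a logged endpoint to its node:

```python
# Map the wsrep ports seen in this report's log lines to node names.
PORT_TO_NODE = {
    "11614": "NODE_10614",
    "21614": "NODE_20614",
    "31614": "NODE_30614",
    "41614": "NODE_40614",
    "22222": "garbd",
}

def who(endpoint: str) -> str:
    """Resolve an 'ssl://ip:port' endpoint to a node name."""
    port = endpoint.rsplit(":", 1)[-1]
    return PORT_TO_NODE.get(port, "unknown")

print(who("ssl://192.168.178.24:22222"))  # garbd
print(who("ssl://192.168.178.24:11614"))  # NODE_10614
```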
For case [B], we can see in Node 3 (31614)'s log that when 10614 is stopped, 22222 (garbd) also disappears:
============
...
2024-01-10 16:27:25 0 [Note] WSREP: (4ac7cd7d-ab72, 'ssl://192.168.178.24:31614') turning message relay requesting off

STOP NODE_10614-a58a4469-a8d8

2024-01-10 16:27:29 0 [Note] WSREP: (4ac7cd7d-ab72, 'ssl://192.168.178.24:31614') turning message relay requesting on, nonlive peers: ssl://192.168.178.24:11614
2024-01-10 16:27:30 0 [Note] WSREP: (4ac7cd7d-ab72, 'ssl://192.168.178.24:31614') reconnecting to a58a4469-a8d8 (ssl://192.168.178.24:11614), attempt 0
2024-01-10 16:27:30 0 [Note] WSREP: (4ac7cd7d-ab72, 'ssl://192.168.178.24:31614') reconnecting to 02ceb1f8-bfaf (ssl://192.168.178.24:22222), attempt 0 <--- Why?
============
Nevertheless, as perhaps expected, after some message exchange the cluster reconfigures itself into a 3-node cluster in Primary state:
2024-01-10 16:27:34 0 [Note] WSREP: evs::proto(4ac7cd7d-ab72, INSTALL, view_id(REG,02ceb1f8-bfaf,44)) suspecting node: 02ceb1f8-bfaf
2024-01-10 16:27:34 0 [Note] WSREP: evs::proto(4ac7cd7d-ab72, INSTALL, view_id(REG,02ceb1f8-bfaf,44)) suspecting node: a58a4469-a8d8
....
2024-01-10 16:27:37 0 [Note] WSREP: view(view_id(PRIM,427856fb-9778,45) memb {
        427856fb-9778,0
        4ac7cd7d-ab72,0
        bfa02389-938f,0
    } joined {
    } left {
    } partitioned {
        02ceb1f8-bfaf,0
        a58a4469-a8d8,0
    })
2024-01-10 16:27:37 0 [Note] WSREP: save pc into disk
2024-01-10 16:27:37 0 [Note] WSREP: forgetting 02ceb1f8-bfaf (ssl://192.168.178.24:22222)
2024-01-10 16:27:37 0 [Note] WSREP: forgetting a58a4469-a8d8 (ssl://192.168.178.24:11614)
2024-01-10 16:27:37 0 [Note] WSREP: (4ac7cd7d-ab72, 'ssl://192.168.178.24:31614') turning message relay requesting off
2024-01-10 16:27:37 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
...
2024-01-10 16:27:37 0 [Note] WSREP: forgetting 02ceb1f8-bfaf (ssl://192.168.178.24:22222)
2024-01-10 16:27:37 0 [Note] WSREP: forgetting a58a4469-a8d8 (ssl://192.168.178.24:11614)
Note that, from the top of this log snippet, only a58a4469-a8d8 [node 11614] and 02ceb1f8-bfaf [garbd] are lost.
The corresponding garbd log section is:
2024-01-10 16:27:23.326 INFO: view(view_id(PRIM,02ceb1f8-bfaf,44) memb {
        02ceb1f8-bfaf,0
        427856fb-9778,0
        4ac7cd7d-ab72,0
        a58a4469-a8d8,0
        bfa02389-938f,0
    } joined {
    } left {
    } partitioned {
    })
2024-01-10 16:27:23.326 INFO: save pc into disk
2024-01-10 16:27:23.329 INFO: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 5
2024-01-10 16:27:23.329 INFO: STATE_EXCHANGE: sent state UUID: c0868bd1-afcc-11ee-828b-7a2e32139412
2024-01-10 16:27:23.331 INFO: STATE EXCHANGE: sent state msg: c0868bd1-afcc-11ee-828b-7a2e32139412
2024-01-10 16:27:23.331 INFO: STATE EXCHANGE: got state msg: c0868bd1-afcc-11ee-828b-7a2e32139412 from 0 (garb)
2024-01-10 16:27:23.331 INFO: STATE EXCHANGE: got state msg: c0868bd1-afcc-11ee-828b-7a2e32139412 from 2 (NODE_30614)
2024-01-10 16:27:23.332 INFO: STATE EXCHANGE: got state msg: c0868bd1-afcc-11ee-828b-7a2e32139412 from 3 (NODE_10614)
2024-01-10 16:27:23.332 INFO: STATE EXCHANGE: got state msg: c0868bd1-afcc-11ee-828b-7a2e32139412 from 1 (NODE_20614)
2024-01-10 16:27:23.821 INFO: STATE EXCHANGE: got state msg: c0868bd1-afcc-11ee-828b-7a2e32139412 from 4 (NODE_40614)
2024-01-10 16:27:23.821 INFO: Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 41,
        members    = 4/5 (joined/total),
        act_id     = 57,
        last_appl. = 0,
        protocols  = 2/10/4 (gcs/repl/appl),
        vote policy= 0,
        group UUID = 9455f258-af00-11ee-9184-5ee699307cf2
2024-01-10 16:27:23.821 INFO: Flow-control interval: [9999999, 9999999]
2024-01-10 16:27:24.218 WARN: Protocol violation. JOIN message sender 2.0 (NODE_30614) is not in state transfer (SYNCED). Message ignored.
2024-01-10 16:27:24.348 WARN: Rejecting JOIN message from 4.0 (NODE_40614): new State Transfer required.
2024-01-10 16:27:24.348 WARN: SYNC message from non-JOINED 4.0 (NODE_40614, PRIMARY). Ignored.
2024-01-10 16:27:25.255 INFO: (02ceb1f8-bfaf, 'ssl://192.168.178.24:22222') turning message relay requesting off

<<< CRASH >>>

mysql@fedora:/run/media/claudio/FedoraData/myharem/instances/garbd$ ../10614/bin/garbd --cfg ./garbd.conf
Now let's look at case [C], still with 5 members in total. From 21614's log, stopping 11614:
2024-01-10 16:32:54 0 [Note] WSREP: (427856fb-9778, 'ssl://192.168.178.24:21614') turning message relay requesting off

....time passed....

2024-01-10 17:18:05 0 [Note] WSREP: (427856fb-9778, 'ssl://192.168.178.24:21614') turning message relay requesting on, nonlive peers: ssl://192.168.178.24:11614
2024-01-10 17:18:05 0 [Note] WSREP: declaring 02ceb1f8-bfb1 at ssl://192.168.178.24:22222 stable
2024-01-10 17:18:05 0 [Note] WSREP: declaring 4ac7cd7d-ab72 at ssl://192.168.178.24:31614 stable
2024-01-10 17:18:05 0 [Note] WSREP: declaring 633be1f6-a9d0 at ssl://192.168.178.24:41614 stable
2024-01-10 17:18:05 0 [Note] WSREP: forgetting 8417f130-b31c (ssl://192.168.178.24:11614)
2024-01-10 17:18:05 0 [Note] WSREP: (427856fb-9778, 'ssl://192.168.178.24:21614') turning message relay requesting off
2024-01-10 17:18:05 0 [Note] WSREP: (427856fb-9778, 'ssl://192.168.178.24:21614') turning message relay requesting on, nonlive peers: ssl://192.168.178.24:22222
2024-01-10 17:18:07 0 [Note] WSREP: (427856fb-9778, 'ssl://192.168.178.24:21614') reconnecting to 02ceb1f8-bfb1 (ssl://192.168.178.24:22222), attempt 0
2024-01-10 17:18:11 0 [Note] WSREP: cleaning up 8417f130-b31c (ssl://192.168.178.24:11614)
2024-01-10 17:18:11 0 [Note] WSREP: evs::proto(427856fb-9778, GATHER, view_id(REG,02ceb1f8-bfb1,59)) suspecting node: 02ceb1f8-bfb1
2024-01-10 17:18:11 0 [Note] WSREP: evs::proto(427856fb-9778, GATHER, view_id(REG,02ceb1f8-bfb1,59)) suspected node without join message, declaring inactive
2024-01-10 17:18:11 0 [Note] WSREP: evs::proto(427856fb-9778, GATHER, view_id(REG,02ceb1f8-bfb1,59)) suspecting node: 4ac7cd7d-ab72
2024-01-10 17:18:11 0 [Note] WSREP: evs::proto(427856fb-9778, GATHER, view_id(REG,02ceb1f8-bfb1,59)) suspected node without join message, declaring inactive
2024-01-10 17:18:11 0 [Note] WSREP: view(view_id(NON_PRIM,02ceb1f8-bfb1,59) memb {
        427856fb-9778,0
        633be1f6-a9d0,0
    } joined {
    } left {
    } partitioned {
        02ceb1f8-bfb1,0   [Garbd crashed]
        4ac7cd7d-ab72,0   [Why 31614 ?]
        8417f130-b31c,0   [11614 - We stopped it]
    })
2024-01-10 17:18:11 0 [Note] WSREP: declaring 633be1f6-a9d0 at ssl://192.168.178.24:41614 stable
2024-01-10 17:18:11 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 2
It's confusing: the log first declares garbd/22222, 31614 and 41614 stable, and then reports 22222, 11614 and 31614 [!!!] as partitioned. Why is 31614/4ac7cd7d-ab72 lost at all?
As you can see in the 'suspecting node' lines, instead of 11614 and 22222 we find the UUIDs of 22222 and 31614.
Again, I could not reproduce this without SSL.