Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- 10.5
Description
:: 10.6.14-9-MariaDB-enterprise-log
:: Galera 26.4.14-1(ra480e94)
Tested with a Galera cluster with [2-4] Nodes + 1 Garbd using SSL for Galera replication.
This could only be reproduced when SSL is used.
Reproduction is pretty straightforward; the steps are:
- Setup a Galera cluster with at least 2 nodes using SSL for Wsrep replication
- Add garbd using SSL as well
- Stop one node
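In my.cnf/garbd.conf terms, the SSL setup for the steps above is along these lines (paths, cluster name and addresses are illustrative placeholders, not the exact test config):

```ini
# my.cnf on each node (hypothetical paths and cluster name)
[mysqld]
wsrep_on = ON
wsrep_cluster_name = testcluster
wsrep_cluster_address = gcomm://192.168.178.24:11614,192.168.178.24:21614
wsrep_provider_options = "socket.ssl=yes;socket.ssl_key=/etc/ssl/galera/server-key.pem;socket.ssl_cert=/etc/ssl/galera/server-cert.pem;socket.ssl_ca=/etc/ssl/galera/ca.pem"

# garbd.conf (hypothetical paths)
address = gcomm://192.168.178.24:11614?socket.ssl=yes
group = testcluster
options = socket.ssl_key=/etc/ssl/galera/server-key.pem;socket.ssl_cert=/etc/ssl/galera/server-cert.pem;socket.ssl_ca=/etc/ssl/galera/ca.pem
```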
Note: the node that triggers the issue tends to be the same one; in a two-node cluster with Node1 and Node2 it might, for example, always be Node1, although unfortunately not every time.
If you repeat this enough times you should see garbd crash and the cluster go non-Primary.
When one of the Galera nodes is gracefully stopped, garbd can stop abruptly.
When garbd stops abruptly, the outcome (most of the time) is that the cluster ends up in a non-Primary state.
I also tested with 4 Galera nodes + 1 garbd, supposing that the garbd crash happened so suddenly that quorum had not yet been recalculated after the graceful shutdown, leaving the cluster momentarily without quorum. However, even with 4+1, stopping one node plus the garbd crash can lose quorum, although in theory three nodes are still up.
It does not always play out the same way; what happens falls into one of three categories:
A node is gracefully stopped and
[A] Garbd does NOT crash, Cluster is Primary
[B] Garbd crashes, Cluster stays Primary
[C] Garbd crashes, Cluster goes non-Primary
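To clarify the quorum arithmetic behind cases [B] and [C]: a surviving component stays Primary only if it holds a strict majority of the previous Primary Component, where gracefully stopped nodes shrink the baseline while crashed/partitioned members still count against it. A minimal sketch, assuming equal member weights (ignoring pc.weight and the actual weighted implementation):

```python
# Simplified sketch of the Galera quorum rule (NOT the real weighted
# implementation): graceful leavers shrink the baseline, crashed or
# partitioned members do not.
def is_primary(prev_members: int, graceful_leaves: int, partitioned: int) -> bool:
    baseline = prev_members - graceful_leaves    # leavers are excluded from the count
    remaining = baseline - partitioned           # members still reachable
    return 2 * remaining > baseline              # strict majority of the baseline

# 5 members (4 nodes + garbd): one node stops gracefully, garbd crashes.
print(is_primary(5, 1, 1))  # True  -> case [B]: 3 of 4, stays Primary

# Case [C]: 31614 is additionally treated as partitioned.
print(is_primary(5, 1, 2))  # False -> 2 of 4, non-Primary
```

Under this rule the 4+1 cluster should survive one graceful stop plus the garbd crash (case [B]); case [C] only becomes non-Primary because 31614 is also counted as partitioned, which matches the logs below.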
Some findings from extensive testing:
- It only happens when SSL is enabled.
- It is not necessarily the node indicated as garbd's Donor when garbd joins.
- It tends to always be the same node.
- It is not the node with the lowest cluster index.
- There does not seem to be a relationship between garbd's index and the 'special' node's index (at least, the pattern is not easy to spot).
- Stopping the node does not always cause the issue; garbd may stay connected and quorum is not lost.
- When the 'suspect' node is stopped and the issue does NOT show up, I could then stop all the other nodes and leave only garbd running without any issue; the 'speciality' did not move to another node, it simply was not there.
My tests use the same host (so the same IP) with different ports, but the customer hit this with nodes on separate hosts.
For interpreting the logs you may find this mapping useful:
IP: 192.168.178.24
DB port / server id / node name: 10614, 20614, 30614, 40614
Galera wsrep ports: 11614, 21614, 31614, 41614; garbd: 22222
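As a convenience while reading the snippets, a throwaway helper (hypothetical, just mirroring the mapping above) to resolve a logged endpoint to its node:

```python
# Map the wsrep ports seen in this report's log lines to node names.
PORT_TO_NODE = {
    "11614": "NODE_10614",
    "21614": "NODE_20614",
    "31614": "NODE_30614",
    "41614": "NODE_40614",
    "22222": "garbd",
}

def who(endpoint: str) -> str:
    """Resolve an 'ssl://ip:port' endpoint to a node name."""
    port = endpoint.rsplit(":", 1)[-1]
    return PORT_TO_NODE.get(port, "unknown")

print(who("ssl://192.168.178.24:22222"))  # garbd
print(who("ssl://192.168.178.24:11614"))  # NODE_10614
```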
For case [B], we can see in Node 3 (31614)'s log that when 10614 is stopped, 22222 (garbd) also disappears:
============
...
2024-01-10 16:27:25 0 [Note] WSREP: (4ac7cd7d-ab72, 'ssl://192.168.178.24:31614') turning message relay requesting off

STOP NODE_10614-a58a4469-a8d8

2024-01-10 16:27:29 0 [Note] WSREP: (4ac7cd7d-ab72, 'ssl://192.168.178.24:31614') turning message relay requesting on, nonlive peers: ssl://192.168.178.24:11614
2024-01-10 16:27:30 0 [Note] WSREP: (4ac7cd7d-ab72, 'ssl://192.168.178.24:31614') reconnecting to a58a4469-a8d8 (ssl://192.168.178.24:11614), attempt 0
2024-01-10 16:27:30 0 [Note] WSREP: (4ac7cd7d-ab72, 'ssl://192.168.178.24:31614') reconnecting to 02ceb1f8-bfaf (ssl://192.168.178.24:22222), attempt 0 <--- Why?
============
Nevertheless, as perhaps expected, after some message exchange the cluster reconfigures itself into a 3-node cluster in Primary state:
2024-01-10 16:27:34 0 [Note] WSREP: evs::proto(4ac7cd7d-ab72, INSTALL, view_id(REG,02ceb1f8-bfaf,44)) suspecting node: 02ceb1f8-bfaf
2024-01-10 16:27:34 0 [Note] WSREP: evs::proto(4ac7cd7d-ab72, INSTALL, view_id(REG,02ceb1f8-bfaf,44)) suspecting node: a58a4469-a8d8
....
2024-01-10 16:27:37 0 [Note] WSREP: view(view_id(PRIM,427856fb-9778,45) memb {
        427856fb-9778,0
        4ac7cd7d-ab72,0
        bfa02389-938f,0
    } joined {
    } left {
    } partitioned {
        02ceb1f8-bfaf,0
        a58a4469-a8d8,0
    })
2024-01-10 16:27:37 0 [Note] WSREP: save pc into disk
2024-01-10 16:27:37 0 [Note] WSREP: forgetting 02ceb1f8-bfaf (ssl://192.168.178.24:22222)
2024-01-10 16:27:37 0 [Note] WSREP: forgetting a58a4469-a8d8 (ssl://192.168.178.24:11614)
2024-01-10 16:27:37 0 [Note] WSREP: (4ac7cd7d-ab72, 'ssl://192.168.178.24:31614') turning message relay requesting off
2024-01-10 16:27:37 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
...
2024-01-10 16:27:37 0 [Note] WSREP: forgetting 02ceb1f8-bfaf (ssl://192.168.178.24:22222)
2024-01-10 16:27:37 0 [Note] WSREP: forgetting a58a4469-a8d8 (ssl://192.168.178.24:11614)
Note that, from the top of this log snippet, only a58a4469-a8d8 [node 11614] and 02ceb1f8-bfaf [garbd] are lost.
The corresponding garbd log section is:
2024-01-10 16:27:23.326 INFO: view(view_id(PRIM,02ceb1f8-bfaf,44) memb {
        02ceb1f8-bfaf,0
        427856fb-9778,0
        4ac7cd7d-ab72,0
        a58a4469-a8d8,0
        bfa02389-938f,0
    } joined {
    } left {
    } partitioned {
    })
2024-01-10 16:27:23.326 INFO: save pc into disk
2024-01-10 16:27:23.329 INFO: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 5
2024-01-10 16:27:23.329 INFO: STATE_EXCHANGE: sent state UUID: c0868bd1-afcc-11ee-828b-7a2e32139412
2024-01-10 16:27:23.331 INFO: STATE EXCHANGE: sent state msg: c0868bd1-afcc-11ee-828b-7a2e32139412
2024-01-10 16:27:23.331 INFO: STATE EXCHANGE: got state msg: c0868bd1-afcc-11ee-828b-7a2e32139412 from 0 (garb)
2024-01-10 16:27:23.331 INFO: STATE EXCHANGE: got state msg: c0868bd1-afcc-11ee-828b-7a2e32139412 from 2 (NODE_30614)
2024-01-10 16:27:23.332 INFO: STATE EXCHANGE: got state msg: c0868bd1-afcc-11ee-828b-7a2e32139412 from 3 (NODE_10614)
2024-01-10 16:27:23.332 INFO: STATE EXCHANGE: got state msg: c0868bd1-afcc-11ee-828b-7a2e32139412 from 1 (NODE_20614)
2024-01-10 16:27:23.821 INFO: STATE EXCHANGE: got state msg: c0868bd1-afcc-11ee-828b-7a2e32139412 from 4 (NODE_40614)
2024-01-10 16:27:23.821 INFO: Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 41,
        members    = 4/5 (joined/total),
        act_id     = 57,
        last_appl. = 0,
        protocols  = 2/10/4 (gcs/repl/appl),
        vote policy= 0,
        group UUID = 9455f258-af00-11ee-9184-5ee699307cf2
2024-01-10 16:27:23.821 INFO: Flow-control interval: [9999999, 9999999]
2024-01-10 16:27:24.218 WARN: Protocol violation. JOIN message sender 2.0 (NODE_30614) is not in state transfer (SYNCED). Message ignored.
2024-01-10 16:27:24.348 WARN: Rejecting JOIN message from 4.0 (NODE_40614): new State Transfer required.
2024-01-10 16:27:24.348 WARN: SYNC message from non-JOINED 4.0 (NODE_40614, PRIMARY). Ignored.
2024-01-10 16:27:25.255 INFO: (02ceb1f8-bfaf, 'ssl://192.168.178.24:22222') turning message relay requesting off

<<< CRASH >>>

mysql@fedora:/run/media/claudio/FedoraData/myharem/instances/garbd$ ../10614/bin/garbd --cfg ./garbd.conf
Now let's look at case [C], still with 5 members in total. From 21614's log, stopping 11614:
2024-01-10 16:32:54 0 [Note] WSREP: (427856fb-9778, 'ssl://192.168.178.24:21614') turning message relay requesting off

....time passed....

2024-01-10 17:18:05 0 [Note] WSREP: (427856fb-9778, 'ssl://192.168.178.24:21614') turning message relay requesting on, nonlive peers: ssl://192.168.178.24:11614
2024-01-10 17:18:05 0 [Note] WSREP: declaring 02ceb1f8-bfb1 at ssl://192.168.178.24:22222 stable
2024-01-10 17:18:05 0 [Note] WSREP: declaring 4ac7cd7d-ab72 at ssl://192.168.178.24:31614 stable
2024-01-10 17:18:05 0 [Note] WSREP: declaring 633be1f6-a9d0 at ssl://192.168.178.24:41614 stable
2024-01-10 17:18:05 0 [Note] WSREP: forgetting 8417f130-b31c (ssl://192.168.178.24:11614)
2024-01-10 17:18:05 0 [Note] WSREP: (427856fb-9778, 'ssl://192.168.178.24:21614') turning message relay requesting off
2024-01-10 17:18:05 0 [Note] WSREP: (427856fb-9778, 'ssl://192.168.178.24:21614') turning message relay requesting on, nonlive peers: ssl://192.168.178.24:22222
2024-01-10 17:18:07 0 [Note] WSREP: (427856fb-9778, 'ssl://192.168.178.24:21614') reconnecting to 02ceb1f8-bfb1 (ssl://192.168.178.24:22222), attempt 0
2024-01-10 17:18:11 0 [Note] WSREP: cleaning up 8417f130-b31c (ssl://192.168.178.24:11614)
2024-01-10 17:18:11 0 [Note] WSREP: evs::proto(427856fb-9778, GATHER, view_id(REG,02ceb1f8-bfb1,59)) suspecting node: 02ceb1f8-bfb1
2024-01-10 17:18:11 0 [Note] WSREP: evs::proto(427856fb-9778, GATHER, view_id(REG,02ceb1f8-bfb1,59)) suspected node without join message, declaring inactive
2024-01-10 17:18:11 0 [Note] WSREP: evs::proto(427856fb-9778, GATHER, view_id(REG,02ceb1f8-bfb1,59)) suspecting node: 4ac7cd7d-ab72
2024-01-10 17:18:11 0 [Note] WSREP: evs::proto(427856fb-9778, GATHER, view_id(REG,02ceb1f8-bfb1,59)) suspected node without join message, declaring inactive
2024-01-10 17:18:11 0 [Note] WSREP: view(view_id(NON_PRIM,02ceb1f8-bfb1,59) memb {
        427856fb-9778,0
        633be1f6-a9d0,0
    } joined {
    } left {
    } partitioned {
        02ceb1f8-bfb1,0   [Garbd crashed]
        4ac7cd7d-ab72,0   [Why 31614 ?]
        8417f130-b31c,0   [11614 - We stopped it]
    })
2024-01-10 17:18:11 0 [Note] WSREP: declaring 633be1f6-a9d0 at ssl://192.168.178.24:41614 stable
2024-01-10 17:18:11 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 2
It's confusing: the log first declares garbd/22222, 31614 and 41614 stable, and then reports 22222, 11614 and 31614 [!!!] as partitioned. Why is 31614/4ac7cd7d-ab72 lost at all?
As you can see in the 'suspecting node' lines, instead of 11614 and 22222 we find the UUIDs of 22222 and 31614.
Again, I could not reproduce this without SSL.