Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 10.4.13
- Fix Version/s: None
- Environment: prod
Description
NOTE: This is to fix our issue and to understand it better / understand whether we are doing something wrong. Thanks for the help, and sorry if this is a badly written issue; it's my first time on Jira.
This seems to be similar to, if not exactly the same as (but with a bigger cluster), the following issue:
https://github.com/codership/galera/issues/623
And this issue:
https://github.com/codership/galera/issues/410
This issue only seems to recur after a non-clean shutdown (i.e., shutting the VM down by killing the process, disconnecting the power, etc.).
Recently we had a couple of problems with our Galera cluster. We added a 3rd region and 3 more nodes to it (we used to have 3 nodes in 2 regions, and 1 garbd on one of those regions).
A few days ago the compute node the VM was running on crashed. When the node came back up it crashed the cluster with SST problems, causing the cluster to go down (read-only) and need to be bootstrapped.
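For context, this is roughly how we brought the cluster back each time (a minimal sketch, assuming the standard MariaDB tooling and that we pick the most advanced node; the exact commands on our side may have differed slightly):
```
# On the most advanced node (ideally the one with safe_to_bootstrap: 1 in grastate.dat):
galera_new_cluster        # MariaDB wrapper that starts mysqld with --wsrep-new-cluster

# Then start the remaining nodes one at a time so they can state-transfer in:
systemctl start mariadb
```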
We are using:
Galera 26.4.4
MariaDB 10.4.13
The configuration is as follows and is the same on all nodes (with a different ist.recv_bind IP and wsrep_node_address per node):
my.cnf:
```
[galera]
wsrep_on=ON
wsrep_cluster_name="powerdns"
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_doublewrite=1
query_cache_size=0
wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so
wsrep_cluster_address=gcomm://<9 ips of nodes>
wsrep_notify_cmd=/usr/bin/get-status.sh
wsrep_provider_options="gmcast.segment=<segment>; ist.recv_bind=<ip>; socket.ssl_cert=/etc/ssl/mysql/server-cert.pem;socket.ssl_key=/etc/ssl/mysql/server-key.pem;socket.ssl_ca=/etc/ssl/mysql/ca-cert.pem"
wsrep_dirty_reads=ON
wsrep-sync-wait=0
wsrep_node_address="<node_ip>"
[mysqld]
ssl-ca = /etc/ssl/mysql/ca-cert.pem
ssl-key = /etc/ssl/mysql/server-key.pem
ssl-cert = /etc/ssl/mysql/server-cert.pem
[client]
ssl-ca = /etc/ssl/mysql/ca-cert.pem
ssl-key = /etc/ssl/mysql/client-key.pem
ssl-cert = /etc/ssl/mysql/client-cert.pem
```
The logs we see on the nodes that cause the crash (the JOINER nodes):
```
WSREP: Member 7.1 (db-<region-1>1) request state transfer from 'any'. Selected 6.1 (db<region-1>-2)(SYNCED) as donor.
WSREP: Shifting PRIMARY -> JOINER (TO: 59319)
WSREP: Requesting state transfer: success, donor: 6
WSREP: forgetting f46bc950-abe6 (ssl://<ip>:4567)
version= 6,
component = PRIMARY,
conf_id = 75
members = 6/7 (joined/total),
act_id = 59324
last_appl. = 59214
protocols = 2/10/4 (gcs/repl/appl),
[Warning] WSREP: Donor f46bc950-9d7f-11ed-abe6-57fe7b2de322 is no longer in the group. State transfer cannot be completed, need to abort. Aborting
WSREP: /usr/bin/mysql: Terminated
systemd: mariadb.service: main process exited, code=killed, status=6/ABRT
mysqld: Terminated
WSREP_SST: [INFO] Joined cleanup. rsync PID:4389
rsyncd[4389]: sent 0 bytes received 0 bytes total size 0
mysql: WSREP_SST:[INFO] Joined cleanup done.
Failed to start MariaDB 10.4.13
```
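In case it is relevant, this is the quick check we run on the joiner after an unclean shutdown (a sketch, assuming the default datadir /var/lib/mysql; as far as we understand, seqno: -1 there means the last shutdown was not clean and the node has no saved position, so it has to request a state transfer):
```
# Inspect the saved Galera state of the restarted node before starting MariaDB.
cat /var/lib/mysql/grastate.dat
# The interesting fields are 'seqno:' (-1 after an unclean shutdown) and 'safe_to_bootstrap:'.
grep -E '^(seqno|safe_to_bootstrap):' /var/lib/mysql/grastate.dat
```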
The logs we see on the DONOR node:
```
WSREP: Member 7.1 (db-<region-1>1) request state transfer from 'any'. Selected 6.1 (db<region-1>-2)(SYNCED) as donor.
Shifting SYNCED -> DONOR/DESYNCED (TO: 59319)
WSREP: Detected STR version: 1, req_len: 120, req: STRv1
Cert index preload: 59215 -> 59319
IST sender using ssl
[ERROR] WSREP: Failed to process action STATE_REQUEST, g:59319, l:5187, ptr:0x7f6322974e78, size: 120: IST sender, failed to connect 'ssl://<server_ip>:4568': connect: No route to host: 113 (No route to host)
```
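Given that donor-side error, this is the minimal reachability check we run from the donor towards the joiner's IST port (a sketch assuming bash and the default IST port 4568; <joiner_ip> is a placeholder, and this only tests TCP reachability, not the SSL handshake):
```
# Run on the donor; <joiner_ip> is the address the joiner advertises for IST
# (ist.recv_bind / wsrep_node_address in our config).
timeout 5 bash -c 'cat < /dev/null > /dev/tcp/<joiner_ip>/4568' \
  && echo "IST port 4568 reachable" \
  || echo "IST port 4568 NOT reachable (firewall / routing problem?)"
```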
After that, the joining node went through each donor in the "line" of donors, one by one, until it reached one that it did not crash (the one we bootstrapped from).
The second time (after it restarts) we see normal logs up until the following line:
`[Warning] WSREP: Donor <id> is no longer in the group. State transfer cannot be completed, need to abort. Aborting...`
This seems to be because the connecting node caused that donor to crash; we then see the same log on each of the other nodes as it crashes them in turn.
This has already happened to us twice and causes a lot of problems and downtime. What is the cause of this? Why does it only happen sometimes?
Why does the node sometimes succeed and manage to sync, while other times it goes to the nodes one by one and crashes them?
Thank you.
Issue Links
- is duplicated by: MDEV-30887 Node restarting causes cluster to crash (Closed)
- relates to: MDEV-23958 All node ejected from cluster after a new member (Closed)