[MDEV-30888] Node restarting causes cluster to crash Created: 2023-03-20 Updated: 2023-04-03 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Galera, Galera SST |
| Affects Version/s: | 10.4.13 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Ben Shalev | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | bug, galera, innodb, replication | ||
| Environment: |
prod |
||
| Issue Links: |
|
||||||||||||||||
| Description |
|
NOTE: We are filing this to fix our issue, to understand it better, and to find out whether we are doing something wrong. Thanks for the help, and sorry if this is a poorly written issue; it is our first time on Jira.

This seems to be similar to, if not exactly the same as (but with a bigger cluster), the following issue:

The issue seems to recur only after a non-clean shutdown (i.e., the VM goes down because the process is killed, power is disconnected, etc.).

Recently we have had a couple of problems with our Galera cluster. We added a 3rd region and, with it, 3 more nodes (we used to have 3 nodes on 2 regions, and 1 garbd on one of those regions). A few days ago the compute host that the VM was running on crashed. When the node came back up, it crashed the cluster with SST problems, leaving the cluster down in a read-only state and needing to be bootstrapped.

We are using:

The configuration is as follows and is the same on all nodes (apart from the ist.recv_bind IP and wsrep_node_address):

my.cnf:
wsrep_provider_options="gmcast.segment=<segment>; ist.recv_bind=<ip>; socket.ssl_cert=/etc/ssl/mysql/server-cert.pem;socket.ssl_key=/etc/ssl/mysql/server-key.pem;socket.ssl_ca=/etc/ssl/mysql/ca-cert.pem"
[mysqld]
[client]

The logs we see on the nodes that cause the crash (JOINER nodes):

The logs we see on the donor:

After that, the joining node moved on to each node in the "line" of donors in turn, crashing them, until it reached one that it did not crash (the one we bootstrapped from).

The second time (after it restarts) we can see normal logs up until the following log:

This has already happened to us twice and causes a lot of problems and downtime. What is the cause of this? Why does it happen only sometimes? Why does the node sometimes succeed in syncing, while at other times it goes through the nodes one by one and causes them to crash? |
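The description states that the configuration is the same on all nodes apart from ist.recv_bind and wsrep_node_address. A minimal sketch of one way to check that claim for the provider options string is shown here. The helper names `parse_provider_options` and `diff_provider_options` are hypothetical and not part of MariaDB or Galera, and the sketch assumes the option strings have already been copied out of each node's my.cnf by hand.

```python
def parse_provider_options(raw: str) -> dict[str, str]:
    """Split a wsrep_provider_options value into a {key: value} dict.

    Options are separated by ';' and may or may not be followed by a space,
    as in the my.cnf line quoted in the description.
    """
    options = {}
    for fragment in raw.split(";"):
        fragment = fragment.strip()
        if not fragment:
            continue  # tolerate a trailing ';'
        key, _, value = fragment.partition("=")
        options[key.strip()] = value.strip()
    return options


# Assumption: only ist.recv_bind is expected to differ per node, as stated in
# the description. Add "gmcast.segment" here if segments differ per region.
PER_NODE_KEYS = frozenset({"ist.recv_bind"})


def diff_provider_options(raw_a: str, raw_b: str, ignore=PER_NODE_KEYS) -> dict:
    """Return {key: (value_on_a, value_on_b)} for options that differ.

    Keys listed in `ignore` are skipped. A missing option is reported as None.
    """
    a = parse_provider_options(raw_a)
    b = parse_provider_options(raw_b)
    keys = (a.keys() | b.keys()) - set(ignore)
    return {k: (a.get(k), b.get(k)) for k in sorted(keys) if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Made-up values for illustration only; they are not taken from the report.
    node_a = "gmcast.segment=1; ist.recv_bind=10.0.0.1; socket.ssl_ca=/etc/ssl/mysql/ca-cert.pem"
    node_b = "gmcast.segment=1; ist.recv_bind=10.0.0.2;socket.ssl_ca=/etc/ssl/mysql/ca-cert.pem"
    print(diff_provider_options(node_a, node_b))
```

An empty result means the two strings agree on every option outside the ignored set. This only compares the text of the option strings; it does not query a running node.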
| Comments |
| Comment by Ben Shalev [ 2023-04-03 ] |
|
Hey, are there any updates regarding the `WSREP: Donor <id> is no longer in the group. State transfer cannot be completed` error? |