[MDEV-8323] Failed DDL execution can cause a full Galera Cluster crash Created: 2015-06-17 Updated: 2015-07-14 Resolved: 2015-07-08 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera, wsrep |
| Affects Version/s: | 5.5.43-galera, 10.0.19-galera |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Geoff Montee (Inactive) | Assignee: | Nirbhay Choubey (Inactive) |
| Resolution: | Won't Fix | Votes: | 1 |
| Labels: | galera |
| Description |
|
Consider the following sequence of events in a Galera Cluster when wsrep_OSU_method is set to TOI:
node1 will see an error like this:
node2 and node3 might see errors like this:
And node1 will see node2 and node3 leave the cluster, causing a loss of quorum and total cluster failure:
Should it be possible for this to happen? Can we fix this by making a node crash when its DDL fails while wsrep_OSU_method is set to TOI? In most cases, one node crashing is probably better than total cluster failure. |
| Comments |
| Comment by Nirbhay Choubey (Inactive) [ 2015-07-08 ] |
|
Hi @jeoffmontee
This is not exactly what happens internally. In TOI, the DDL is replicated to other
Post-ALTER, in my opinion, the DMLs should also be tuned accordingly (made compatible |
| Comment by Geoff Montee (Inactive) [ 2015-07-14 ] |
|
Hi nirbhay_c,
If that is what is supposed to happen, I wonder if it isn't working properly. The series of events described in the JIRA issue, including the full cluster crash, has actually happened. Rather than evicting node1, node2 and node3 thought that they were compromised, so they intentionally crashed, which created a loss of quorum. Only node1 (with the outdated schema) was left alive. |
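The quorum arithmetic behind the scenario above can be illustrated with a minimal Python sketch. It assumes equal node weights and a plain majority rule, which is a simplification of Galera's weighted quorum calculation. The function `has_quorum` and its parameter names are hypothetical and are not part of Galera or MariaDB.

```python
def has_quorum(previous_size: int, left_gracefully: int, remaining: int) -> bool:
    """Simplified majority check, assuming every node has the same weight.

    previous_size:   nodes in the last primary component
    left_gracefully: nodes that shut down cleanly (removed from the baseline)
    remaining:       nodes that can still see each other
    """
    # Nodes that crash or abort still count toward the baseline;
    # only cleanly departed nodes are subtracted from it.
    baseline = previous_size - left_gracefully
    # The remaining nodes need a strict majority of the baseline.
    return remaining * 2 > baseline


# Reported case: node2 and node3 abort, node1 is left alone.
print(has_quorum(previous_size=3, left_gracefully=0, remaining=1))  # False

# Proposed alternative: only the node with the failed DDL goes down.
print(has_quorum(previous_size=3, left_gracefully=0, remaining=2))  # True
```

Under these assumptions, losing the single node with the failed DDL leaves a two-node majority, whereas two nodes aborting leaves the survivor without quorum, which matches the outcome reported in this issue.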