[MDEV-23528] Galera: simultaneous DDL and DML lead to entire cluster crash Created: 2020-08-21 Updated: 2023-11-17 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Galera, Server |
| Affects Version/s: | 10.4.12 |
| Fix Version/s: | 10.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Rick Pizzi | Assignee: | Seppo Jaakola |
| Resolution: | Unresolved | Votes: | 4 |
| Labels: | None | ||
| Description |
|
Scenario:
Suddenly, node 1 errors out because it cannot find a row to update. Notice that the error seems to come from the applier thread even though we are only writing on this node:
The node then asserts and shuts down:
At the same time, almost identical errors are logged on nodes 2 and 3: Node 2:
Node 3:
Nodes 2 and 3 should shut down but they don't; they go into wsrep-disabled mode instead. The result of this scenario is a completely down cluster. kill -9 is needed to shut down nodes 2 and 3, and the only way to recover is to re-bootstrap the cluster, with nodes 2 and 3 undergoing an SST.
By checking the binary logs of node 1, I verified that writes only go to node 1 (no multi-master). Still, it puzzles me that the error comes from the applier thread!
The last statement logged to the binary log before the crash is the DROP TABLE referenced by the error, but the update that precedes it is on another schema. The last reference to the dropped table is from an update 40 seconds before the drop. |
| Comments |
| Comment by Rick Pizzi [ 2020-08-21 ] |
|
After some internal conversations, a possible explanation is that a local thread can crash in the applying state if a local transaction was first rolled back and then replayed; the crash would happen in the replaying phase. So it could be that the DDL caused a DML on the same table to be rolled back, and when the DML was replayed the table was gone, hence the issue. This is just speculation on my side, of course. |
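
To make the speculation above easier to discuss, here is a toy Python model of the suspected ordering. It is only a sketch and is not Galera or MariaDB code: `ToyNode`, `replay_after_ddl`, and the table name `t1` are all hypothetical, and the model assumes the replay goes through the same apply path a remote write set would use.

```python
class ToyNode:
    """Toy stand-in for a single node; hypothetical, not Galera internals."""

    def __init__(self):
        # One table "t1" holding a single row with key 1.
        self.tables = {"t1": {1: "old"}}

    def update(self, table, key, value):
        # Mimics the apply path: fails if the table or row is missing.
        if table not in self.tables:
            raise LookupError(f"table {table!r} does not exist")
        rows = self.tables[table]
        if key not in rows:
            raise LookupError(f"row {key} not found in {table!r}")
        rows[key] = value

    def drop(self, table):
        del self.tables[table]


def replay_after_ddl(node):
    # 1. A local UPDATE on t1 is in flight and not yet committed.
    pending = ("t1", 1, "new")
    # 2. The DROP TABLE is ordered first; the pending update is rolled back.
    node.drop("t1")
    # 3. The rolled-back update is replayed, but its target is gone.
    try:
        node.update(*pending)
    except LookupError as exc:
        return f"replay failed: {exc}"
    return "replay succeeded"


print(replay_after_ddl(ToyNode()))
```

In this model the replay step fails because the table no longer exists, which would be consistent with an apply-style error showing up on a node that only receives local writes. Whether the server actually takes this path is exactly the open question.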