We are experiencing a crash of all galera nodes receiving write sets. The operation is a "last resort" clean up stored procedure, that deletes many rows from the same set of related tables. It generally takes 4-5 minutes to run based on our data size, but is crashing within 10-20 seconds if it is going to go wrong.
We have been using this stored procedure, reasonably regularly, without problem on 10.1 for several years. As suggested by Enterprise support, I have also tried this on the latest 10.4 build, which they provided me with a URL to. This also exhibits the problem.
Unfortunately, I have been unable to replicate either simplified reproduction steps, or from a different system of ours. However, I have been able to take a "mariabackup" i.e. physical backup, and reproduce the fault on 2 other clusters. The original, and first replication were on VMware machines. The third system, is an AWS EC2 setup. All 3 have the same MariaDB configuration. I suspect the problem is exposed due to the particular on disk data.
Attached is the log of one of the nodes receiving the writeset.
First round of testing, I found that autocommit needs to be ON.
Due to suspecting the data, and knowing that our QA team were trying to delete rows - I started my test again and used "OPTIMIZE TABLE" on the tables that are touched. This caused
to appear in the log, at an unusual point in the crash logging.
Because of finding that info, I have now set wsrep_slave_thread = 1, and this completes successfully. Previously the value was 12. I have also tested = 4, which also crashed.
Therefore with this additional knowledge, I am presuming that something in Galera is presuming it can apply certain writesets in parallel when it cannot.