We have a 3 node Galera Cluster and this weekend we tried rebooting each node one at a time for an EC2 instance upgrade. When the servers came back online they each crashed with signal 11 while trying to rejoin the cluster.
The crash is occurring during WSREP recovery. If I set wsrep_on=OFF MySQL will startup without crashing, but it again crashes when dynamically setting wsrep_on=ON.
Nothing shows up in the other two nodes' logs while the other join is starting up before it crashes. And all ports are open between the Galera nodes. Each node is running MariaDB 10.1.25 but I did upgrade one node to 10.1.26 to see if the problem was fixed there and it exhibited the same behavior. The only way I was get the nodes to rejoin the cluster was to force an SST sync. However the data directory is 1.8TB so that is far from ideal for each node restart.
I've attached the wsrep_recovery log and apport crash file, but it doesn't contain the core dump for some reason. I've also uploaded the my.cnf and a mariadb.cnf config file containing the Galera Cluster related config options.