This morning our newly spun-up 3-node Galera cluster crashed. It had been running in production for roughly 6-7 hours.
What happened is that 2 of the nodes went into an undefined state. The third one was OK, but for some reason MaxScale could not select a master node:
2020-07-01 06:41:47 notice : Server changed state: server3[192.168.138.240:3306]: slave_down. [Slave, Synced, Running] -> [Down]
2020-07-01 06:41:49 notice : Server changed state: server1[192.168.198.58:3306]: slave_down. [Slave, Synced, Running] -> [Down]
2020-07-01 06:41:55 error : [galeramon] There are no cluster members
2020-07-01 06:41:55 notice : Server changed state: server2[192.168.148.226:3306]: lost_master. [Master, Synced, Running] -> [Running]
2020-07-01 06:56:35 error : (9) [readwritesplit] Couldn't find suitable Master from 3 candidates.
2020-07-01 06:56:35 error : (9) Failed to create new router session for service 'Galera-Service'. See previous errors for more details.
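To cross-check MaxScale's view, each node's own picture of the cluster can be queried with the standard Galera status variables (a generic diagnostic, not specific to this crash):

```sql
-- Run on each node to see its view of the cluster.
-- 'Primary' means the node is part of the primary component.
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
-- Number of nodes currently in the cluster component.
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
-- 'Synced' means the node is fully joined and up to date.
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
```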
After looking at the nodes themselves, I found that mysqld had actually crashed with an assertion:
mysqld: /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/wsrep-lib/include/wsrep/client_state.hpp:603: int wsrep::client_state::bf_abort(wsrep::seqno): Assertion `mode_ == m_local || transaction_.is_streaming()' failed.
I am attaching the full crash log below.
I am also attaching the configs of one of the machines. Please note that this specific node also has binary logging enabled, as it is used by an external replication slave.
Is there a way to prevent this?