[MDEV-23064] Crashed node under Galera Created: 2020-07-01  Updated: 2020-12-08  Resolved: 2020-10-07

Status: Closed
Project: MariaDB Server
Component/s: Data Manipulation - Insert, Galera
Affects Version/s: 10.4.13
Fix Version/s: 10.4.16, 10.5.7

Type: Bug Priority: Critical
Reporter: Martin Kovachev Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: assertion, crash, insert
Environment:

CentOS Linux release 7.8.2003 (Core)


Attachments: Text File 07-02-2020-node1.txt     Text File 07-02-2020-node2.txt     Text File 07-02-2020-node3.txt     Text File Maria crash.txt     File binlog.cfg     File galera.cfg     File server.cfg     Text File table structure.txt    
Issue Links:
Relates
relates to MDEV-19966 GaleraCluster's one node is Crashed d... Closed

 Description   

This morning our newly spawned 3-node Galera cluster crashed. It had been running in production for about 6-7 hours.

Two of the nodes went into an undefined state. The third one was OK, but for some reason MaxScale could not select a master node:

2020-07-01 06:41:47 notice : Server changed state: server3[192.168.138.240:3306]: slave_down. [Slave, Synced, Running] -> [Down]
2020-07-01 06:41:49 notice : Server changed state: server1[192.168.198.58:3306]: slave_down. [Slave, Synced, Running] -> [Down]
2020-07-01 06:41:55 error : [galeramon] There are no cluster members
2020-07-01 06:41:55 notice : Server changed state: server2[192.168.148.226:3306]: lost_master. [Master, Synced, Running] -> [Running]
2020-07-01 06:56:35 error : (9) [readwritesplit] Couldn't find suitable Master from 3 candidates.
2020-07-01 06:56:35 error : (9) Failed to create new router session for service 'Galera-Service'. See previous errors for more details.

After looking at the nodes themselves, I found that mysqld had actually crashed with an assertion failure:

mysqld: /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/wsrep-lib/include/wsrep/client_state.hpp:603: int wsrep::client_state::bf_abort(wsrep::seqno): Assertion `mode_ == m_local || transaction_.is_streaming()' failed.

Attaching the full crash log below.

Also attaching the configs of one of the machines. Please note this specific one also has binary logging enabled as it is used by an external slave.

Is there a way to prevent this?



 Comments   
Comment by Martin Kovachev [ 2020-07-02 ]

This morning, at approximately the same time and with the same query and tables in use, I had another crash.

This time I've collected logs from all 3 nodes. It seems 2 of them crashed and the third one somehow became non-primary.

Attaching the logs (file names starting with 07-02).

Comment by Martin Kovachev [ 2020-07-02 ]

I've also attached the structure of the table used.

Comment by Jan Lindström (Inactive) [ 2020-10-07 ]

To my understanding this bug has been fixed. If you can reproduce it with 10.4.15, please provide a resolved stack dump (see https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/), the error logs from both nodes, and more detailed steps to reproduce.
