Details

    Description

This morning our newly spun-up 3-node Galera cluster crashed. It had been running in production for roughly 6-7 hours.

What happened is that 2 of the nodes went into an undefined state. The third one was OK, but for some reason MaxScale could not select a master node:

      2020-07-01 06:41:47 notice : Server changed state: server3[192.168.138.240:3306]: slave_down. [Slave, Synced, Running] -> [Down]
      2020-07-01 06:41:49 notice : Server changed state: server1[192.168.198.58:3306]: slave_down. [Slave, Synced, Running] -> [Down]
      2020-07-01 06:41:55 error : [galeramon] There are no cluster members
      2020-07-01 06:41:55 notice : Server changed state: server2[192.168.148.226:3306]: lost_master. [Master, Synced, Running] -> [Running]
      2020-07-01 06:56:35 error : (9) [readwritesplit] Couldn't find suitable Master from 3 candidates.
      2020-07-01 06:56:35 error : (9) Failed to create new router session for service 'Galera-Service'. See previous errors for more details.

After browsing the nodes themselves, I found that mysqld had actually crashed with an assertion:

      mysqld: /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/wsrep-lib/include/wsrep/client_state.hpp:603: int wsrep::client_state::bf_abort(wsrep::seqno): Assertion `mode_ == m_local || transaction_.is_streaming()' failed.

      Attaching the full crash log below.

Also attaching the configs of one of the machines. Please note that this specific node also has binary logging enabled, as it is used by an external slave.
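For context, a binary-logging fragment for a Galera node that feeds an external slave typically looks like the sketch below. All values are illustrative placeholders, not the contents of the attached binlog.cfg:

```ini
[mysqld]
# Galera only supports row-based replication
binlog_format     = ROW
# Enable the binary log so an external slave can replicate from this node
log_bin           = /var/log/mysql/mariadb-bin
# Also write events replicated from other cluster nodes to the binary log
log_slave_updates = ON
# Placeholder; must differ from every other server in the topology
server_id         = 2
```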

      Is there a way to prevent this?

      Attachments

        1. 07-02-2020-node1.txt
          35 kB
        2. 07-02-2020-node2.txt
          42 kB
        3. 07-02-2020-node3.txt
          15 kB
        4. binlog.cfg
          0.2 kB
        5. galera.cfg
          0.8 kB
        6. Maria crash.txt
          7 kB
        7. server.cfg
          3 kB
        8. table structure.txt
          2 kB

        Issue Links

          Activity

            mkovachev Martin Kovachev created issue -
            mkovachev Martin Kovachev made changes -
            Field Original Value New Value
            Attachment binlog.cfg [ 52523 ]
            Attachment server.cfg [ 52524 ]
            Attachment galera.cfg [ 52525 ]
            mkovachev Martin Kovachev made changes -
            mkovachev Martin Kovachev made changes -
            elenst Elena Stepanova made changes -
            Component/s Galera [ 10124 ]
            Fix Version/s 10.4 [ 22408 ]
            Assignee Jan Lindström [ jplindst ]
            mkovachev Martin Kovachev made changes -
            Attachment table structure.txt [ 52547 ]
            mkovachev Martin Kovachev made changes -
            Attachment 07-02-2020-node3.txt [ 52548 ]
            Attachment 07-02-2020-node2.txt [ 52549 ]
            Attachment 07-02-2020-node1.txt [ 52550 ]
            jplindst Jan Lindström (Inactive) made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            jplindst Jan Lindström (Inactive) made changes -
Resolution date 2020-10-07 13:41:46
            jplindst Jan Lindström (Inactive) made changes -
            Fix Version/s 10.4.15 [ 24507 ]
            Fix Version/s 10.5.6 [ 24508 ]
            Fix Version/s 10.6.0 [ 24431 ]
            Fix Version/s 10.4 [ 22408 ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.4.16 [ 25020 ]
            Fix Version/s 10.5.7 [ 25019 ]
            Fix Version/s 10.6.0 [ 24431 ]
            Fix Version/s 10.4.15 [ 24507 ]
            Fix Version/s 10.5.6 [ 24508 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 110738 ] MariaDB v4 [ 158043 ]

            People

Assignee: jplindst Jan Lindström (Inactive)
Reporter: mkovachev Martin Kovachev
Votes: 0
Watchers: 3

