[MDEV-27085] Total Galera Cluster failure on Delete_rows_v1

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.2.40, 10.2.41
Fix Version/s: 10.11, 11.4, 11.8
Component/s: Galera
Labels:
- crash
- galera
- mariadb
Environment:
Ubuntu 18.04

Description

We've run a multi-primary Galera Cluster with five MariaDB 10.1 nodes for the last five years, without major issues.

In the past month, we started a rolling upgrade process to 10.2.40, intending to go up to 10.4 or 10.5.

We upgraded one machine every one or two days, giving each one time to fill its RAM, in order to be as little disruptive as possible for our clients.

Last Thursday, November 11th, the last machine was upgraded, and it picked up 10.2.41 instead of the 10.2.40 running on the other machines.

The next day, Friday the 12th, at one of our peak hours, the whole cluster came crashing down.

Slave SQL: Could not execute Write_rows_v1 event on table moloni_geral.documentos; Duplicate entry '27:409702:52248:0' for key 'doc_unico', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 1518, Internal MariaDB error code: 1062

The attached logNovember12th file shows the output that appeared on all the servers, which strikes me as odd: shouldn't at least one server show something different?

We know how this happened: an API customer tried to insert two documents at the same time, and the doc_unico unique key didn't end up unique.

However, this worked for years without problems. Our load for the day was nowhere near the worst we had.

We managed to reproduce the issue (inadvertently, while trying to implement a request throttler) on the morning of November 17th, with the exact same results.

We've also hit another error four times now: once on the afternoon of the 12th, after recovering some of the cluster machines, and three more times today.

Slave SQL: Could not execute Delete_rows_v1 event on table moloni_geral.AC_remain_cookies; Can't find record in 'AC_remain_cookies', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 501, Internal MariaDB error code: 1032

The attached logNovember18th file contains the message that, again, appears on all the servers. Between today's events we set wsrep_slave_threads=1, to no avail: the error still occurred.
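
Both applier errors quoted above share the same line shape, so the fields can be pulled out mechanically when comparing the logs from each node. The following is only a minimal sketch: the regular expression is derived from the two messages in this report, and the names `APPLIER_ERROR` and `parse_applier_error` are hypothetical helpers, not part of MariaDB or Galera. Other server versions may word the line differently.

```python
import re
from typing import Optional

# Pattern derived from the two "Slave SQL" lines quoted in this report.
# It is an assumption that other applier errors follow the same layout.
APPLIER_ERROR = re.compile(
    r"Slave SQL: Could not execute (?P<event>\w+) event "
    r"on table (?P<schema>\w+)\.(?P<table>\w+); "
    r"(?P<message>.*?), Error_code: (?P<code>\d+); "
    r"handler error (?P<handler>\w+); "
    r".*?end_log_pos (?P<end_log_pos>\d+)"
)


def parse_applier_error(line: str) -> Optional[dict]:
    """Return the fields of an applier error line, or None if it does not match."""
    match = APPLIER_ERROR.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Numeric fields are converted so they can be compared and sorted.
    fields["code"] = int(fields["code"])
    fields["end_log_pos"] = int(fields["end_log_pos"])
    return fields


if __name__ == "__main__":
    sample = (
        "Slave SQL: Could not execute Delete_rows_v1 event on table "
        "moloni_geral.AC_remain_cookies; Can't find record in "
        "'AC_remain_cookies', Error_code: 1032; handler error "
        "HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, "
        "end_log_pos 501, Internal MariaDB error code: 1032"
    )
    info = parse_applier_error(sample)
    print(info["event"], info["table"], info["code"])
```

Running the parser over the same time window on every node would make it easier to see whether the failing event, table and error code really are identical everywhere.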

The query that triggers this error is the following:

DELETE FROM AC_remain_cookies
WHERE hash = [some user hash]
  OR expires < [a year ago]

And the structure is the following:

CREATE TABLE `AC_remain_cookies` (
  `hash` varchar(128) NOT NULL,
  `ip` varchar(250) NOT NULL,
  `user_id` int(11) NOT NULL,
  `created` datetime NOT NULL,
  `expires` datetime NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

ALTER TABLE `AC_remain_cookies`
  ADD PRIMARY KEY (`hash`),
  ADD KEY `expires` (`expires`),
  ADD KEY `user_id` (`user_id`);

This table is very, very small, averaging about 15k rows.
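
One property of the DELETE statement that may be relevant: the `expires < [a year ago]` branch does not depend on the user hash, so two executions issued at about the same time for different users can select the same expired rows. Whether that overlap is what trips the applier here is an assumption, not something confirmed in this report. The sketch below only illustrates the overlap in plain Python, with made-up rows and a hypothetical `rows_matched` helper; it does not touch MariaDB or Galera.

```python
from datetime import datetime, timedelta


def rows_matched(rows, user_hash, cutoff):
    """Return the primary keys that
    DELETE ... WHERE hash = user_hash OR expires < cutoff would remove."""
    return {h for h, expires in rows.items() if h == user_hash or expires < cutoff}


now = datetime(2021, 11, 18, 12, 0, 0)
cutoff = now - timedelta(days=365)  # "[a year ago]"

# Made-up sample data: primary key (hash) -> expires
rows = {
    "hash_a": now + timedelta(days=30),
    "hash_b": now + timedelta(days=30),
    "stale_1": now - timedelta(days=400),
    "stale_2": now - timedelta(days=500),
}

# Two sessions, for two different users, running the same statement.
session_1 = rows_matched(rows, "hash_a", cutoff)
session_2 = rows_matched(rows, "hash_b", cutoff)

# Rows that both sessions would try to delete.
overlap = session_1 & session_2
print(sorted(overlap))  # ['stale_1', 'stale_2']
```

If this overlap does turn out to matter, one thing to try would be moving the expiry cleanup into its own statement, so that per-user deletes only ever touch a single primary key.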

Our Galera configuration is as follows:

[mysqld]
binlog_format=ROW
binlog_checksum=NONE
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
bind-address=0.0.0.0
slave_type_conversions="ALL_NON_LOSSY,ALL_LOSSY"

# Galera Provider Configuration
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_provider_options="gcache.size=4096M;gcache.recover=yes"

# Galera Cluster Configuration
wsrep_cluster_name="moloni_cluster"
wsrep_cluster_address=gcomm://<usually five servers here>

# Galera Synchronization Configuration
wsrep_sst_method=mariabackup
wsrep_sst_auth="<username:password>"
wsrep_sst_donor="fiona"

# Galera Node Configuration
wsrep_node_address="10.32.0.27"
wsrep_node_name="grace"
wsrep_slave_threads=1
wsrep_slave_UK_checks=ON

# Retry autocommit queries
wsrep_retry_autocommit=4
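
Since the suspicion falls on one node behaving differently, it may help to confirm that the option files really are identical across nodes apart from the per-node settings. The following is a rough sketch using Python's standard `configparser`; the helper names `load_mysqld_options` and `diff_options` are hypothetical, and it assumes each node's `[mysqld]` section is available as text in the same shape as the fragment above.

```python
import configparser


def load_mysqld_options(text: str) -> dict:
    """Parse the [mysqld] section of an option file into a plain dict."""
    parser = configparser.ConfigParser(
        interpolation=None,   # values may contain '%' and must be kept verbatim
        allow_no_value=True,  # option files may contain bare flags
        strict=False,         # tolerate repeated options
    )
    parser.optionxform = str  # keep option names as written (no lowercasing)
    parser.read_string(text)
    options = {}
    for name, value in parser["mysqld"].items():
        # The server accepts '-' and '_' interchangeably in option names,
        # so normalise to '_' before comparing.
        options[name.replace("-", "_")] = (value or "").strip('"')
    return options


def diff_options(a: dict, b: dict) -> dict:
    """Return {option: (value_in_a, value_in_b)} for options that differ."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

Calling `diff_options` on the parsed options of two nodes should, ideally, report only `wsrep_node_address` and `wsrep_node_name`; anything else in the result is worth a closer look.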

Analysing the timing of the crashes and the order in which we upgraded or recovered the machines, we're tempted to point at the machine with 10.2.41: the first crashes came the morning after that machine was upgraded from 10.1, our test on the 17th crashed just after we had rejoined this machine to the cluster, and this morning's errors came after this machine was recovered again.

Between the 12th and the dawn of the 17th, for instance, the cluster ran on three machines, all on 10.2.40, with no major issues. The same was true from the start of the rolling upgrade, three weeks ago, and of course for the years before that.

This feels like a bug, because nothing like it happened in years on 10.1, and the queries that triggered the problems have been in our codebase since forever.

Attachments


- logNovember12th (6 kB, 2021-11-18 16:20, Marco Amado)
- logNovember18th (7 kB, 2021-11-18 16:28, Marco Amado)

People

Assignee: Seppo Jaakola

Reporter: Marco Amado (Inactive)

Votes: 0

Watchers: 4

Dates

Created: 2021-11-18 16:51

Updated: 2026-05-04 16:54
