MariaDB Server / MDEV-27085

Total Galera Cluster failure on Delete_rows_v1

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.2.40, 10.2.41
    • Fix Version/s: None
    • Component/s: Galera
    • Environment: Ubuntu 18.04

    Description

      We've run a multi-primary Galera Cluster of five MariaDB 10.1 nodes for the last five years, without major issues.

      In the past month, we started a rolling upgrade to 10.2.40, intending to eventually go up to 10.4 or 10.5.

      We upgraded one machine every one or two days, giving each time to warm its caches (fill RAM), so as to be as little disruptive as possible for our clients.

      Last Thursday, November 11th, the last machine was upgraded, and it picked up 10.2.41 instead of the 10.2.40 the other machines are running.

      The next day, Friday the 12th, during one of our peak hours, the whole cluster came crashing down.

      Slave SQL: Could not execute Write_rows_v1 event on table moloni_geral.documentos; Duplicate entry '27:409702:52248:0' for key 'doc_unico', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 1518, Internal MariaDB error code: 1062
      

      The logNovember12th file is the output from all the servers, which strikes me as odd: shouldn't at least one of them have something other than this in its log?

      We know how this happened: an API customer tried to insert two documents at the same time, and the doc_unico unique key didn't end up unique.
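
      For illustration, the failure pattern boils down to two clients racing the same business key on two different nodes. A minimal sketch, with hypothetical column names (only the doc_unico key and the duplicate value '27:409702:52248:0' come from the log above):

      -- Session A, on node 1 (column names invented for the sketch):
      INSERT INTO moloni_geral.documentos (company_id, series_id, doc_number, flags)
      VALUES (27, 409702, 52248, 0);
       
      -- Session B, on node 2, at nearly the same instant:
      INSERT INTO moloni_geral.documentos (company_id, series_id, doc_number, flags)
      VALUES (27, 409702, 52248, 0);
       
      -- Expected with Galera: one transaction commits, the other client
      -- gets a deadlock error (1213) and can retry. What we saw instead:
      -- the applier on every node failed with HA_ERR_FOUND_DUPP_KEY (1062)
      -- and the whole cluster shut down.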

      However, this had worked for years without problems, and our load that day was nowhere near the worst we've handled.

      We managed to reproduce the issue (inadvertently, while trying to implement a request throttler) on the morning of November 17th, with the exact same results.

      We've also had another error, four times now: once on the afternoon of the 12th, right after recovering some of the cluster machines, and three more times today.

      Slave SQL: Could not execute Delete_rows_v1 event on table moloni_geral.AC_remain_cookies; Can't find record in 'AC_remain_cookies', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 501, Internal MariaDB error code: 1032
      

      The logNovember18th file contains the message that, again, appears on all the servers. Between today's events, we tried setting wsrep_slave_threads=1, to no avail; the error still occurred.
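
      For reference, wsrep_slave_threads is a dynamic global variable, so it can be changed at runtime on each node, roughly like this:

      -- Drop the applier to a single thread on this node:
      SET GLOBAL wsrep_slave_threads = 1;
      -- Confirm the change took effect:
      SHOW GLOBAL VARIABLES LIKE 'wsrep_slave_threads';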

      The query that triggers this error is the following:

      DELETE FROM AC_remain_cookies
      WHERE hash = [some user hash]
        OR expires < [a year ago]
      

      And the structure is the following:

      CREATE TABLE `AC_remain_cookies` (
        `hash` varchar(128) NOT NULL,
        `ip` varchar(250) NOT NULL,
        `user_id` int(11) NOT NULL,
        `created` datetime NOT NULL,
        `expires` datetime NOT NULL
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
       
      ALTER TABLE `AC_remain_cookies`
        ADD PRIMARY KEY (`hash`),
        ADD KEY `expires` (`expires`),
        ADD KEY `user_id` (`user_id`);
      

      This table is very, very small, averaging about 15k rows.
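
      Our working theory, and it is only an assumption on our part, is that two nodes can run this cleanup over overlapping rows at nearly the same time:

      -- Sketch with invented values: node A and node B both fire the
      -- cleanup almost simultaneously.
      DELETE FROM AC_remain_cookies
      WHERE hash = 'aa11bb22'                  -- hypothetical user hash
         OR expires < '2020-11-18 00:00:00';   -- "a year ago"
       
      -- Both write sets delete the same expired rows; the applier of
      -- whichever write set arrives second can no longer find those rows
      -- and fails with HA_ERR_KEY_NOT_FOUND (1032) instead of the
      -- conflict being resolved at certification.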

      Our Galera configuration is as follows:

      [mysqld]
      binlog_format=ROW
      binlog_checksum=NONE
      default-storage-engine=innodb
      innodb_autoinc_lock_mode=2
      bind-address=0.0.0.0
      slave_type_conversions="ALL_NON_LOSSY,ALL_LOSSY"
       
      # Galera Provider Configuration
      wsrep_on=ON
      wsrep_provider=/usr/lib/galera/libgalera_smm.so
      wsrep_provider_options="gcache.size=4096M;gcache.recover=yes"
       
      # Galera Cluster Configuration
      wsrep_cluster_name="moloni_cluster"
      wsrep_cluster_address=gcomm://<usually five servers here>
       
      # Galera Synchronization Configuration
      wsrep_sst_method=mariabackup
      wsrep_sst_auth="<username:password>"
      wsrep_sst_donor="fiona"
       
      # Galera Node Configuration
      wsrep_node_address="10.32.0.27"
      wsrep_node_name="grace"
      wsrep_slave_threads=1
      wsrep_slave_UK_checks=ON
       
      # Retry autocommit queries
      wsrep_retry_autocommit=4
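
      For anyone diagnosing a similar shutdown, each node's view of the cluster can be compared with the standard Galera status variables; a quick sketch:

      -- Run on every node and compare the answers:
      SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';        -- nodes this node currently sees
      SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';      -- Primary vs. non-Primary component
      SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'; -- Synced / Donor / Joined / ...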
      

      Analysing the timing of the crashes against the order in which we upgraded or recovered the machines, we're tempted to point at the machine running 10.2.41: the first crashes came the morning after that machine was upgraded from 10.1 at dawn; our test on the 17th crashed just after we had rejoined that machine to the cluster; and this morning's errors came right after that machine was recovered again.

      Between the 12th and the dawn of the 17th, for instance, the cluster was working with three machines, all on 10.2.40, with no major issues. The same goes for the three weeks since we started the rolling upgrade, and long before that.

      This feels like a bug: nothing like it happened in years on 10.1, and the queries that triggered the problems have been in our codebase forever.

      Attachments

        logNovember12th
        logNovember18th

        Activity

          mjamado Marco Amado added a comment - edited

          We ran a battery of tests on our production cluster (after hours) to try to pinpoint the problem.

          We chased the HA_ERR_FOUND_DUPP_KEY on parallel insertions of the same unique key, which struck us as the easier case to reproduce. On each test we sent 3, 5, 10, 20, 50, 100, 200 and 500 requests in parallel, in an attempt to replicate the issue.

          We expect some of the requests (the majority, actually) to fail with a deadlock error; what is not expected is a full cluster shutdown. (The status counters that separate the two cases are sketched after the test list.)

          • Test #1: 4-machine cluster, all on 10.2.40, 3 of them answering requests. No cluster crash;
          • Test #2: 5-machine cluster, all on 10.2.40, 4 of them answering requests. No cluster crash;
          • Test #3: 5-machine cluster, 4 machines on 10.2.40 answering requests, plus one on 10.2.41 not answering requests. No cluster crash;
          • Test #4: 5-machine cluster, 4 on 10.2.40 and 1 on 10.2.41, with 4 answering requests, including the 10.2.41 one. Cluster crashed, at a mere 10 requests in parallel.
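
          For reference, expected certification conflicts and genuine apply failures show up in different places; a sketch using stock Galera status counters:

          -- Harmless, expected conflict handling accumulates here:
          SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures'; -- write sets rejected at certification
          SHOW GLOBAL STATUS LIKE 'wsrep_local_bf_aborts';     -- local transactions aborted by replicated ones
          -- A genuine apply failure, by contrast, logs "Failed to apply
          -- write set" and, as above, takes the node (here the whole
          -- cluster) down.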

          The logs are pretty much the same as the ones in logNovember12th above.

          There's clearly an issue with replication/certification/something in 10.2.41. We're downgrading that machine to 10.2.40 again, but we really wanted to move on to 10.3, 10.4 and beyond, and now we can't, because every newer version carries the same thing that's in 10.2.41.

          Help? Please?

          mjamado Marco Amado added a comment -

          This issue seems similar.

          drkspc drkspc added a comment -

          I am running into the same issue with MariaDB 10.4.22 as well as 10.4.24 on FreeBSD 13.0. Our setup is a three-node cluster with all nodes answering requests. For some time after upgrading one of the nodes to a newer version (the others are still on 10.4.21), things run fine; after a few hours (if not earlier), the cluster breaks up.

          Error log on first node:

          2022-02-21 13:18:13 2 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table roundcubemail.session; Duplicate entry '0mg2ecmqvmh74l0b0d93kckg25' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 558, Internal MariaDB error code: 1062
          2022-02-21 13:18:13 2 [Warning] WSREP: Event 3 Write_rows_v1 apply failed: 121, seqno 1423961457
          2022-02-21 13:18:13 2 [ERROR] WSREP: Failed to apply write set: gtid: e7cece19-5283-11ec-b6c6-1e78b2ac2f83:1423961457 server_id: b1c8d53e-9307-11ec-b294-66aa01186711 client_id: 55435 trx_id: 2304145 flags: 3 (start_transaction | commit)
          2022-02-21 13:18:13 2 [Note] WSREP: Closing send monitor...
          ...
          

          Second node:

          2022-02-21 13:18:13 2 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table roundcubemail.session; Duplicate entry '0mg2ecmqvmh74l0b0d93kckg25' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 558, Internal MariaDB error code: 1062
          2022-02-21 13:18:13 2 [Warning] WSREP: Event 3 Write_rows_v1 apply failed: 121, seqno 1423961457
          2022-02-21 13:18:13 2 [ERROR] WSREP: Failed to apply write set: gtid: e7cece19-5283-11ec-b6c6-1e78b2ac2f83:1423961457 server_id: b1c8d53e-9307-11ec-b294-66aa01186711 client_id: 55435 trx_id: 2304145 flags: 3 (start_transaction | commit)
          2022-02-21 13:18:13 2 [Note] WSREP: Closing send monitor...
          ...
          

          Third node:

          2022-02-21 13:18:13 1 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table roundcubemail.session; Duplicate entry '0mg2ecmqvmh74l0b0d93kckg25' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 558, Internal MariaDB error code: 1062
          2022-02-21 13:18:13 1 [Warning] WSREP: Event 3 Write_rows_v1 apply failed: 121, seqno 1423961457
          2022-02-21 13:18:13 1 [ERROR] WSREP: Failed to apply write set: gtid: e7cece19-5283-11ec-b6c6-1e78b2ac2f83:1423961457 server_id: b1c8d53e-9307-11ec-b294-66aa01186711 client_id: 55435 trx_id: 2304145 flags: 3 (start_transaction | commit)
          2022-02-21 13:18:13 1 [Note] WSREP: Closing send monitor...
          ... (cluster partitioning)
          2022-02-21 13:18:13 55435 [ERROR] WSREP: trx_replay failed for: 7, schema: roundcubemail, query: INSERT INTO `session` (`sess_id`, `vars`, `ip`, `changed`) VALUES ('0mg2ecmqvmh74l0b0d93kckg25', 'dGVtcHxiOjE7bGFuZ3VhZ2V8czo1OiJkZV9ERSI7c2tpbl9jb25maWd8YTowOnt9cmVxdWVzdF90b2tlbnxzOjMyOiJGUWVMV09hekJxb3BLeFl6WjdpdGhGSnZXMkVkcVEyQiI7', 'some.ip.address.xyz', now())
          2022-02-21 13:18:13 55435 [ERROR] Aborting
          /usr/local/etc/mysql/wsrep_notify.sh: MEMBERS: parameter not set
          220221 13:18:14 [ERROR] mysqld got signal 11 ;
          This could be because you hit a bug. It is also possible that this binary
          or one of the libraries it was linked against is corrupt, improperly built,
          or misconfigured. This error can also be caused by malfunctioning hardware.
           
          To report this bug, see https://mariadb.com/kb/en/reporting-bugs
           
          We will try our best to scrape up some info that will hopefully help
          diagnose the problem, but since we have already crashed, 
          something is definitely wrong and this may fail.
           
          Server version: 10.4.24-MariaDB-log
          key_buffer_size=402653184
          read_buffer_size=2097152
          max_used_connections=23
          max_threads=1339
          thread_count=4
          It is possible that mysqld could use up to 
          key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 5910248 K  bytes of memory
          Hope that's ok; if not, decrease some variables in the equation.
           
          Thread pointer: 0x0
          Attempting backtrace. You can use the following information to find out
          where mysqld died. If you see no messages after this, something went
          terribly wrong...
          stack_bottom = 0x0 thread_stack 0x49000
          

          Is there any progress on this issue? These failures prevent me from upgrading our cluster to a newer version: since every release after 10.4.21 also carries security fixes, I am left choosing between cluster breakups and known-vulnerable versions of MariaDB.


          People

            Assignee: Unassigned
            Reporter: mjamado Marco Amado
            Votes: 0
            Watchers: 3
