[MDEV-27085] Total Galera Cluster failure on Delete_rows_v1 Created: 2021-11-18  Updated: 2022-03-01

Status: Open
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.2.40, 10.2.41
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Marco Amado Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: crash, galera, mariadb
Environment: Ubuntu 18.04

Attachments: logNovember12th (HTML), logNovember18th (HTML)

 Description   

We've run a multi-primary Galera Cluster with five MariaDB 10.1 nodes for the last 5 years, without major issues.

In the past month, we started a rolling upgrade process to 10.2.40, intending to go up to 10.4 or 10.5.

We upgraded one machine every one or two days, giving each time to warm its caches (fill RAM), so as to be the least disruptive to our clients.

Last Thursday, November 11th, the last one was upgraded, and it picked up 10.2.41 instead of the 10.2.40 running on the other machines.

The next day, Friday the 12th, at one of our peak hours, the whole cluster came crashing down:

Slave SQL: Could not execute Write_rows_v1 event on table moloni_geral.documentos; Duplicate entry '27:409702:52248:0' for key 'doc_unico', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 1518, Internal MariaDB error code: 1062

The logNovember12th file shows this same output on all the servers, which strikes me as odd: shouldn't at least one of them report something other than this?

We know how this happened: an API customer tried to insert two documents at the same time, and the doc_unico unique key ended up violated.

However, this pattern worked for years without problems, and our load that day was nowhere near the worst we've seen.

We managed to reproduce the issue (inadvertently, while trying to implement a request throttler) on the morning of November 17th, with the exact same results.

We've also hit another error, four times now: once on the afternoon of the 12th, after recovering some of the cluster machines, and three more times today.

Slave SQL: Could not execute Delete_rows_v1 event on table moloni_geral.AC_remain_cookies; Can't find record in 'AC_remain_cookies', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 501, Internal MariaDB error code: 1032

The logNovember18th file contains the message that, again, appears on all the servers. Between today's events, we tried setting wsrep_slave_threads=1, to no avail; the error still occurred.

The query that triggers this error is the following:

DELETE FROM AC_remain_cookies
WHERE hash = [some user hash]
  OR expires < [a year ago]

And the structure is the following:

CREATE TABLE `AC_remain_cookies` (
  `hash` varchar(128) NOT NULL,
  `ip` varchar(250) NOT NULL,
  `user_id` int(11) NOT NULL,
  `created` datetime NOT NULL,
  `expires` datetime NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
 
ALTER TABLE `AC_remain_cookies`
  ADD PRIMARY KEY (`hash`),
  ADD KEY `expires` (`expires`),
  ADD KEY `user_id` (`user_id`);

This table is very, very small, averaging about 15k rows.

Our Galera configuration is as follows:

[mysqld]
binlog_format=ROW
binlog_checksum=NONE
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
bind-address=0.0.0.0
slave_type_conversions="ALL_NON_LOSSY,ALL_LOSSY"
 
# Galera Provider Configuration
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_provider_options="gcache.size=4096M;gcache.recover=yes"
 
# Galera Cluster Configuration
wsrep_cluster_name="moloni_cluster"
wsrep_cluster_address=gcomm://<usually five servers here>
 
# Galera Synchronization Configuration
wsrep_sst_method=mariabackup
wsrep_sst_auth="<username:password>"
wsrep_sst_donor="fiona"
 
# Galera Node Configuration
wsrep_node_address="10.32.0.27"
wsrep_node_name="grace"
wsrep_slave_threads=1
wsrep_slave_UK_checks=ON
 
# Retry autocommit queries
wsrep_retry_autocommit=4

Analysing the timing of the crashes and the order in which we upgraded or recovered the machines, we're tempted to point at the machine running 10.2.41: the first crashes came the morning after the night that machine was upgraded from 10.1, our test on the 17th crashed right after we had rejoined this machine to the cluster, and this morning's errors came after it was recovered yet again.

Between the 12th and the early hours of the 17th, for instance, the cluster ran with three machines, all on 10.2.40, with no major issues; as it had, of course, since we started the rolling upgrade three weeks ago, and long before that.

This feels like a bug: nothing like it happened for years on 10.1, and the queries that triggered the problems have been in our codebase forever.



 Comments   
Comment by Marco Amado [ 2021-12-02 ]

We ran a battery of tests on our production cluster (after hours) to try to pinpoint the problem.

We chased the HA_ERR_FOUND_DUPP_KEY on parallel insertions of the same unique key, which struck us as the easier case to reproduce. On each test we sent 3, 5, 10, 20, 50, 100, 200 and 500 requests in parallel, in an attempt to replicate the issue.

We expect some of the requests (the majority, actually) to fail with a deadlock error; what is not expected is a full cluster shutdown.

  • Test #1: 4-machine cluster, all on 10.2.40, 3 of them answering requests. No cluster crash;
  • Test #2: 5-machine cluster, all on 10.2.40, 4 of them answering requests. No cluster crash;
  • Test #3: 5-machine cluster, 4 on 10.2.40 answering requests, the remaining one on 10.2.41 but idle. No cluster crash;
  • Test #4: 5-machine cluster, 4 on 10.2.40 and 1 on 10.2.41, 4 answering requests, including the 10.2.41 one. Cluster crashed, at a mere 10 requests in parallel.
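For reference, the shape of these parallel-request tests can be sketched with a small harness like the one below (Python, for illustration only; `attempt_insert` is a hypothetical stand-in for the real API call that issues the conflicting INSERT, not part of our actual tooling):

```python
import concurrent.futures

def run_parallel(n, attempt_insert):
    """Fire n insert attempts concurrently and count outcomes.

    attempt_insert should raise on failure (e.g. a deadlock error
    reported by the server); any exception counts as a failed request.
    """
    ok = failed = 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(attempt_insert) for _ in range(n)]
        for fut in concurrent.futures.as_completed(futures):
            try:
                fut.result()
                ok += 1
            except Exception:
                failed += 1
    # Expected behaviour: some requests succeed, the rest fail with
    # deadlocks; the cluster itself should never go down.
    return ok, failed
```

In the real tests, the insert callable opened a connection to one of the nodes and ran an INSERT that collides on the doc_unico key.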

The logs are pretty much the same as those in logNovember12th above.

There's clearly an issue with replication/certification/something in 10.2.41. We're downgrading that machine to 10.2.40 again, but we really wanted to move on to 10.3, 10.4 and beyond, and now we can't, because all later versions carry the same change that's in 10.2.41.

Help? Please?

Comment by Marco Amado [ 2021-12-02 ]

This issue seems similar.

Comment by drkspc [ 2022-03-01 ]

I am running into the same issue with MariaDB 10.4.22 as well as 10.4.24 on FreeBSD 13.0. Our setup is a three-node cluster with all nodes answering requests. For some time after upgrading one of the nodes to a newer version (the others are still on 10.4.21), things run fine. After a few hours (if not earlier), the cluster breaks up.

Error log on first node:

2022-02-21 13:18:13 2 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table roundcubemail.session; Duplicate entry '0mg2ecmqvmh74l0b0d93kckg25' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 558, Internal MariaDB error code: 1062
2022-02-21 13:18:13 2 [Warning] WSREP: Event 3 Write_rows_v1 apply failed: 121, seqno 1423961457
2022-02-21 13:18:13 2 [ERROR] WSREP: Failed to apply write set: gtid: e7cece19-5283-11ec-b6c6-1e78b2ac2f83:1423961457 server_id: b1c8d53e-9307-11ec-b294-66aa01186711 client_id: 55435 trx_id: 2304145 flags: 3 (start_transaction | commit)
2022-02-21 13:18:13 2 [Note] WSREP: Closing send monitor...
...

Second node:

2022-02-21 13:18:13 2 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table roundcubemail.session; Duplicate entry '0mg2ecmqvmh74l0b0d93kckg25' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 558, Internal MariaDB error code: 1062
2022-02-21 13:18:13 2 [Warning] WSREP: Event 3 Write_rows_v1 apply failed: 121, seqno 1423961457
2022-02-21 13:18:13 2 [ERROR] WSREP: Failed to apply write set: gtid: e7cece19-5283-11ec-b6c6-1e78b2ac2f83:1423961457 server_id: b1c8d53e-9307-11ec-b294-66aa01186711 client_id: 55435 trx_id: 2304145 flags: 3 (start_transaction | commit)
2022-02-21 13:18:13 2 [Note] WSREP: Closing send monitor...
...

Third node:

2022-02-21 13:18:13 1 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table roundcubemail.session; Duplicate entry '0mg2ecmqvmh74l0b0d93kckg25' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 558, Internal MariaDB error code: 1062
2022-02-21 13:18:13 1 [Warning] WSREP: Event 3 Write_rows_v1 apply failed: 121, seqno 1423961457
2022-02-21 13:18:13 1 [ERROR] WSREP: Failed to apply write set: gtid: e7cece19-5283-11ec-b6c6-1e78b2ac2f83:1423961457 server_id: b1c8d53e-9307-11ec-b294-66aa01186711 client_id: 55435 trx_id: 2304145 flags: 3 (start_transaction | commit)
2022-02-21 13:18:13 1 [Note] WSREP: Closing send monitor...
... (cluster partitioning)
2022-02-21 13:18:13 55435 [ERROR] WSREP: trx_replay failed for: 7, schema: roundcubemail, query: INSERT INTO `session` (`sess_id`, `vars`, `ip`, `changed`) VALUES ('0mg2ecmqvmh74l0b0d93kckg25', 'dGVtcHxiOjE7bGFuZ3VhZ2V8czo1OiJkZV9ERSI7c2tpbl9jb25maWd8YTowOnt9cmVxdWVzdF90b2tlbnxzOjMyOiJGUWVMV09hekJxb3BLeFl6WjdpdGhGSnZXMkVkcVEyQiI7', 'some.ip.address.xyz', now())
2022-02-21 13:18:13 55435 [ERROR] Aborting
/usr/local/etc/mysql/wsrep_notify.sh: MEMBERS: parameter not set
220221 13:18:14 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.4.24-MariaDB-log
key_buffer_size=402653184
read_buffer_size=2097152
max_used_connections=23
max_threads=1339
thread_count=4
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 5910248 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000

Is there any progress on this issue? These failures prevent me from upgrading our cluster to a newer version. In particular, I am left with either cluster breakups or known-vulnerable versions of MariaDB, as all releases after 10.4.21 were security updates.

Generated at Thu Feb 08 09:50:13 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.