Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.6.18
Fix Version/s: None
Component/s: None
Environment: Ubuntu 22
Description
Two weeks ago our complete 3-node Galera cluster hung.
We had to stop all nodes and eventually had to restore from backup.
I'm not sure if this is related or the cause, but that morning node 1 and node 3 did an automatic unattended upgrade from 10.6.15 -> 10.6.18.
However, these APT upgrades did not appear to update the Galera replication library; the existing Galera library version for 10.6.15 seems to have been left untouched.
For some reason node 2 did not upgrade and stayed on 10.6.15.
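As a side note on the version mismatch: the per-node server and Galera provider versions can be compared with standard status variables, e.g. (minimal check, run on each node):
-- compare server version and Galera provider version per node
SELECT VERSION();
SHOW GLOBAL STATUS LIKE 'wsrep_provider_version';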
In the early afternoon the transaction queues suddenly started piling up on the nodes, and our applications had trouble selecting/inserting.
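Queue growth like this shows up in the standard Galera status counters; a minimal way to watch it while the pile-up is happening would be something like:
-- run periodically on each node
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';
SHOW GLOBAL STATUS LIKE 'wsrep_local_send_queue';
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';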
Node 2 displayed issues:
Dec 18 13:03:51 localhost mariadbd[12191]: 2024-12-18 13:03:51 2 [Note] InnoDB: WSREP: BF lock wait long for trx:0x4d3e27a3a query: insert into clientDeviceAlertDetail (clientId, deviceCount, state, alertTypeId,
Dec 18 13:03:51 localhost mariadbd[12191]: alertCount)
Dec 18 13:03:51 localhost mariadbd[12191]: select clientId, deviceCount, state, alertTypeId, alertCount
Dec 18 13:03:51 localhost mariadbd[12191]: from vClientDeviceAlertDetail
Dec 18 13:04:10 localhost systemd[1]: session-611545.scope: Deactivated successfully.
Dec 18 13:04:10 localhost systemd[1]: session-611545.scope: Consumed 55.307s CPU time.
Dec 18 13:04:10 localhost systemd[1]: Started Session 611553 of User clustercontrol.
Dec 18 13:04:41 localhost mariadbd[12191]: 2024-12-18 13:04:41 2 [Note] InnoDB: WSREP: BF lock wait long for trx:0x4d3e27a3a query: insert into clientDeviceAlertDetail (clientId, deviceCount, state, alertTypeId,
Dec 18 13:04:41 localhost mariadbd[12191]: alertCount)
Dec 18 13:04:41 localhost mariadbd[12191]: select clientId, deviceCount, state, alertTypeId, alertCount
Dec 18 13:04:41 localhost mariadbd[12191]: from vClientDeviceAlertDetail
Dec 18 13:04:47 localhost systemd[1]: session-611551.scope: Deactivated successfully.
Dec 18 13:04:47 localhost systemd[1]: session-611551.scope: Consumed 25.793s CPU time.
Dec 18 13:04:47 localhost systemd[1]: Started Session 611554 of User clustercontrol.
Dec 18 13:05:31 localhost mariadbd[12191]: 2024-12-18 13:05:31 2 [Note] InnoDB: WSREP: BF lock wait long for trx:0x4d3e27a3a query: insert into clientDeviceAlertDetail (clientId, deviceCount, state, alertTypeId,
Dec 18 13:05:31 localhost mariadbd[12191]: alertCount)
Dec 18 13:05:31 localhost mariadbd[12191]: select clientId, deviceCount, state, alertTypeId, alertCount
Dec 18 13:05:31 localhost mariadbd[12191]: from vClientDeviceAlertDetail
These inserts originate from a scheduled event on node 1 (only node 1 has the event scheduler enabled):
create or replace definer = admin@`%` event opmon.e_clientDeviceAlertDetail
  on schedule every '1' MINUTE
  starts '2024-04-22 11:05:00'
  enable
  comment 'Populate clientDeviceAlertDetail'
do
begin
  SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
  insert into clientDeviceAlertDetail (clientId, deviceCount, state, alertTypeId, alertCount)
    select clientId, deviceCount, state, alertTypeId, alertCount
    from vClientDeviceAlertDetail;
  SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
end;
CREATE TABLE `clientDeviceAlertDetail` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `logTime` datetime NOT NULL DEFAULT current_timestamp(),
  `clientId` int(10) unsigned NOT NULL,
  `deviceCount` int(10) unsigned NOT NULL,
  `state` tinyint(1) unsigned DEFAULT NULL,
  `alertTypeId` smallint(2) unsigned NOT NULL,
  `alertCount` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`clientId`,`alertTypeId`,`logTime`,`id`),
  KEY `id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=23263622621 DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci
PARTITION BY RANGE COLUMNS(`logTime`) |
(PARTITION `p_202411` VALUES LESS THAN ('2024-12-01 00:00:00') ENGINE = InnoDB, |
PARTITION `p_202412` VALUES LESS THAN ('2025-01-01 00:00:00') ENGINE = InnoDB, |
PARTITION `p_202501` VALUES LESS THAN ('2025-02-01 00:00:00') ENGINE = InnoDB, |
PARTITION `p_max` VALUES LESS THAN (MAXVALUE) ENGINE = InnoDB) |
Node 1
We eventually had to shut down node 1, which produced the following errors about 30 minutes after the errors/warnings started on node 2.
We are not sure if this is normal event scheduler behavior when shutting down a node.
Dec 18 13:33:40 db1 mariadbd[3487303]: 2024-12-18 13:33:40 38515 [Warning] WSREP: TO isolation failed for: 1, schema: opmon, sql: insert into clientDeviceAlertDetail (clientId, deviceCount, state, alertTypeId,
Dec 18 13:33:40 db1 mariadbd[3487303]: alertCount)
Dec 18 13:33:40 db1 mariadbd[3487303]: select clientId, deviceCount, state, alertTypeId, alertCount
Dec 18 13:33:40 db1 mariadbd[3487303]: from vClientDeviceAlertDetail. Operation timed out.
Dec 18 13:33:40 db1 mariadbd[3487303]: 2024-12-18 13:33:40 38601 [Warning] WSREP: TO isolation failed for: 1, schema: opmon, sql: insert into clientDeviceAlertDetail (clientId, deviceCount, state, alertTypeId,
Dec 18 13:33:40 db1 mariadbd[3487303]: alertCount)
Dec 18 13:33:40 db1 mariadbd[3487303]: select clientId, deviceCount, state, alertTypeId, alertCount
Dec 18 13:33:40 db1 mariadbd[3487303]: from vClientDeviceAlertDetail. Operation timed out.
Dec 18 13:33:40 db1 mariadbd[3487303]: from vClientDeviceAlertDetail. Operation timed out.
Dec 18 13:33:40 db1 mariadbd[3487303]: 2024-12-18 13:33:40 39983 [Warning] WSREP: TO isolation failed for: 1, schema: opmon, sql: FLUSH STATUS. Operation timed out.
Dec 18 13:33:40 db1 mariadbd[3487303]: 2024-12-18 13:33:40 40408 [ERROR] Event Scheduler: [admin@%][opmon.e_clientDeviceAlertDetail] Lock wait timeout exceeded; try restarting transaction
Dec 18 13:33:40 db1 mariadbd[3487303]: 2024-12-18 13:33:40 40480 [ERROR] Event Scheduler: [admin@%][opmon.e_clientDeviceAlertDetail] Lock wait timeout exceeded; try restarting transaction
Dec 18 13:33:40 db1 mariadbd[3487303]: 2024-12-18 13:33:40 40189 [ERROR] Event Scheduler: [admin@%][opmon.e_clientDeviceAlertDetail] Lock wait timeout exceeded; try restarting transaction
Dec 18 13:33:40 db1 mariadbd[3487303]: 2024-12-18 13:33:40 40189 [Note] Event Scheduler: [admin@%][opmon.e_clientDeviceAlertDetail] At line 6 in opmon.e_clientdevicealertdetail
Dec 18 13:33:40 db1 mariadbd[3487303]: 2024-12-18 13:33:40 40189 [Note] Event Scheduler: [admin@%].[opmon.e_clientDeviceAlertDetail] event execution failed.
So I am unsure whether the above happened because the nodes were out of sync with respect to MariaDB/Galera versions, or whether a bug in 10.6.18 caused this.
All I could find that might be related is:
MDEV-34269
MDEV-33456 (This seems more like the behavior we experienced)
MDEV-30161
Before restoring node 1 and node 3, I reinstalled 10.6.15, so the cluster is now on the same version again; so far there have been no issues.
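For completeness, a simple sanity check that can be run on each node after a rollback like this (standard status variables only):
-- confirm matching versions and a healthy cluster on each node
SELECT VERSION();
SHOW GLOBAL STATUS LIKE 'wsrep_provider_version';
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'; -- expect 'Synced'
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'; -- expect 3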
Attachments
Issue Links
causes: MDEV-33456 InnoDB: WSREP: BF lock wait long for trx:xxxxxx query: update (Open)
Is there perhaps any feedback that can be given as to how or why this could have happened?