Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.6.18
Fix Version/s: None
Component/s: None
Environment: Ubuntu 22
Description
Two weeks ago our complete 3-node Galera cluster hung.
We had to stop all nodes and eventually had to restore from backup.
I'm not sure if this is related or the cause, but that morning node 1 and node 3 did an automatic unattended upgrade from 10.6.15 -> 10.6.18.
However, these APT upgrades did not appear to update the Galera replication library; the existing Galera library version for 10.6.15 seems to have been left untouched.
For some reason node 2 did not upgrade and stayed on 10.6.15.
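As a side note on the version mismatch: the per-node server and Galera provider versions can be compared with standard status variables, e.g. (minimal check, run on each node):
-- compare server version and Galera provider version per node
SELECT VERSION();
SHOW GLOBAL STATUS LIKE 'wsrep_provider_version';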
In the early afternoon the transaction queues suddenly started piling up on the nodes, and our applications had trouble selecting/inserting.
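Queue growth like this shows up in the standard Galera status counters; a minimal way to watch it while the pile-up is happening would be something like:
-- run periodically on each node
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';
SHOW GLOBAL STATUS LIKE 'wsrep_local_send_queue';
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';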
Node 2 displayed issues:
Dec 18 13:03:51 localhost mariadbd[12191]: 2024-12-18 13:03:51 2 [Note] InnoDB: WSREP: BF lock wait long for trx:0x4d3e27a3a query: insert into clientDeviceAlertDetail (clientId, deviceCount, state, alertTypeId,
Dec 18 13:03:51 localhost mariadbd[12191]: alertCount)
Dec 18 13:03:51 localhost mariadbd[12191]: select clientId, deviceCount, state, alertTypeId, alertCount
Dec 18 13:03:51 localhost mariadbd[12191]: from vClientDeviceAlertDetail
Dec 18 13:04:10 localhost systemd[1]: session-611545.scope: Deactivated successfully.
Dec 18 13:04:10 localhost systemd[1]: session-611545.scope: Consumed 55.307s CPU time.
Dec 18 13:04:10 localhost systemd[1]: Started Session 611553 of User clustercontrol.
Dec 18 13:04:41 localhost mariadbd[12191]: 2024-12-18 13:04:41 2 [Note] InnoDB: WSREP: BF lock wait long for trx:0x4d3e27a3a query: insert into clientDeviceAlertDetail (clientId, deviceCount, state, alertTypeId,
Dec 18 13:04:41 localhost mariadbd[12191]: alertCount)
Dec 18 13:04:41 localhost mariadbd[12191]: select clientId, deviceCount, state, alertTypeId, alertCount
Dec 18 13:04:41 localhost mariadbd[12191]: from vClientDeviceAlertDetail
Dec 18 13:04:47 localhost systemd[1]: session-611551.scope: Deactivated successfully.
Dec 18 13:04:47 localhost systemd[1]: session-611551.scope: Consumed 25.793s CPU time.
Dec 18 13:04:47 localhost systemd[1]: Started Session 611554 of User clustercontrol.
Dec 18 13:05:31 localhost mariadbd[12191]: 2024-12-18 13:05:31 2 [Note] InnoDB: WSREP: BF lock wait long for trx:0x4d3e27a3a query: insert into clientDeviceAlertDetail (clientId, deviceCount, state, alertTypeId,
Dec 18 13:05:31 localhost mariadbd[12191]: alertCount)
Dec 18 13:05:31 localhost mariadbd[12191]: select clientId, deviceCount, state, alertTypeId, alertCount
Dec 18 13:05:31 localhost mariadbd[12191]: from vClientDeviceAlertDetail
These inserts originate from a scheduled event on node 1 (only node 1 has the event scheduler enabled):
create or replace definer = admin@`%` event opmon.e_clientDeviceAlertDetail
  on schedule every '1' MINUTE
  starts '2024-04-22 11:05:00'
  enable
  comment 'Populate clientDeviceAlertDetail'
do
begin
  SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
  insert into clientDeviceAlertDetail (clientId, deviceCount, state, alertTypeId, alertCount)
    select clientId, deviceCount, state, alertTypeId, alertCount
    from vClientDeviceAlertDetail;
  SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
end;
CREATE TABLE `clientDeviceAlertDetail` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `logTime` datetime NOT NULL DEFAULT current_timestamp(),
  `clientId` int(10) unsigned NOT NULL,
  `deviceCount` int(10) unsigned NOT NULL,
  `state` tinyint(1) unsigned DEFAULT NULL,
  `alertTypeId` smallint(2) unsigned NOT NULL,
  `alertCount` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`clientId`,`alertTypeId`,`logTime`,`id`),
  KEY `id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=23263622621 DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci
PARTITION BY RANGE COLUMNS(`logTime`) |
(PARTITION `p_202411` VALUES LESS THAN ('2024-12-01 00:00:00') ENGINE = InnoDB, |
PARTITION `p_202412` VALUES LESS THAN ('2025-01-01 00:00:00') ENGINE = InnoDB, |
PARTITION `p_202501` VALUES LESS THAN ('2025-02-01 00:00:00') ENGINE = InnoDB, |
PARTITION `p_max` VALUES LESS THAN (MAXVALUE) ENGINE = InnoDB) |
Node 1
We eventually had to shut down node 1, which produced the following errors about 30 minutes after the errors/warnings started on node 2.
We are not sure if this is normal event scheduler behavior when shutting down a node.
Dec 18 13:33:40 db1 mariadbd[3487303]: 2024-12-18 13:33:40 38515 [Warning] WSREP: TO isolation failed for: 1, schema: opmon, sql: insert into clientDeviceAlertDetail (clientId, deviceCount, state, alertTypeId,
Dec 18 13:33:40 db1 mariadbd[3487303]: alertCount)
Dec 18 13:33:40 db1 mariadbd[3487303]: select clientId, deviceCount, state, alertTypeId, alertCount
Dec 18 13:33:40 db1 mariadbd[3487303]: from vClientDeviceAlertDetail. Operation timed out.
Dec 18 13:33:40 db1 mariadbd[3487303]: 2024-12-18 13:33:40 38601 [Warning] WSREP: TO isolation failed for: 1, schema: opmon, sql: insert into clientDeviceAlertDetail (clientId, deviceCount, state, alertTypeId,
Dec 18 13:33:40 db1 mariadbd[3487303]: alertCount)
Dec 18 13:33:40 db1 mariadbd[3487303]: select clientId, deviceCount, state, alertTypeId, alertCount
Dec 18 13:33:40 db1 mariadbd[3487303]: from vClientDeviceAlertDetail. Operation timed out.
Dec 18 13:33:40 db1 mariadbd[3487303]: from vClientDeviceAlertDetail. Operation timed out.
Dec 18 13:33:40 db1 mariadbd[3487303]: 2024-12-18 13:33:40 39983 [Warning] WSREP: TO isolation failed for: 1, schema: opmon, sql: FLUSH STATUS. Operation timed out.
Dec 18 13:33:40 db1 mariadbd[3487303]: 2024-12-18 13:33:40 40408 [ERROR] Event Scheduler: [admin@%][opmon.e_clientDeviceAlertDetail] Lock wait timeout exceeded; try restarting transaction
Dec 18 13:33:40 db1 mariadbd[3487303]: 2024-12-18 13:33:40 40480 [ERROR] Event Scheduler: [admin@%][opmon.e_clientDeviceAlertDetail] Lock wait timeout exceeded; try restarting transaction
Dec 18 13:33:40 db1 mariadbd[3487303]: 2024-12-18 13:33:40 40189 [ERROR] Event Scheduler: [admin@%][opmon.e_clientDeviceAlertDetail] Lock wait timeout exceeded; try restarting transaction
Dec 18 13:33:40 db1 mariadbd[3487303]: 2024-12-18 13:33:40 40189 [Note] Event Scheduler: [admin@%][opmon.e_clientDeviceAlertDetail] At line 6 in opmon.e_clientdevicealertdetail
Dec 18 13:33:40 db1 mariadbd[3487303]: 2024-12-18 13:33:40 40189 [Note] Event Scheduler: [admin@%].[opmon.e_clientDeviceAlertDetail] event execution failed.
So I am unsure whether the above happened because the nodes were out of sync with respect to MariaDB/Galera versions, or whether a bug in 10.6.18 caused this.
All I could find that might be related is:
MDEV-34269
MDEV-33456 (This seems more like the behavior we experienced)
MDEV-30161
Before restoring node 1 and node 3, I reinstalled 10.6.15, so the cluster is now on the same version again; so far there have been no issues.
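For completeness, a simple sanity check that can be run on each node after a rollback like this (standard status variables only):
-- confirm matching versions and a healthy cluster on each node
SELECT VERSION();
SHOW GLOBAL STATUS LIKE 'wsrep_provider_version';
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'; -- expect 'Synced'
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'; -- expect 3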
Attachments
Issue Links
causes: MDEV-33456 InnoDB: WSREP: BF lock wait long for trx:xxxxxx query: update (Open)
Is there perhaps any feedback that can be given as to how or why this could have happened?