[MDEV-35002] ALTER TABLE replicating (TOI) to other hosts in cluster caused "long BF lock" (for over 20 minutes) on downstream servers - Jira

Details

Type: Bug
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.11.9
Fix Version/s: 10.11
Component/s: Galera
Labels:
None
Environment:
Ubuntu 22.04, MariaDB 10.11.9, Galera 4 Specifically "26.4.19-ubu2204",
3 hosts in Galera cluster.

Description

Hi, we're running a Galera cluster, and we ran into an issue with an "ALTER TABLE" executing without issue on the write host but totally locking up the other servers in the cluster (during replication).

Got a lot of these messages in the logs of the servers that got stuck:

2024-09-19 15:49:09 14 [Note] InnoDB: WSREP: BF lock wait long for trx:0xd9128c4 query: ALTER TABLE some_table_name

    ADD some_field_name enum('VALUE1', 'VALUE2', 'VALUE3', 'VALUE4') DEFAULT 'VALUE4' NOT null

Wasn't even able to kill the MariaDB processes with a normal SIGTERM sent by systemd.

This SQL that was run (against our write host only) looked something like this:

SET SESSION alter_algorithm='INSTANT';

ALTER TABLE some_table_name

    ADD some_field_name enum('VALUE1', 'VALUE2', 'VALUE3', 'VALUE4') DEFAULT 'VALUE4' NOT null;

ALTER TABLE some_table_name

    ADD some_field_name2 smallint(5) DEFAULT 0 NOT null;

And immediately after that this was run:

SET SESSION alter_algorithm='INSTANT';

alter table some_table_name

    add some_field_name3 varchar(5) null;

alter table some_table_name

    add some_field_name4 varchar(5) null;

All the instances of `some_table_name` were referencing the same table in the above queries.

It should also be noted that all the queries being executed that seemed to have lead to this issue were successfully finished on the "write host" inside the cluster. The issue only seems to occur during replication stage of these alterations to the other servers inside the cluster.
Also the table being altered is only 105 records long and 68 columns wide at the time of the incident. Which does not justify long ALTER TABLE locking times in my opinion.

My two cents is that was somehow caused by some locking done on the write host during the consecutive alter statements on the same table. But that's just a wild guess.

Attachments

Issue Links

relates to

MDEV-34836 TOI (ALTER) hangs on a parent table if SR transaction is in progress on a child table

Closed

Activity

People

Assignee:: Julius Goryavsky

Reporter:: Tom

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 2024-09-24 12:03

Updated:: 2025-02-14 09:42

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server