Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-35002

ALTER TABLE replicating (TOI) to other hosts in cluster caused "long BF lock" (for over 20 minutes) on downstream servers

    XMLWordPrintable

Details

    • Bug
    • Status: Open (View Workflow)
    • Major
    • Resolution: Unresolved
    • 10.11.9
    • None
    • None
    • None
    • Ubuntu 22.04, MariaDB 10.11.9, Galera 4 Specifically "26.4.19-ubu2204",
      3 hosts in Galera cluster.

    Description

      Hi, we're running a Galera cluster, and we ran into an issue with an "ALTER TABLE" executing without issue on the write host but totally locking up the other servers in the cluster (during replication).

      Got a lot of these messages in the logs of the servers that got stuck:

      2024-09-19 15:49:09 14 [Note] InnoDB: WSREP: BF lock wait long for trx:0xd9128c4 query: ALTER TABLE some_table_name
          ADD some_field_name enum('VALUE1', 'VALUE2', 'VALUE3', 'VALUE4') DEFAULT 'VALUE4' NOT null
      

      Wasn't even able to kill the MariaDB processes with a normal SIGTERM sent by systemd.

      This SQL that was run (against our write host only) looked something like this:

      SET SESSION alter_algorithm='INSTANT';
      ALTER TABLE some_table_name
          ADD some_field_name enum('VALUE1', 'VALUE2', 'VALUE3', 'VALUE4') DEFAULT 'VALUE4' NOT null;
       
      ALTER TABLE some_table_name
          ADD some_field_name2 smallint(5) DEFAULT 0 NOT null;
      

      And immediately after that this was run:

      SET SESSION alter_algorithm='INSTANT';
      alter table some_table_name
          add some_field_name3 varchar(5) null;
       
      alter table some_table_name
          add some_field_name4 varchar(5) null;
      

      All the instances of `some_table_name` were referencing the same table in the above queries.

      It should also be noted that all the queries being executed that seemed to have lead to this issue were successfully finished on the "write host" inside the cluster. The issue only seems to occur during replication stage of these alterations to the other servers inside the cluster.
      Also the table being altered is only 105 records long and 68 columns wide at the time of the incident. Which does not justify long ALTER TABLE locking times in my opinion.

      My two cents is that was somehow caused by some locking done on the write host during the consecutive alter statements on the same table. But that's just a wild guess.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              bravoman Tom
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.