MariaDB Server / MDEV-28452

wsrep_ready: OFF after MDL BF-BF conflict

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 10.6.7, 10.7.3
    • Fix Version/s: N/A
    • Component/s: Galera
    • Labels: None
    • Environment: Ubuntu 20.04 LTS

    Description

      We are running a 2 node + arbitrator cluster.

      Galera sets wsrep_ready to OFF after an MDL BF-BF conflict on the second node.
      The mariadb service does not crash.

      logs:

      Apr 27 03:00:09 node02.mariadb mariadbd[1003]: 2022-04-27  3:00:09 8 [Note] WSREP: MDL BF-BF conflict
      Apr 27 03:00:09 node02.mariadb mariadbd[1003]: schema:  authc  
      Apr 27 03:00:09 node02.mariadb mariadbd[1003]: request: (8 #011seqno 3840795 #011wsrep (high priority, exec, executing) cmd 0 160 #011update `user` set `last_login` = '2022-04-27T01:00:09Z' where `id` = '131b4cd3-e390-4b31-b47d-d1a3d5cee3ee'<99><95>hb#023#001)
      Apr 27 03:00:09 node02.mariadb mariadbd[1003]: granted: (2 #011seqno 3840793 #011wsrep (toi, exec, committed) cmd 0 45 #011OPTIMIZE TABLE `log_history_daily`)
      Apr 27 03:00:09 node02.mariadb mariadbd[1003]: 2022-04-27  3:00:09 8 [ERROR] Aborting
      

      SHOW CREATE TABLE for the tables mentioned in the logs:

      CREATE TABLE `user` (
        `id` char(36) COLLATE utf8mb4_unicode_ci NOT NULL,
        `environment_id` char(36) COLLATE utf8mb4_unicode_ci NOT NULL,
        `username` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
        `password` char(60) COLLATE utf8mb4_unicode_ci NOT NULL,
        `status` enum('sign_up','invited','active','archived') COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT 'invited',
        `token` char(64) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
        `token_valid_till` datetime DEFAULT NULL,
        `last_login` datetime DEFAULT NULL,
        `password_updated_at` datetime DEFAULT NULL,
        `created_at` timestamp NULL DEFAULT NULL,
        `updated_at` timestamp NULL DEFAULT NULL,
        PRIMARY KEY (`id`),
        UNIQUE KEY `user_username_unique` (`username`),
        KEY `user_environment_id_index` (`environment_id`)
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
       
      CREATE TABLE `log_history_daily` (
        `type` enum('unknown','user','developer','api_key','api_env','mail_token','device') COLLATE utf8mb4_unicode_ci NOT NULL,
        `status` enum('valid','invalid') COLLATE utf8mb4_unicode_ci NOT NULL,
        `origin` enum('unknown','browser','go','android','ios','third_party') COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT 'unknown',
        `ip` varchar(45) COLLATE utf8mb4_unicode_ci NOT NULL,
        `value` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
        `user_id` char(36) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
        `date` date NOT NULL,
        `count` int(10) unsigned NOT NULL,
        `created_at` timestamp NULL DEFAULT NULL,
        `updated_at` timestamp NULL DEFAULT NULL,
        PRIMARY KEY (`type`,`status`,`origin`,`ip`,`value`,`date`),
        KEY `log_history_daily_user_id_foreign` (`user_id`),
        CONSTRAINT `log_history_daily_user_id_foreign` FOREIGN KEY (`user_id`) REFERENCES `user` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
      

      This happened on our test environment (MariaDB 10.7.3) and on our acceptance environment (MariaDB 10.6.7).
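
      For reference, a minimal sketch of the concurrent workload that was running when the conflict was logged. Both statements are taken from the log lines above; the assumption that the OPTIMIZE runs as a scheduled nightly job is ours, and we cannot reproduce the conflict on demand:

      -- Nightly maintenance, replicated to the other node via TOI:
      OPTIMIZE TABLE `log_history_daily`;

      -- Concurrently, a normal application write set being applied on node02:
      UPDATE `user`
         SET `last_login` = '2022-04-27T01:00:09Z'
       WHERE `id` = '131b4cd3-e390-4b31-b47d-d1a3d5cee3ee';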

          Activity

            karll Karl Levik added a comment - - edited

            I'm seeing the same thing.

            I ran a couple of ALTER TABLE statements which succeeded on the node where I was running them, but they actually got stuck on the other two nodes in the cluster. I only saw this later when I did "SHOW PROCESSLIST;". wsrep_ready had gone to OFF, and the nodes stopped working.

            The .err files on the two broken nodes contained messages including "WSREP: MDL BF-BF conflict", whereas the good node had messages such as:

            WSREP: MDL conflict db=name_of_database table=name_of_table ticket=3 solved by abort

            Only through a "kill -9" on the mariadbd process and subsequently restarting, which triggered an SST, was I able to get the two broken nodes back to a working state.
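
            For anyone wanting to check for the same state, these are the standard statements I used to spot the problem (nothing here is specific to this bug):

            -- On each node:
            SHOW GLOBAL STATUS LIKE 'wsrep_ready';                -- OFF on the broken nodes
            SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
            SHOW PROCESSLIST;                                     -- the stuck ALTER TABLE appliers show up here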

            violuke Luke Cousins added a comment -

            We're having this problem roughly once per week. How can we get more debug information to share with you to help get it fixed? Same as MDEV-28180, I think.

            UweB Uwe Beierlein added a comment -

            We have this problem as soon as we alter a table or add an index.

            kin Kin added a comment -

            This issue is not related to resources.
            In our case it was triggered by CREATE INDEX and INSERT INTO statements during a database migration script (a simplified shape is sketched below). Not particularly failsafe.
            This results in "WSREP has not yet prepared node for application use". The nodes restart after a while; sometimes they recover, and sometimes they end up in an endless crash loop.

            But it happens only when the cluster is running more than one node. I hope there will be a fix for this.
            From a developer's point of view it isn't ideal to potentially crash the cluster and have to do a backup recovery every time.
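
            A simplified shape of the migration script, just to illustrate the DDL/DML mix; the table and column names here are made up:

            CREATE INDEX `idx_example` ON `example_table` (`example_column`);
            INSERT INTO `example_table` (`example_column`) VALUES ('...');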

            kin Kin added a comment - - edited

            Update:

            We have given 1 CPU to each node. According to MariaDB we should be able to set wsrep_slave_threads to twice that number, but that also resulted in MDL conflicts.

            But when we set wsrep_slave_threads to 1, the issue was gone (sketched below). We ran the same script several times and were not able to reproduce the issue anymore.
            See: https://mariadb.com/kb/en/about-galera-replication/ under "Galera Slave Threads".

            For me it is not clear whether this is a CPU resource issue or some kind of race condition with the slave threads.
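
            For completeness, the workaround can be applied at runtime with a standard statement (and made persistent via the server configuration):

            -- At runtime; does not survive a restart:
            SET GLOBAL wsrep_slave_threads = 1;
            -- We also set wsrep_slave_threads = 1 under [galera] in the server
            -- config file so that it persists across restarts.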

            rtuk Rick Tuk added a comment -

            We have recently updated our clusters to 10.11.7; the issue still persists in this version.

            Marak Jaroslav added a comment -

            Version 10.11.7 is affected too.

            janlindstrom Jan Lindström added a comment - - edited

            rtuk, Marak, kin, UweB, karll, violuke: Firstly, 10.7 is EOL and you should upgrade to a more recent version of MariaDB and the Galera library. Secondly, the version used on 10.6 is also very old and should be upgraded to a more recent version. Finally, for 10.11 I would like to have the full, unedited error logs from all nodes.


            People

              Assignee: janlindstrom Jan Lindström
              Reporter: rtuk Rick Tuk
              Votes: 6
              Watchers: 8

