[MDEV-28452] wsrep_ready: OFF after MDL BF-BF conflict Created: 2022-05-02  Updated: 2023-11-17

Status: Open
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.6.7, 10.7.3
Fix Version/s: 10.6

Type: Bug
Priority: Major
Reporter: Rick Tuk
Assignee: Seppo Jaakola
Resolution: Unresolved
Votes: 3
Labels: None
Environment:

Ubuntu 20.04 LTS



 Description   

We are running a 2 node + arbitrator cluster.

Galera sets wsrep_ready to OFF after an MDL BF-BF conflict on the second node.
The mariadb service does not crash.

logs:

Apr 27 03:00:09 node02.mariadb mariadbd[1003]: 2022-04-27  3:00:09 8 [Note] WSREP: MDL BF-BF conflict
Apr 27 03:00:09 node02.mariadb mariadbd[1003]: schema:  authc  
Apr 27 03:00:09 node02.mariadb mariadbd[1003]: request: (8 #011seqno 3840795 #011wsrep (high priority, exec, executing) cmd 0 160 #011update `user` set `last_login` = '2022-04-27T01:00:09Z' where `id` = '131b4cd3-e390-4b31-b47d-d1a3d5cee3ee'<99><95>hb#023#001)
Apr 27 03:00:09 node02.mariadb mariadbd[1003]: granted: (2 #011seqno 3840793 #011wsrep (toi, exec, committed) cmd 0 45 #011OPTIMIZE TABLE `log_history_daily`)
Apr 27 03:00:09 node02.mariadb mariadbd[1003]: 2022-04-27  3:00:09 8 [ERROR] Aborting
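
For anyone collecting more reports of this kind, the "request:" and "granted:" lines above can be pulled apart with a small script. The following is only a sketch based on the single log excerpt in this ticket: the regular expression, the helper name parse_bf_bf_conflict and the field names are hypothetical, and reading the first number in the parentheses as a thread id is an assumption. "#011" is taken to be the syslog escape for a tab character.

```python
import re

# Pattern for the "request:" / "granted:" lines that follow a
# "WSREP: MDL BF-BF conflict" note. Derived from the one excerpt in this
# ticket, so it may not match other server versions or log formats.
LOCK_LINE = re.compile(
    r"(?P<role>request|granted): \((?P<thread_id>\d+) #011seqno (?P<seqno>\d+) "
    r"#011wsrep \((?P<state>[^)]*)\) cmd \d+ \d+ #011(?P<sql>.*)\)\s*$"
)


def parse_bf_bf_conflict(lines):
    """Return a dict keyed by 'request' and 'granted' for one conflict report."""
    report = {}
    for line in lines:
        match = LOCK_LINE.search(line)
        if match is None:
            continue  # not a request/granted line
        info = match.groupdict()
        role = info.pop("role")
        info["thread_id"] = int(info["thread_id"])  # assumed meaning
        info["seqno"] = int(info["seqno"])
        report[role] = info
    return report


# The two lock lines from the log excerpt above, copied verbatim.
sample = [
    "Apr 27 03:00:09 node02.mariadb mariadbd[1003]: request: (8 #011seqno 3840795 "
    "#011wsrep (high priority, exec, executing) cmd 0 160 #011update `user` set "
    "`last_login` = '2022-04-27T01:00:09Z' where `id` = "
    "'131b4cd3-e390-4b31-b47d-d1a3d5cee3ee'<99><95>hb#023#001)",
    "Apr 27 03:00:09 node02.mariadb mariadbd[1003]: granted: (2 #011seqno 3840793 "
    "#011wsrep (toi, exec, committed) cmd 0 45 #011OPTIMIZE TABLE `log_history_daily`)",
]

report = parse_bf_bf_conflict(sample)
for role in ("granted", "request"):
    entry = report[role]
    print(role, entry["seqno"], entry["state"], entry["sql"][:40])
```

Run against the excerpt, this shows the granted lock belonging to the TOI statement OPTIMIZE TABLE `log_history_daily` (seqno 3840793) and the requesting applier running the UPDATE on `user` (seqno 3840795).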

SHOW CREATE TABLE for the tables mentioned in the logs:

CREATE TABLE `user` (
  `id` char(36) COLLATE utf8mb4_unicode_ci NOT NULL,
  `environment_id` char(36) COLLATE utf8mb4_unicode_ci NOT NULL,
  `username` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
  `password` char(60) COLLATE utf8mb4_unicode_ci NOT NULL,
  `status` enum('sign_up','invited','active','archived') COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT 'invited',
  `token` char(64) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `token_valid_till` datetime DEFAULT NULL,
  `last_login` datetime DEFAULT NULL,
  `password_updated_at` datetime DEFAULT NULL,
  `created_at` timestamp NULL DEFAULT NULL,
  `updated_at` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `user_username_unique` (`username`),
  KEY `user_environment_id_index` (`environment_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
 
CREATE TABLE `log_history_daily` (
  `type` enum('unknown','user','developer','api_key','api_env','mail_token','device') COLLATE utf8mb4_unicode_ci NOT NULL,
  `status` enum('valid','invalid') COLLATE utf8mb4_unicode_ci NOT NULL,
  `origin` enum('unknown','browser','go','android','ios','third_party') COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT 'unknown',
  `ip` varchar(45) COLLATE utf8mb4_unicode_ci NOT NULL,
  `value` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
  `user_id` char(36) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `date` date NOT NULL,
  `count` int(10) unsigned NOT NULL,
  `created_at` timestamp NULL DEFAULT NULL,
  `updated_at` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`type`,`status`,`origin`,`ip`,`value`,`date`),
  KEY `log_history_daily_user_id_foreign` (`user_id`),
  CONSTRAINT `log_history_daily_user_id_foreign` FOREIGN KEY (`user_id`) REFERENCES `user` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

This happened on both our test environment (MariaDB 10.7.3) and our acceptance environment (MariaDB 10.6.7).



 Comments   
Comment by Karl Levik [ 2022-09-02 ]

I'm seeing the same thing.

I ran a couple of ALTER TABLE statements which succeeded on the node where I was running them, but they actually got stuck on the other two nodes in the cluster. I only saw this later when I did "SHOW PROCESSLIST;". wsrep_ready had gone to OFF, and the nodes stopped working.

The .err files on the two broken nodes contained messages including: "WSREP: MDL BF-BF conflict" whereas the good node had messages such as:

WSREP: MDL conflict db=name_of_database table=name_of_table ticket=3 solved by abort

Only through a "kill -9" on the mariadbd process and subsequently restarting, which triggered an SST, was I able to get the two broken nodes back to a working state.

Comment by Luke Cousins [ 2022-11-21 ]

We're having this problem roughly once per week. How can we get more debug information to share with you to help it get fixed? I think this is the same as MDEV-28180.

Comment by Uwe Beierlein [ 2022-12-13 ]

We have this problem as soon as we alter a table or add an index.

Comment by Kin [ 2023-11-10 ]

This issue is not related to resources.
In our case it was triggered by CREATE INDEX ON and INSERT INTO during a database migration script. Not particularly failsafe.
This results in "WSREP has not yet prepared node for application use". The nodes restart by themselves after a while; sometimes they recover, and sometimes they end up in an endless crash loop.

But it happens only when the cluster is running more than one node. I hope there will be a fix for this.
From a developer's point of view it isn't ideal to potentially crash the cluster and have to do a backup recovery every time.

Comment by Kin [ 2023-11-17 ]

Update:

We have given 1 CPU to each node. According to MariaDB we should be able to set wsrep_slave_threads to twice the number of CPUs, but that also resulted in MDL conflicts.

But when we set wsrep_slave_threads to 1, the issue is gone. We tried to run the same script several times and weren't able to reproduce the issue anymore.
See: https://mariadb.com/kb/en/about-galera-replication/ under "Galera Slave Threads".

It is not clear to me whether this is a CPU resource issue or some kind of race condition in the slave threads.

Generated at Thu Feb 08 10:00:51 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.