[MDEV-26883] InnoDB hang due to table lock conflict Created: 2021-10-22  Updated: 2021-10-22  Resolved: 2021-10-22

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6.0, 10.6.1, 10.6.2, 10.6.3, 10.6.4
Fix Version/s: 10.6.5

Type: Bug Priority: Blocker
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 1
Labels: regression-10.6, rr-profile-analyzed

Issue Links:
Problem/Incident
is caused by MDEV-24789 Performance regression after MDEV-24671 Closed

Description

In a stress test campaign of a 10.6-based branch by mleich, a deadlock between two InnoDB threads was observed, involving lock_sys.wait_mutex and a dict_table_t::lock_mutex. The cause of the hang is a latching order violation in lock_sys_t::cancel():

resolve_table_lock:
      dict_table_t *table= lock->un_member.tab_lock.table;
      table->lock_mutex_lock();

The correct latching order would be lock_sys.latch, dict_table_t::lock_mutex, lock_sys.wait_mutex. Because we are already holding lock_sys.wait_mutex here, we must invoke table->lock_mutex_trylock(). If that mutex is unavailable, we must release lock_sys.wait_mutex, acquire table->lock_mutex, and finally reacquire lock_sys.wait_mutex, just like we handle the lock_sys.latch order violation in the same function.
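
The try-lock and back-off pattern described above can be illustrated with a minimal, self-contained sketch. This is not the InnoDB code: it assumes plain std::mutex objects, and the names Table, wait_mutex, lock_table_holding_wait_mutex() and demo_conflict() are hypothetical stand-ins for dict_table_t, lock_sys.wait_mutex and the relevant part of lock_sys_t::cancel(). The demo has a second thread follow the correct order (table mutex first, then wait mutex) while the first thread already holds the wait mutex; with a blocking lock() in place of try_lock() the two threads would deadlock.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for dict_table_t::lock_mutex and lock_sys.wait_mutex.
struct Table { std::mutex lock_mutex; };
std::mutex wait_mutex;

// Precondition: the caller holds wait_mutex.
// Postcondition: the caller holds table.lock_mutex and wait_mutex.
// Returns true if wait_mutex had to be released and reacquired, in which
// case the caller must revalidate any state that wait_mutex protects.
bool lock_table_holding_wait_mutex(Table &table)
{
  if (table.lock_mutex.try_lock())
    return false;            // no conflict: the order violation was harmless
  wait_mutex.unlock();       // back off so that the other thread can proceed
  table.lock_mutex.lock();   // now acquire in the correct order
  wait_mutex.lock();
  return true;
}

// Forces the conflicting interleaving and reports whether the back-off ran.
bool demo_conflict()
{
  Table table;
  std::atomic<bool> holder_ready{false};

  wait_mutex.lock();
  std::thread holder([&] {
    // Correct latching order: table mutex first, then wait mutex.
    std::lock_guard<std::mutex> t(table.lock_mutex);
    holder_ready= true;
    std::lock_guard<std::mutex> w(wait_mutex); // blocks until the back-off
  });
  while (!holder_ready)
    std::this_thread::yield();

  bool released= lock_table_holding_wait_mutex(table);
  table.lock_mutex.unlock();
  wait_mutex.unlock();
  holder.join();
  return released;
}
```

Because the wait mutex is temporarily released on the slow path, anything it protects may have changed in the meantime; the sketch signals that through its return value instead of hiding it.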

This hang should mostly affect DDL operations, and possibly LOCK TABLES. During normal DML, there will be no table lock conflicts, because IX and IS locks are compatible with each other.
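
As an illustration of why plain DML does not conflict, the textbook compatibility matrix for intention and table-level locks can be written down as follows. This is a sketch with hypothetical names (Mode, lock_modes_compatible()), not InnoDB's own lock mode definitions, and it leaves out the AUTO-INC table lock.

```cpp
#include <cassert>

// Hypothetical names; InnoDB uses its own lock mode constants.
enum class Mode { IS, IX, S, X };

// Rows: mode already held on the table. Columns: mode being requested.
constexpr bool compatibility[4][4]= {
  /*           IS     IX     S      X   */
  /* IS */ {true,  true,  true,  false},
  /* IX */ {true,  true,  false, false},
  /* S  */ {true,  false, true,  false},
  /* X  */ {false, false, false, false},
};

constexpr bool lock_modes_compatible(Mode held, Mode requested)
{
  return compatibility[static_cast<int>(held)][static_cast<int>(requested)];
}
```

DML statements request only IS or IX table locks, so they never wait for each other at the table level. A conflict requires an S or X table lock, such as one taken by DDL or LOCK TABLES.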

The final symptom was the infamous watchdog message, such as this one (copied from another log):

2021-10-21 17:57:45 0 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/

The fix was validated by further stress testing, and no hangs were observed. The random query generator (RQG) grammar involved partitioned tables (each partition is handled as a separate InnoDB table) and some compression and encryption to slow down the buffer pool operations.

