In a stress test campaign of a 10.6-based branch by Matthias Leich, a deadlock between two InnoDB threads was observed, involving lock_sys.wait_mutex and a dict_table_t::lock_mutex. The cause of the hang is a latching order violation in lock_sys_t::cancel():
The correct latching order would be lock_sys.latch, dict_table_t::lock_mutex, lock_sys.wait_mutex. Because we are already holding lock_sys.wait_mutex here, we must invoke table->lock_mutex_trylock(). If that mutex is unavailable, we must first release lock_sys.wait_mutex, then acquire table->lock_mutex, and finally reacquire lock_sys.wait_mutex, just like we handle the lock_sys.latch order violation in the same function.
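The back-off described above can be sketched with standard C++ mutexes. This is only an illustration of the pattern, not the InnoDB code: Table, wait_mutex, lock_table_holding_wait_mutex() and demo_contended() are hypothetical stand-ins, and std::mutex replaces the InnoDB mutex types.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for the objects named in the text.
struct Table { std::mutex lock_mutex; };  // models dict_table_t::lock_mutex
std::mutex wait_mutex;                    // models lock_sys.wait_mutex

// Precondition: the caller holds wait_mutex.
// Postcondition: the caller holds table.lock_mutex and wait_mutex.
// Returns true if wait_mutex was released in between, in which case the
// caller must re-check any state that wait_mutex protects.
bool lock_table_holding_wait_mutex(Table &table)
{
  if (table.lock_mutex.try_lock())
    return false;           // never blocks, so the order violation is harmless
  wait_mutex.unlock();      // back off
  table.lock_mutex.lock();  // may block, but nothing ordered after it is held
  wait_mutex.lock();        // reacquire in the correct order
  return true;
}

// Contended case: another thread owns table.lock_mutex and then requests
// wait_mutex (the correct order). A blocking lock() in place of the
// try_lock() back-off would deadlock here.
bool demo_contended()
{
  Table table;
  std::atomic<bool> holder_ready{false};
  wait_mutex.lock();
  std::thread holder([&] {
    std::lock_guard<std::mutex> t(table.lock_mutex);
    holder_ready = true;
    std::lock_guard<std::mutex> w(wait_mutex);  // blocks until we back off
  });
  while (!holder_ready)
    std::this_thread::yield();
  bool backed_off = lock_table_holding_wait_mutex(table);
  assert(backed_off);  // the holder owned the table mutex, so try_lock failed
  table.lock_mutex.unlock();
  wait_mutex.unlock();
  holder.join();
  return backed_off;
}
```

Because wait_mutex is released on the slow path, anything it protects may have changed by the time the function returns; the return value lets the caller know that a re-check is needed.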
This hang should mostly affect DDL operations, and possibly LOCK TABLES. During normal DML, there will be no table lock conflicts, because IX and IS locks are compatible with each other.
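To illustrate why DML alone does not conflict on table locks, here is a minimal sketch of the textbook multi-granularity compatibility matrix for the IS, IX, S and X modes. The enum and function names are hypothetical, and the AUTO_INC mode is left out for brevity; it is assumed here that the conflicting requests come from statements that take S or X table locks.

```cpp
#include <cassert>

// Hypothetical names; only the four classic table lock modes are modelled.
enum Mode { IS, IX, S, X };

// compatible[held][requested]
constexpr bool compatible[4][4] = {
  /* IS */ {true,  true,  true,  false},
  /* IX */ {true,  true,  false, false},
  /* S  */ {true,  false, true,  false},
  /* X  */ {false, false, false, false},
};

inline bool is_compatible(Mode held, Mode requested)
{
  assert(held <= X && requested <= X);
  return compatible[held][requested];
}
```

With this matrix, is_compatible(IX, IS) and is_compatible(IX, IX) are both true, so concurrent DML never has to wait for a table lock; only an S or X request can be blocked and end up on the cancellation path.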
The final symptom was the infamous watchdog message like this (copied from another log):
The fix was validated by further stress testing, and no hangs were observed. The random query generator (RQG) grammar involved partitioned tables (each partition is handled as a separate InnoDB table) and some compression and encryption to slow down the buffer pool operations.