[MDEV-32899] InnoDB is holding shared dict_sys.latch while waiting for FOREIGN KEY child table lock on DDL
Created: 2023-11-28  Updated: 2024-02-08  Resolved: 2024-02-08

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6.5, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0, 11.1, 11.2, 11.3
Fix Version/s: 10.6.18, 10.11.8, 11.0.6, 11.1.5, 11.2.4, 11.3.3

Type: Bug
Priority: Critical
Reporter: Marko Mäkelä
Assignee: Marko Mäkelä
Resolution: Fixed
Votes: 0
Labels: foreign-keys, performance, regression

Issue Links:
Problem/Incident
is caused by MDEV-26217 Failing assertion: list.count > 0 in ... Closed
is caused by MDEV-26554 Table-rebuilding DDL on parent table ... Closed
Relates
relates to MDEV-33104 Assertion `table.get_ref_count() <= 1... Closed

Description

To fix the race conditions MDEV-26217 and MDEV-26554, code was added that makes InnoDB hold a shared dict_sys.latch while waiting for an exclusive lock on tables that are connected by FOREIGN KEY constraints. This is not acceptable, because a lock wait can last a long time (in the worst case indefinitely, if innodb_lock_wait_timeout=100000000). If another thread requests an exclusive dict_sys.latch in the meantime, that pending request will block all other threads from acquiring a shared dict_sys.latch until the table lock wait has been resolved.

This bug can be fixed by changing lock_table_for_trx() so that whenever the caller is holding a shared dict_sys.latch, the latch will be released and reacquired around the call to lock_wait(). In this way, the lock object will still be created or released while the table is protected by the shared dict_sys.latch. It is safe to temporarily release the dict_sys.latch, because tables on which lock objects exist cannot be evicted or dropped. In the callers, we have to take special care to ensure that dict_table_t::referenced_set is safe to traverse if dict_sys.latch was temporarily released.
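
The release-and-reacquire pattern described above can be sketched roughly as follows. This is only an illustration, not the actual InnoDB code: std::shared_mutex stands in for dict_sys.latch, and the names dict_latch, LockResult and wait_without_latch() are hypothetical. The callback plays the role of lock_wait(), and the lock object is assumed to have been enqueued earlier, while the latch was still held.

```cpp
#include <cassert>
#include <functional>
#include <shared_mutex>

// Hypothetical stand-in for dict_sys.latch; InnoDB uses its own latch type.
static std::shared_mutex dict_latch;

// Illustrative outcome of a table lock request.
enum class LockResult { Granted, Timeout };

// Run a potentially long lock wait without holding the shared latch.
// If the caller holds the shared latch, it is released before the wait and
// reacquired afterwards, so that a thread requesting the exclusive latch
// is not stuck behind the lock wait.
LockResult wait_without_latch(bool caller_holds_shared_latch,
                              const std::function<LockResult()> &wait)
{
  assert(wait && "a wait callback is required");
  if (caller_holds_shared_latch)
    dict_latch.unlock_shared();
  const LockResult result = wait(); // stands in for lock_wait()
  if (caller_holds_shared_latch)
    dict_latch.lock_shared();
  return result;
}
```

After such a call returns, the caller holds the shared latch again, but anything that was read under the latch before the wait (for example, an iterator into a set of constraints) may be stale and has to be looked up again.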

Comments
Comment by Marko Mäkelä [ 2024-01-19 ]

I reverted this due to the regression MDEV-33104.

Comment by Marko Mäkelä [ 2024-01-19 ]

To avoid reintroducing a bug like MDEV-33104, we must revise lock_table_children() so that it will successfully acquire MDL on each child table before waiting for an InnoDB table lock. The initial (reverted) version of this fix held a table reference while waiting for an InnoDB table lock. Concurrently, a DDL operation might want to drop or rebuild the table while holding MDL_EXCLUSIVE as well as an InnoDB table lock.

Comment by Marko Mäkelä [ 2024-01-19 ]

A metadata lock can be acquired by invoking dict_acquire_mdl_shared<false>() in lock_table_children() while holding shared dict_sys.latch. Because that function will temporarily release dict_sys.latch while waiting for MDL, we had better rescan table->referenced_set after each call, in case a constraint or a child table was dropped in the meantime. We will also have to keep track of the tables on which dict_acquire_mdl_shared<false>() was already invoked.
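
A rough sketch of the rescan idea, under simplifying assumptions: the referenced set is modelled as a set of child table names, and the callback stands in for dict_acquire_mdl_shared<false>(), which may let the set change while it waits. The names ReferencedSet and lock_children() are hypothetical and do not correspond to the actual implementation.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified model of table->referenced_set:
// one entry per child table name.
using ReferencedSet = std::set<std::string>;

// Acquire MDL on every child table. The traversal is restarted after each
// acquisition, because the callback may have released the latch and the
// set may have changed. Tables that were already attempted are remembered
// so that each one is processed at most once.
std::vector<std::string>
lock_children(const ReferencedSet &referenced,
              const std::function<bool(const std::string &)> &acquire_mdl)
{
  assert(acquire_mdl && "an MDL callback is required");
  std::set<std::string> attempted;
  std::vector<std::string> locked;
  bool rescan = true;
  while (rescan)
  {
    rescan = false;
    for (const std::string &child : referenced)
    {
      if (!attempted.insert(child).second)
        continue; // already handled in an earlier pass
      const std::string name = child; // copy: the entry may disappear
      if (acquire_mdl(name))
        locked.push_back(name);
      rescan = true;
      break; // iterators may be invalid now; start over
    }
  }
  return locked;
}
```

Restarting from the beginning after every acquisition is quadratic in the number of child tables in the worst case, but it avoids using an iterator that may have been invalidated while the latch was released.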

Comment by Matthias Leich [ 2024-01-25 ]

origin/10.6-MDEV-32899 c851e172ea043985fc8d3cec46368004a174892d 2024-01-23T17:10:37+02:00
performed well in RQG testing. No new bad effects.

Comment by Matthias Leich [ 2024-02-01 ]

origin/10.6-MDEV-32899 f50940ee0b81b9c963bd114c54788e515220bc7e 2024-02-01T15:48:46+02:00
performed well in RQG testing. No new problems.

Comment by Debarun Banerjee [ 2024-02-07 ]

https://github.com/MariaDB/server/pull/3021 looks good to me.

Generated at Thu Feb 08 10:34:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.