[MDEV-33104] Assertion `table.get_ref_count() <= 1' failed in dberr_t trx_t::drop_table Created: 2023-12-21 Updated: 2024-01-19 Resolved: 2024-01-19 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Locking, Storage Engine - InnoDB |
| Affects Version/s: | 10.6, 10.11, 11.0, 11.1, 11.2, 11.3 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Elena Stepanova | Assignee: | Marko Mäkelä |
| Resolution: | Done | Votes: | 1 |
| Labels: | foreign-keys, regression, rr-profile-analyzed | ||
| Issue Links: |
|
||||||||||||||||||||||||
| Description |
|
I've set versions tentatively to 10.6+, as that's where the failure has been recently observed in concurrent tests. Please adjust if needed.
I have an impression that the failure started happening not very long ago (but probably before Q4 2023 release). However, it's not impossible that its appearance was triggered not by code changes but by some test changes. Effect on non-debug builds: |
| Comments |
| Comment by Marko Mäkelä [ 2023-12-22 ] | ||||||||||||||||||||||
|
There is one unexpected table reference, being held by the another DDL operation that is waiting for exclusive InnoDB table locks:
The assertion fails in Thread 2, which is trying to drop the table simple_db.D after completing the rebuild of that table:
The wait for InnoDB table locks was added in | ||||||||||||||||||||||
| Comment by Elena Stepanova [ 2023-12-24 ] | ||||||||||||||||||||||
|
The failure didn't start happening after the above-mentioned bugs were closed in 10.6.5/10.6.8. As the description says, it started happening recently. I have now ruled out possible test modifications as a cause and am quite certain that it was triggered by code changes, most likely by
I've raised it tentatively to a blocker (as a regression since the latest release) at least until further analysis is done. After that, if it turns out to be harmless, please feel free to demote. | ||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-02 ] | ||||||||||||||||||||||
|
Before the fix of To prevent this failure scenario, I think that the caller of handler::commit_inplace_alter_table(commit=true) must hold MDL_EXCLUSIVE on all tables that have FOREIGN KEY…REFERENCES pointing to the table that is being altered, as well as on all tables that are pointed to by the current or altered REFERENCES clauses of the being-altered table. In that way, the lock wait would happen on a metadata lock (MDL) object, outside any InnoDB object that may have to be freed during the course of the ALTER TABLE. InnoDB locks are always connected to a dict_table_t object, which is roughly equivalent to TABLE_SHARE. Many ALTER TABLE operations will involve creating a new dict_table_t object. | ||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-12 ] | ||||||||||||||||||||||
|
To prevent this race condition between two concurrently committing ALTER TABLE that are modifying a parent and a child table, I think that at the start of handler::commit_inplace_alter_table(commit=true) a meta-data lock must be held in MDL_EXCLUSIVE mode on the immediate parent and child tables. Adding this metadata lock acquisition to handler::commit_inplace_alter_table() itself would have the benefit that we’d only do it when the table needs to be rebuilt or reloaded later on in that function. It is not clear to me how this MDL acquisition would be implemented and whether deadlocks would be detected by other means than lock_wait_timeout. | ||||||||||||||||||||||
| Comment by Aleksey Midenkov [ 2024-01-16 ] | ||||||||||||||||||||||
|
There is only this interface:
| ||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-16 ] | ||||||||||||||||||||||
|
As far as I can tell, a MDL_context would be available to InnoDB by calling thd_mdl_context(thd), which returns a pointer to thd->mdl_context. However, before invoking handler::commit_inplace_alter_table(), mysql_inplace_alter_table() would already have called wait_while_table_is_used() to acquire MDL_EXCLUSIVE on the name of the table that is being altered. Therefore, the condition There must be no granted locks in the context cannot be met. In fact, even if that call was removed, a metadata lock in some shared mode would still be granted in the current thd->mdl_context. | ||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-16 ] | ||||||||||||||||||||||
|
serg claimed in a call that there is a deadlock detector within the MDL subsystem. Since we are trying to prevent a conflict with another DDL operation (which would hold MDL_EXCLUSIVE on its being-ALTERed table), I think that the MDL_SHARED that InnoDB could acquire by invoking MDL_context::acquire_lock() should be sufficient. When it comes to MDL timeouts or deadlocks, we see that in fact, ha_innobase::commit_inplace_alter_table() is already invoking MDL_context::acquire_lock() via dict_acquire_mdl_shared<false>() on the InnoDB persistent statistics tables. I will work on a fix based on this. | ||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-19 ] | ||||||||||||||||||||||
|
Based on some debugging with rr, I believe that we can get the complete list of parent and child table names in ha_innobase::commit_inplace_alter_table() by traversing ctx0->old_table->referenced_set, ctx0->old_table->foreign_set, as well as ctx0->add_fk under a shared dict_sys.latch. We should be able to create MDL_ticket for all names, release the shared dict_sys.latch, and invoke MDL_context::acquire_lock() on each MDL_ticket. (Until MDEV-12483 is done, no FOREIGN KEY relationship can be defined on partitioned tables.) | ||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-19 ] | ||||||||||||||||||||||
|
The only existing regression test where my added code would capture more than 1 table name that is connected by a FOREIGN KEY constraint is for the following statement before the first server restart in the test innodb.innodb-index:
The ADD PRIMARY KEY implies a table rebuild, hence the table needs to be rebuilt. The captured names are t1, t3, and t4. Come to think of it, there could be an easier fix of this: modify lock_table_children() so that it instead of simply acquiring a reference on each child table, it would invoke MDL_context::acquire_lock() first. To fix this particular bug scenario (as well as possibly in general), we do not have to acquire any locks on any parent tables, that is, the table traversal in lock_table_children() should be sufficient. | ||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-19 ] | ||||||||||||||||||||||
|
|