[MDEV-22867] Assertion `instant.n_core_fields == n_core_fields' failed in dict_index_t::instant_add_field
| Created: 2020-06-10 | Updated: 2020-07-13 | Resolved: 2020-06-12 |
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Data Definition - Alter Table, Storage Engine - InnoDB |
| Affects Version/s: | 10.5 |
| Fix Version/s: | 10.5.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Elena Stepanova | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | regression |
| Description |
|
Note: The test case has an obvious race condition, so run it with --repeat=N. It currently fails nearly every time for me, both in memory and on disk, but the failure rate can vary between machines and builds.
No obvious immediate problem on a non-debug build.
|
| Comments |
| Comment by Marko Mäkelä [ 2020-06-11 ] |
|
This is a race condition between DELETE clearing the ‘instant ADD’ status of the table, and instant DROP COLUMN. During the DELETE, the DROP COLUMN is blocked by MDL, as it is supposed to be, because the operation is no longer ALGORITHM=INSTANT. An ALGORITHM=INSTANT operation would be continuously protected by MDL_EXCLUSIVE:
During this wait, the table is converted to plain format by not only the DELETE, but also a purge worker:
Based on the above, I created a deterministic test case, but it is only crashing in 10.5, not 10.4. It might be related to
The fix is simple (tested on 10.5):
I will push it to 10.4 after running ./mtr --big-test on both 10.4 and 10.5.
| Comment by Marko Mäkelä [ 2020-06-11 ] |
|
On second thought, we must not let the metadata and the persistent data file get out of sync. We must block this at the ultimate callers of dict_index_t::clear_instant_add(). The affected callers are btr_discard_only_page_on_level(), btr_cur_optimistic_delete_func(), and btr_cur_pessimistic_delete().
| Comment by Marko Mäkelä [ 2020-06-12 ] |
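The failure mode described above can be illustrated with a toy model. This is only a sketch and not InnoDB code: ToyIndex, ToyAlterSnapshot, the ddl_pending flag, and all functions below are hypothetical names invented for illustration, and the guard condition is an assumption rather than the actual fix. It shows how resetting n_core_fields while an ALTER TABLE holds an earlier view of the index makes a check analogous to the failing assertion false, and how skipping the reset in the caller keeps the two views consistent.

```cpp
#include <cassert>

// Toy model only: NOT the InnoDB data structures. All names are hypothetical.
struct ToyIndex {
    unsigned n_fields;       // fields in the current table definition
    unsigned n_core_fields;  // fields that existed before any instant ADD COLUMN
    bool ddl_pending;        // assumed flag: a non-instant ALTER is in progress

    bool is_instant() const { return n_core_fields != n_fields; }
};

// What an ALTER TABLE might remember about the index when it starts preparing.
struct ToyAlterSnapshot {
    unsigned n_core_fields;
};

// Marks the index as having a pending ALTER and records its current state.
ToyAlterSnapshot prepare_alter(ToyIndex& index) {
    index.ddl_pending = true;
    return ToyAlterSnapshot{index.n_core_fields};
}

// Unguarded variant: emptying the table always resets the instant ADD status.
void clear_instant_add_unguarded(ToyIndex& index) {
    index.n_core_fields = index.n_fields;
}

// Guarded variant: the caller skips the reset while an ALTER is pending.
// Returns true if the instant ADD status was cleared.
bool clear_instant_add_guarded(ToyIndex& index) {
    if (index.ddl_pending) return false;
    index.n_core_fields = index.n_fields;
    return true;
}

// Analogue of the failing assertion's condition at ALTER commit time.
bool snapshot_matches(const ToyIndex& index, const ToyAlterSnapshot& snap) {
    return index.n_core_fields == snap.n_core_fields;
}
```

In this model, calling clear_instant_add_unguarded() between prepare_alter() and the commit-time check makes snapshot_matches() return false, while clear_instant_add_guarded() leaves the index in instant format and the check continues to hold.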
|
I was able to simplify the test so that purge is not directly involved. However, for some reason it does not crash in 10.4 for me. It could still be safest to fix this bug in 10.4 as well.
Note: there is a slight race condition in the test at the COMMIT. I did not double-check that purge did not kick in at that point. But I do not think that it should be able to. We are holding the page latch when row_undo_mod_must_purge() returns true. If you run with innodb_force_recovery=2, then row_undo_mod_must_purge() will not hold, because the purge_sys.view will not be updated.
| Comment by Marko Mäkelä [ 2020-06-12 ] |
|
It seems that in 10.4, the function instant_metadata_lock() would prevent the race condition. But it would also hold a page latch on the leftmost leaf page of the clustered index for the duration of a possible DROP INDEX operation. That would probably violate the latching order and could cause InnoDB to hang if concurrent DML is executed during the ALTER TABLE…DROP COLUMN…, DROP INDEX operation. According to the latching order, a secondary index leaf page latch may be held while looking up something in a clustered index. The following should be what happens in 10.4:
So, 10.4 does not seem to be affected. Restoring instant_metadata_lock() could be a bad idea, because at some point, MDEV-16282 or a related task may implement ADD INDEX in combination with the ALGORITHM=NOCOPY variant of DROP COLUMN. I think that it is safest to apply my outlined fix to 10.5.