[MDEV-30638] Deadlock due to updating InnoDB statistics Created: 2023-02-10 Updated: 2023-03-02 Resolved: 2023-02-16 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.6.11, 10.7.7, 10.8.6, 10.9.4, 10.10.2, 10.11.1, 11.0.0, 10.6.12, 10.7.8, 10.8.7, 10.9.5, 10.10.3 |
| Fix Version/s: | 10.11.3, 11.0.1, 10.6.13, 10.7.8, 10.8.8, 10.9.6, 10.10.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | affects-tests, deadlock, hang, regression-10.6 | ||
| Issue Links: |
|
| Description |
|
This InnoDB deadlock came up while mleich was testing
One thread is waiting for an exclusive tablespace latch while holding several buffer page latches:
I see that before I also see that before
Because a pessimistic insert would start by acquiring an index latch, it would not be able to deadlock with this mini-transaction anymore. |
| Comments |
| Comment by Marko Mäkelä [ 2023-02-10 ] |
|
Sorry, it is not that simple. Acquiring a stronger index latch does not help with this deadlock. The INSERT operation is actually holding an index latch in a mode that would conflict with the statistics operation, if the statistics operation were holding a latch on the same index. However, the statistics can be updated for indexes in the same tablespace other than the one that the INSERT is operating on. To prevent this hang, one possible fix would be for dict_stats_update_transient_for_index() and possibly other functions (dict_index_t::clear(), btr_get_size_and_reserved(), dict_stats_save_defrag_stats(), dict_stats_analyze_index()) to acquire a latch on every index of the table, as row_quiesce_set_state() does, and only then proceed to acquire fil_space_t::latch. This approach would still fail to prevent a hang if a table is stored in the InnoDB system tablespace. |
| Comment by Marko Mäkelä [ 2023-02-11 ] |
|
The relevant latching order here should be as follows:
Currently, dict_stats_update_transient_for_index() and possibly other functions fail to acquire the index root page latch before the tablespace latch. Swapping the order of acquisition ought to fix this. In the core dump that I analyzed, dict_stats_update_transient_for_index() was holding a tablespace latch and waiting for the index root page latch. |
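The ordering rule in this comment can be illustrated with a small rank checker. This is only a sketch, not InnoDB code: the type `LatchOrderChecker`, the enum `LatchRank`, and the numeric ranks are hypothetical names invented for the illustration, and the only ordering assumed is the one stated in the comments here (index latch, then index root page latch, then tablespace latch).

```cpp
#include <cassert>
#include <vector>

// Hypothetical ranks for illustration only; a lower rank must be
// acquired before a higher one.
enum LatchRank {
  INDEX_LATCH = 1,
  INDEX_ROOT_PAGE_LATCH = 2,
  TABLESPACE_LATCH = 3
};

// Tracks the latches held by one (imaginary) mini-transaction and
// refuses any acquisition that would go backwards in rank.
class LatchOrderChecker {
  std::vector<LatchRank> held_;

public:
  // Returns true and records the latch if the order is respected;
  // returns false (and records nothing) on an order violation.
  bool acquire(LatchRank next) {
    if (!held_.empty() && held_.back() > next)
      return false;
    held_.push_back(next);
    return true;
  }

  // Releases the most recently acquired latch.
  void release() {
    assert(!held_.empty());
    held_.pop_back();
  }
};
```

In this model, acquiring `INDEX_ROOT_PAGE_LATCH` and then `TABLESPACE_LATCH` is accepted, while a holder of `TABLESPACE_LATCH` that then requests `INDEX_ROOT_PAGE_LATCH` is rejected, which is the pattern described for the core dump.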
| Comment by Marko Mäkelä [ 2023-02-13 ] |
|
It turns out that dict_stats_update_transient_for_index() and the other code that I examined are already following the correct latching order as described above. What looks incorrect is that we are only acquiring a shared tablespace latch. This wrongly allows similar code to be executed concurrently for other indexes in the same tablespace. In the core dump of a hang that I analyzed, we have a wait for some allocation metadata page inside fseg_n_reserved_pages(). If all threads that access the allocation metadata pages of a tablespace were holding a non-shared latch on the tablespace, they would conflict with each other at a higher level and these deadlocks would be impossible. |
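The difference between a shared and a non-shared tablespace latch can be shown with a toy, single-threaded model. This is a sketch under stated assumptions: `ModelLatch` and its member functions are hypothetical and are not InnoDB's latch implementation, and only two modes (shared and exclusive) are modeled.

```cpp
#include <cassert>

// Toy model of a latch with shared (S) and exclusive (X) modes.
// It only tracks who would be admitted; it does not block or
// synchronize real threads.
class ModelLatch {
  int readers_ = 0;     // number of current shared holders
  bool writer_ = false; // true while an exclusive holder exists

public:
  // A shared request is admitted unless an exclusive holder exists.
  bool try_s_lock() {
    if (writer_) return false;
    ++readers_;
    return true;
  }

  // An exclusive request is admitted only if nobody holds the latch.
  bool try_x_lock() {
    if (writer_ || readers_ > 0) return false;
    writer_ = true;
    return true;
  }

  void s_unlock() {
    assert(readers_ > 0);
    --readers_;
  }

  void x_unlock() {
    assert(writer_);
    writer_ = false;
  }
};
```

With shared mode, two holders are admitted at once and can go on to compete for the same lower-level pages. With exclusive mode, the second request is refused at the latch itself, so in this model the conflict is resolved before any page is touched.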
| Comment by Marko Mäkelä [ 2023-02-13 ] |
|
Ironically, this hang was introduced with the fix of
|
| Comment by Marko Mäkelä [ 2023-02-13 ] |
|
After this fix, the only invoker of mtr_t::s_lock_space() will be the function fseg_page_is_allocated() (renamed from fseg_page_is_free() in |