[MDEV-24258] Merge dict_sys.mutex into dict_sys.latch Created: 2020-11-20 Updated: 2023-07-22 Resolved: 2021-08-31 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Fix Version/s: | 10.6.5 |
| Type: | Task | Priority: | Major |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | performance | ||
| Attachments: |
|
| Issue Links: |
|
| Description |
|
The InnoDB data dictionary cache is protected by both dict_sys.latch (an RW-lock) and dict_sys.mutex. One reason for the redundant synchronization primitive would be eliminated by Another reason for keeping a separate mutex is that sometimes, the mutex provides a mechanism to ‘upgrade’ the rw-lock. The most prominent case of that would be removed by |
| Comments |
| Comment by Marko Mäkelä [ 2020-11-20 ] | |
|
MDEV-24258.patch | |
| Comment by Marko Mäkelä [ 2021-07-28 ] | |
|
I believe that this can be achieved in the 10.6 release series after all. It seems that all use of dict_sys.freeze() can be replaced with MDL. In most cases, the MDL will already have been acquired by the SQL layer. | |
| Comment by Marko Mäkelä [ 2021-07-30 ] | |
|
thiru, please review. | |
| Comment by Marko Mäkelä [ 2021-07-30 ] | |
|
Unfortunately, we got the following problem with rr record once (not without it):
In the trace that I analyzed, the first thread that started the wait would starve while later threads would acquire and release the latch. Nothing is really stuck, but either the scheduling is extremely unfair, or we may really have to think about making the lock waits fairer. For example, once some thread has set a wait start time, other threads that are trying to acquire dict_sys.latch would yield. Given this potential regression, this change is too risky to be included in the upcoming 10.6.4 release. We will need more testing. | |
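The yielding idea in the comment above could look roughly like the following. This is a minimal sketch and not the InnoDB implementation: the class `yielding_latch` and its members are hypothetical, `std::shared_mutex` stands in for dict_sys.latch, and a counter of blocked exclusive waiters is assumed in place of a recorded wait start time.

```cpp
#include <atomic>
#include <cassert>
#include <shared_mutex>
#include <thread>

// Hypothetical wrapper for illustration; this is not how dict_sys.latch
// is implemented in InnoDB.
class yielding_latch
{
  std::shared_mutex latch;
  // Assumption: a counter of blocked exclusive waiters stands in for the
  // "wait start time has been set" condition mentioned in the comment.
  std::atomic<unsigned> waiters{0};

public:
  // Shared acquisition: back off while an earlier thread is known to wait.
  // Not recursive: a thread that already holds the latch must not call this.
  void lock_shared()
  {
    while (waiters.load(std::memory_order_acquire))
      std::this_thread::yield();
    latch.lock_shared();
  }
  void unlock_shared() { latch.unlock_shared(); }

  // Exclusive acquisition: announce the wait only if we actually block.
  void lock()
  {
    if (latch.try_lock())
      return;                      // uncontended: no wait was needed
    waiters.fetch_add(1, std::memory_order_acq_rel);
    latch.lock();
    const unsigned prev= waiters.fetch_sub(1, std::memory_order_acq_rel);
    assert(prev > 0);
    (void) prev;                   // avoid an unused warning under NDEBUG
  }
  void unlock() { latch.unlock(); }

  bool has_waiters() const
  { return waiters.load(std::memory_order_acquire) != 0; }
};
```

Yielding only reduces the chance that latecomers overtake a waiting thread; it does not guarantee fairness, and it costs some throughput whenever a waiter exists.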
| Comment by Marko Mäkelä [ 2021-08-27 ] | |
|
During the testing of | |
| Comment by Marko Mäkelä [ 2021-08-31 ] | |
|
The PERFORMANCE_SCHEMA instrumentation for dict_sys_mutex was removed along with dict_sys.mutex. The dict_sys.latch will continue to be instrumented as dict_operation_lock. Because dict_sys.mutex will no longer 'throttle' the threads that purge InnoDB transaction history, a performance degradation may be observed unless innodb_purge_threads=1. The table cache eviction policy will become FIFO-like, because table lookup will be protected by a shared dict_sys.latch, instead of being protected by exclusive dict_sys.mutex. Note: Tables can never be evicted as long as locks exist on them or the tables are in use by some thread. | |
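To illustrate why lookups under a shared latch lead to FIFO-like eviction, here is a minimal sketch. All names (`table_cache`, `table_entry`, `pin`, `unpin`, `evict_one`) are hypothetical and do not correspond to InnoDB code. A reader holding only the shared latch cannot safely move an entry to the front of a list, so eviction order follows insertion order, and entries that are in use are skipped.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a cached table definition.
struct table_entry { unsigned n_ref= 0; };  // > 0: in use or locked

class table_cache
{
  mutable std::shared_mutex latch;           // plays the role of dict_sys.latch
  std::unordered_map<std::string, table_entry> tables;
  std::deque<std::string> order;             // insertion order, oldest first

public:
  void add(const std::string &name)
  {
    std::unique_lock<std::shared_mutex> x(latch);
    if (tables.emplace(name, table_entry{}).second)
      order.push_back(name);
  }

  // Shared latch only: the lookup must not modify `order`, so a hit
  // cannot refresh the entry's position as an LRU policy would.
  bool lookup(const std::string &name) const
  {
    std::shared_lock<std::shared_mutex> s(latch);
    return tables.count(name) != 0;
  }

  // Simplification: the reference count is changed under the exclusive latch.
  void pin(const std::string &name)
  {
    std::unique_lock<std::shared_mutex> x(latch);
    tables.at(name).n_ref++;
  }
  void unpin(const std::string &name)
  {
    std::unique_lock<std::shared_mutex> x(latch);
    table_entry &t= tables.at(name);
    assert(t.n_ref > 0);
    t.n_ref--;
  }

  // Evict the oldest entry that is not in use; returns its name,
  // or an empty string if nothing can be evicted.
  std::string evict_one()
  {
    std::unique_lock<std::shared_mutex> x(latch);
    for (auto it= order.begin(); it != order.end(); ++it)
    {
      if (tables.at(*it).n_ref)
        continue;                            // in use: never evicted
      std::string victim= *it;
      tables.erase(victim);
      order.erase(it);
      return victim;
    }
    return {};
  }
};
```

In this sketch a recently looked-up table is evicted as early as one that was never touched again, which is the trade-off for letting lookups run concurrently.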
| Comment by Marko Mäkelä [ 2021-09-02 ] | |
|
The off-CPU analysis graphs (http://www.brendangregg.com/offcpuanalysis.html) provided by axel showed that purge tasks would end up waiting much more elsewhere if their table lookups are no longer serialized by exclusive dict_sys.latch. Possibly this would lead to purge lag, which in turn would lead to degraded throughput. I managed to reproduce the phenomenon on my system today. Adding dummy synchronization fixed the regression for me. This is of course only a work-around, and a deeper investigation of the purge subsystem will be needed. | |
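The comment does not say what form the dummy synchronization took. One possible shape, offered purely as an assumption, is a mutex that protects no data and only serializes the table lookups of purge threads, while the lookup itself still uses the shared latch. Every name below (`dict_latch`, `purge_dummy_mutex`, `dict_add_table`, `purge_lookup_table`) is hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Hypothetical globals; std::shared_mutex stands in for dict_sys.latch.
static std::shared_mutex dict_latch;
static std::mutex purge_dummy_mutex;   // protects nothing; throttle only
static std::unordered_map<std::string, int> dict_tables;

// Register a table under the exclusive latch. IDs must be non-negative
// because -1 is used as the "not found" result below.
void dict_add_table(const std::string &name, int id)
{
  assert(id >= 0);
  std::unique_lock<std::shared_mutex> x(dict_latch);
  dict_tables[name]= id;
}

// Table lookup as performed by a purge thread in this sketch.
int purge_lookup_table(const std::string &name)
{
  // Dummy synchronization: only one purge thread at a time gets past
  // this point, imitating the serialization that a mutex used to impose.
  std::lock_guard<std::mutex> throttle(purge_dummy_mutex);
  // The data itself is still read under the shared latch, so lookups
  // from other (non-purge) threads are not blocked by the throttle.
  std::shared_lock<std::shared_mutex> s(dict_latch);
  auto it= dict_tables.find(name);
  return it == dict_tables.end() ? -1 : it->second;
}
```

The throttle deliberately slows purge down, so it only makes sense as a stopgap while the contention it hides is investigated.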
| Comment by Marko Mäkelä [ 2021-09-03 ] | |
|
I tested a more complex purge throttle, and it resulted in much worse overall throughput. The previous dummy synchronization did consistently help in my tests. I checked the graphs again, and I see that the purge coordinator is waiting for undo pages to be read into the buffer pool, which in turn can wait for a to-be-evicted page to be written. Purge workers are waiting for index page latches (theoretically conflicting with the workload, but maybe more likely just waiting for the page read). Before the removal of dict_sys.mutex, both these threads were spending 2/3 or 3/4 of their waiting time on dict_sys.mutex. I think that we must throttle purge based on buffer pool contention. That could be done in