[MDEV-22871] Contention on the buf_pool.page_hash Created: 2020-06-11 Updated: 2021-10-14 Resolved: 2020-06-18 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.5 |
| Fix Version/s: | 10.5.4 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | Scalability, performance | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
Based on earlier perf record analysis, we seem to have some spurious contention on concurrent S-latch requests. I believe that this is due to the spin loop in rw_lock_lock_word_decr() that rw_lock_s_lock() executes. Spurious S-latch conflicts could be addressed by implementing a bare-bones variant of rw-locks, using std::atomic::fetch_add() for S-latch acquisition and compare_exchange_strong() for X-latch acquisition. This variant would not need to support recursive latches at all. As far as I understand, only dict_index_t::lock, buf_block_t::lock, fil_space_t::latch, dict_sys.latch and purge_sys.latch may truly require additional ‘bells and whistles’, such as recursive X-locks or SX-locks, instrumentation, and event-based wakeup of waiting requests. Most other types of InnoDB rw-locks should be held for extremely short durations only, and we might best implement both S and X latch acquisition with simple spin loops (an inlined atomic operation, possibly followed by a subroutine call for the spin loop). |
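The description above can be illustrated with a minimal sketch. This is not the MariaDB implementation, only a hypothetical bare-bones non-recursive rw-lock as outlined: a single 32-bit atomic word whose top bit marks an exclusive holder, with fetch_add() on the S path and compare_exchange_strong() on the X path.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch of the bare-bones rw-lock described above;
// all names are illustrative, not the actual MariaDB code.
class rw_lock_sketch
{
  std::atomic<uint32_t> lock_word{0};
  static constexpr uint32_t WRITER= 1U << 31;
public:
  void read_lock()
  {
    // fetch_add() optimistically registers a shared holder.
    while (lock_word.fetch_add(1, std::memory_order_acquire) & WRITER)
    {
      // A writer is active: undo the registration and spin until it leaves.
      lock_word.fetch_sub(1, std::memory_order_relaxed);
      while (lock_word.load(std::memory_order_relaxed) & WRITER) {}
    }
  }
  void read_unlock() { lock_word.fetch_sub(1, std::memory_order_release); }
  void write_lock()
  {
    // compare_exchange_strong() from 0 claims exclusive ownership.
    uint32_t expected= 0;
    while (!lock_word.compare_exchange_strong(expected, WRITER,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed))
      expected= 0;
  }
  void write_unlock() { lock_word.store(0, std::memory_order_release); }
  bool is_locked() const
  { return lock_word.load(std::memory_order_relaxed) != 0; }
};
```

Because the lock state is one word, uncontended acquisition is a single atomic instruction, and no recursion, instrumentation or wait-event bookkeeping is carried.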
| Comments |
| Comment by Marko Mäkelä [ 2020-06-11 ] |
|
Rough idea for implementation:
The spin loops would be located in non-inlined functions in order to minimize the code footprint. |
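A hypothetical sketch of that code-footprint idea, assuming a GCC/Clang `noinline` attribute: the single-atomic fast path is inlined at every call site, and only the conflict case calls into an out-of-line spin loop. Names are illustrative.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch: inlined fast path, non-inlined spin slow path.
struct latch_sketch
{
  std::atomic<uint32_t> word{0};
  static constexpr uint32_t X= 1U << 31;

  // The spin loop is kept out of line (GCC/Clang attribute) so that
  // each call site only emits the one-atomic-op fast path.
  __attribute__((noinline)) void s_lock_spin()
  {
    for (;;)
    {
      while (word.load(std::memory_order_relaxed) & X) {}
      if (!(word.fetch_add(1, std::memory_order_acquire) & X))
        return;
      word.fetch_sub(1, std::memory_order_relaxed);
    }
  }

  // Inlined fast path: one atomic operation; fall back on conflict.
  void s_lock()
  {
    if (word.fetch_add(1, std::memory_order_acquire) & X)
    {
      word.fetch_sub(1, std::memory_order_relaxed);
      s_lock_spin();
    }
  }
  void s_unlock() { word.fetch_sub(1, std::memory_order_release); }
};
```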
| Comment by Marko Mäkelä [ 2020-06-12 ] |
|
My prototype passed mleich's stress testing, but according to krunalbauskar it did not yet improve performance when the number of concurrently executing threads exceeds the number of processor cores. He also reported a possible writer_lock() related shutdown hang. I have some further ideas to address the problems:
Because buf_pool.page_hash is currently the sole user of rw_lock, such tight coupling is possible. Extensive benchmarking with various experiments will be needed to move forward. |
| Comment by Marko Mäkelä [ 2020-06-13 ] |
|
More observations and ideas:
|
| Comment by Marko Mäkelä [ 2020-06-14 ] |
|
It turns out that buf_pool_t::watch_set() does not hold buf_pool.mutex while acquiring an exclusive latch on buf_pool.page_hash. Therefore, we cannot use buf_pool.mutex to make read_lock() operations wait for write_unlock(). I pushed a revised version that implements most of my ideas. The spin loops can be tuned with innodb_sync_spin_loops and innodb_spin_wait_delay. This revision seems to reduce read/write performance further; deeper analysis will be needed to identify the remaining bottlenecks. If the problem is memory bus contention, I hope that eliminating page_hash_latch::pad and storing page_hash_latch interleaved with the hash table elements would address it. |
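The tunable spin loop mentioned above can be sketched roughly as follows. This is a hypothetical illustration of what innodb_sync_spin_loops (spin rounds) and innodb_spin_wait_delay (delay between polls) control, not the actual server code; the variable and function names are invented.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Illustrative counterparts of the tunables (names hypothetical):
static unsigned spin_rounds= 30;  // cf. innodb_sync_spin_loops
static unsigned spin_delay= 6;    // cf. innodb_spin_wait_delay

// Burn a few cycles between polls; a real implementation would
// typically execute a PAUSE-like instruction here.
inline void spin_pause(unsigned delay)
{
  for (unsigned i= delay; i--; )
    ;
}

// Wait until *flag becomes 0: spin a bounded number of rounds with a
// short delay between polls, then yield to the scheduler and retry.
void spin_wait(const std::atomic<uint32_t> *flag)
{
  for (;;)
  {
    for (unsigned r= 0; r < spin_rounds; r++)
    {
      if (!flag->load(std::memory_order_acquire))
        return;
      spin_pause(spin_delay);
    }
    std::this_thread::yield();
  }
}
```

Raising the round count trades CPU burn for lower wakeup latency under short-held latches; lowering it hands contended waits to the scheduler sooner.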
| Comment by Marko Mäkelä [ 2020-06-16 ] |
|
To prepare for the elimination of page_hash_latch::pad, I simplified the InnoDB hash_table_t. It turned out that there were many unused data fields and unnecessarily allocated memory, such as page_hash_t::heaps. The reduced memory footprint did not lead to a notable performance improvement in my test. The final step would be to move page_hash_latch into buf_pool.page_hash.array. |
| Comment by Marko Mäkelä [ 2020-06-17 ] |
|
It seems that dedicating every 1024th element in buf_pool.page_hash.array to the page_hash_latch (which replaces the fat rw_lock_t) may have addressed the problem. Some benchmarks are still in progress. |
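The every-1024th-element layout can be sketched as follows. This is a hypothetical illustration (names invented, not buf_pool code): latch words live inside the hash array itself, one per 1024-slot segment, so a cell and its latch share cache locality without dedicated padded latch objects.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: every 1024th slot of the hash array is a latch
// word rather than a hash-chain slot.
constexpr size_t ELEMENTS_PER_LATCH= 1024;

struct hash_array_sketch
{
  // In a real table the non-latch slots would hold hash chain heads;
  // uintptr_t stands in for both payload and lock word here.
  std::atomic<uintptr_t> array[4 * ELEMENTS_PER_LATCH];

  // Map a cell index to the latch slot guarding its segment: the
  // first slot of each 1024-slot block serves as the lock word.
  std::atomic<uintptr_t> *lock_get(size_t cell)
  {
    return &array[cell / ELEMENTS_PER_LATCH * ELEMENTS_PER_LATCH];
  }
};
```

Cells 0..1023 share one latch, cells 1024..2047 the next, and so on, which bounds the latch count while keeping each latch adjacent to the data it protects.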
| Comment by Matthias Leich [ 2020-06-17 ] |
|
The tree |
| Comment by Marko Mäkelä [ 2020-06-18 ] |
|
It seems that the crucial step was to store a std::atomic<uint32_t> based rw-lock directly in buf_pool.page_hash.array, interleaved with the hash table payload. I examined some perf record traces, and buf_pool.page_hash is not an obvious bottleneck at all. In a read-only test that I conducted myself, I did not see any waiting at all when running 50 threads against a server on a 10-core processor with 2×hyperthreading. With a smaller buffer pool and lots of page eviction going on, contention cannot be avoided, but even then it seems to be somewhere else (such as buf_pool.mutex). Thanks to krunalbauskar and axel for the performance testing, and mleich for the stress testing. |