[MDEV-21212] buf_page_get_gen -> buf_pool->stat.n_page_gets++ is a cpu waste (0.5-1%) Created: 2019-12-04 Updated: 2022-05-16 Resolved: 2021-03-12 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 5.5, 10.5 |
| Fix Version/s: | 10.6.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Daniel Black | Assignee: | Eugene Kosov (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | affects-tests, performance | ||
| Attachments: |
|
| Issue Links: |
|
| Description |
|
Even on a 20-CPU Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz, this single increment shows up significantly in a performance profile of a TPC-C test (10.5.0-523879dd51c3bf2ad23295852304531d856cc869):
as the second-highest CPU-consuming function, the increment accounts for ~16% of the profiled time.
I recommend removing this statistics counter. I am not sure that a DBA could make any decision based on it. |
| Comments |
| Comment by Marko Mäkelä [ 2019-12-04 ] |
|
Thank you, danblack! The counter is being exposed via the following interfaces:
It turns out that some of the buffer_pool_ counters are being updated via a data structure that was aimed for scalability, srv_stats, mostly using srv_stats_t::ulint_ctr_64_t that is partitioned for the purpose of improving scalability. Looking at srv_export_innodb_status(), we have references to both a local buf_pool_stat_t stat (9 counters) and a global srv_stats_t srv_stats (47 counters). I tend to agree that it could make sense to remove this particular counter and possibly some other buffer_pool_ counters or members of buf_pool_stat_t. I would appreciate it if you could make some experiments. One thing worth trying could be to move this counter to srv_stats_t. It might improve scalability, but it could also make things worse, because each srv_stats_t::ulint_ctr_64_t occupies 64 cache lines. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Valerii Kravchuk [ 2019-12-04 ] |
|
It seems that both community users and our optimizer team have cared about innodb_buffer_pool_read_requests at some stage. I can imagine an optimizer cost model that takes "logical page reads" into account, not just the number of rows to read. I also suspect that some existing monitoring tools (such as PMM or MONyog) track this value and present it on their graphs. |
| Comment by Daniel Black [ 2019-12-05 ] |
|
Experiment 1: using ib_counter_t for n_page_gets: https://github.com/grooverdan/mariadb-server/commit/761cd643568eec2a595a3ae28dc09ec9cca8e72c
The addition still shows significant contention, and interestingly, this is a relaxed store that is still emitted with a `lock` prefix. Looking at L1-dcache-load-misses, as opposed to frequency-based samples:
I can't tell whether the 18% is a minor off-by-one-instruction accounting error or an actual cache miss on the next fetch. With 650 active connections yielding 1098651.18 tpm (compared to 1109887.63 tpm before the change), it is quite possible that rdtsc returns the same value on multiple cores/threads. thread_local might be an option; however, it is a C++11 feature that is currently only used in TokuDB. valerii, I welcome any evidence showing that the counter is useful; I haven't looked at the optimizer. I assume a monitoring tool can work without crashing if a value is removed, but yes, it is a consideration and should be mentioned in the release notes. |
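On the `lock` prefix observation above: relaxed memory ordering weakens ordering guarantees, not atomicity, so on x86 a relaxed fetch-and-add still compiles to a `lock xadd` and still bounces the counter's cache line between cores. A small illustration, assuming nothing beyond standard C++11 atomics:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// A single shared counter: std::memory_order_relaxed removes ordering
// constraints with respect to other memory operations, but the
// read-modify-write itself stays atomic (and lock-prefixed on x86).
// No increments are ever lost, which is exactly why this one cache
// line dominates the profile under high concurrency.
std::atomic<uint64_t> shared_counter{0};

uint64_t hammer_counter(unsigned n_threads, uint64_t n_incs)
{
  shared_counter.store(0, std::memory_order_relaxed);
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < n_threads; t++)
    threads.emplace_back([n_incs] {
      for (uint64_t i = 0; i < n_incs; i++)
        // Relaxed ordering, but still an atomic RMW on one cache line.
        shared_counter.fetch_add(1, std::memory_order_relaxed);
    });
  for (std::thread &t : threads)
    t.join();
  return shared_counter.load(std::memory_order_relaxed);
}
```

If the increments could be dropped or merged, the result would fall short of the expected total; the exact count confirms that "relaxed" does not mean "non-atomic".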
| Comment by Marko Mäkelä [ 2020-05-06 ] |
|
We actually do use C++11 in MariaDB Server 10.4 and later. In 10.5, we use even more features, including thread_local. If I remember correctly, wlad also mentioned in the past that the counter could be useful. Myself, I am always happy to remove unnecessary code.
danblack, did you instantiate ib_counter_t with 64 slots? I wonder if it might make sense to reorganize srv_stats so that instead of containing a number of fields one cache line apart from each other, we would basically have a number of cache-line-aligned copies of srv_stats. That way, the total memory footprint should be smaller (less memory wasted on cache-line alignment), and if one CPU is updating many counters, those counters could reside in the same cache line, which could reduce cache-line traffic. But it could also lead to false sharing if get_rnd_value() leads multiple threads to update the same copy of srv_stats.
One more raw idea is to have a thread_local srv_stats, possibly using std::atomic, and then have the reader traverse all the thread_local copies for reporting. The reporting should be rather rare compared to the updates. |
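The "one more raw idea" could look roughly like the following sketch: per-thread slots behind thread_local, a mutex-protected registry that the rare reader traverses, and relaxed atomics so the reader never races a writer. All names are hypothetical, and the sketch assumes a single counter instance, since the thread_local pointer lives in the member function rather than in the object:

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical thread_local counter: each thread increments its own
// slot without any atomic read-modify-write; registration of a new
// thread's slot is the only mutex-protected operation.
class tls_counter
{
  struct slot { std::atomic<uint64_t> value{0}; };

  std::mutex registry_mutex;
  std::vector<slot*> registry;

  // NOTE: the thread_local is per function, not per object, so this
  // sketch only supports one tls_counter instance. Slots are leaked at
  // thread exit to keep the sketch short.
  slot *my_slot()
  {
    thread_local slot *s = nullptr;
    if (!s)
    {
      s = new slot;
      std::lock_guard<std::mutex> g(registry_mutex);
      registry.push_back(s);
    }
    return s;
  }

public:
  // Fast path: a single-writer load+store; relaxed atomics keep the
  // concurrent reader free of data races without a lock prefix.
  void inc()
  {
    slot *s = my_slot();
    s->value.store(s->value.load(std::memory_order_relaxed) + 1,
                   std::memory_order_relaxed);
  }

  // Slow path: SHOW STATUS style aggregation over all threads' slots.
  uint64_t read()
  {
    std::lock_guard<std::mutex> g(registry_mutex);
    uint64_t sum = 0;
    for (slot *s : registry)
      sum += s->value.load(std::memory_order_relaxed);
    return sum;
  }
};
```

The design point is that the hot path touches only a thread-private cache line; the cost is moved to the rare reader, which must walk every registered slot.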
| Comment by Daniel Black [ 2020-05-08 ] |
|
Looking at the logs, the 64-slot ib_counter_t was probably Sunny's work. It seems to have been copied into MyRocks too. I'm all for reducing the space wasted on cache-line alignment, so sure: either multiple copies or thread-local.
If multiple copies, could `thd_[s|g]et_ha_data` be used to maintain a fixed per-THD index into the copies, rather than `get_rnd_value`, which on non-x86 grossly falls back to `os_thread_get_curr_id`? A `THD->query_id % X` could round-robin the allocation; optionally, an extended `srv_stats` could carry an `inuse_count` to avoid conflicts for a limited number of tries, if it can be tied to a THD.
Are global/session THD statistics something easily extended? Maybe the variety of storage engines installed makes this less feasible. Are engine session statistics useful on their own, rather than merely as a deconfliction strategy?
I've been assuming that `srv_stats` is largely incremented by SQL THDs rather than by InnoDB reader/writer threads, but if there is a difference, the split needs to be considered. |
| Comment by Marko Mäkelä [ 2020-05-13 ] |
|
I see that in MySQL 8.0.20, yet another sharding mechanism was introduced, instead of using the pre-existing
danblack, I would love to see how a C++11 thread_local replacement of ib_counter_t and friends would work. In 10.5, the keyword thread_local is already being used. |
| Comment by Marko Mäkelä [ 2020-06-08 ] |
|
I can pretty much confirm this on a 64-client read-only workload (on a machine with 2 NUMA nodes and 2×16 threads). The perf record output highlights the following in buf_page_get_low() (which buf_page_get_gen() was renamed to in
An even worse hog was highlighted in ha_innobase::index_read():
This is one of the two srv_stats calls:
For comparison, the CPU time that buf_page_get_low() is spending in the following instructions is justified:
Those instructions are part of
which could be alleviated by increasing srv_n_page_hash_locks from its default value of 16. |
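For context, the page-hash latch sharding that srv_n_page_hash_locks controls can be sketched as follows (the fold function and all names are simplified stand-ins, not InnoDB's actual hashing):

```cpp
#include <cstddef>
#include <cstdint>
#include <shared_mutex>

// Sketch of sharded page-hash latching: instead of one latch for the
// whole buf_pool.page_hash, an array of rw-locks is used, and a page's
// latch is chosen by hashing its (space_id, page_no). Raising the shard
// count spreads lookup contention over more cache lines.
struct page_hash_locks
{
  static constexpr size_t N_LOCKS = 16;   // default srv_n_page_hash_locks

  std::shared_mutex locks[N_LOCKS];

  // Simplified stand-in for InnoDB's page-id fold function
  // (Knuth-style multiplicative hash, then modulo the shard count).
  static size_t fold(uint32_t space_id, uint32_t page_no)
  {
    return (size_t(space_id) * 2654435761u + page_no) % N_LOCKS;
  }

  std::shared_mutex &lock_for(uint32_t space_id, uint32_t page_no)
  {
    return locks[fold(space_id, page_no)];
  }
};
```

A lookup takes `lock_for(space_id, page_no)` in shared mode; only pages that hash to the same shard contend with each other, which is why a larger shard count can relieve the hot spot noted above.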
| Comment by Roel Van de Paar [ 2020-12-02 ] |
|
Setting priority to Critical, as many of the MONITOR_*-related affects-tests bugs seen during TSAN testing will be fixed as part of this bug. |
| Comment by Marko Mäkelä [ 2020-12-04 ] |
|
I think that we should minimize the use of srv_stats and try to use normal counters in those cases where a mutex-protected critical section already exists. All use of srv_stats must be carefully reviewed, and anything that can be replaced or removed should be removed. That data structure uses 1 or 64 cache lines per variable, which is quite heavy on the memory bus. |
| Comment by Krunal Bauskar [ 2021-03-06 ] |
|
Just to keep everyone posted: we tried the distributed-counter patch against 10.6, and here are the numbers from the 4-NUMA machine that had reported heavy contention (columns: baseline, patched (counter), %). |
| Comment by Krunal Bauskar [ 2021-03-08 ] |
|
There was a suggestion (from Marko) to evaluate this patch against 10.5 (columns: 10.5 baseline, patched (10.5), %).
Unfortunately, the patch shows a regression on 10.5 (for point-select only), so I would not recommend it for 10.5, and naturally not for older versions.
The same patch was re-evaluated with 10.6 at the head revision (#b930a190) on a 4-NUMA (ARM) machine. The patch continues to show improvements for point-select and read-only, in the range of 5-7% for point-select and 18-27% for read-only. Given this, the patch is recommended for 10.6 onwards. |
| Comment by Krunal Bauskar [ 2021-03-08 ] |
|
As requested, x86 results (2-NUMA x86). The x86 machine has 2 NUMA nodes, and as observed even with 2-NUMA ARM, it does not show a major improvement, but no regression either (in fact, a 2% improvement). |
| Comment by Krunal Bauskar [ 2021-03-11 ] |
|
We re-evaluated the latest 10.6 with the PR#1766 approach and the TLS approach. The attached graph has the details. |
| Comment by Krunal Bauskar [ 2021-03-12 ] |
|
Given the multi-NUMA regression of the TLS patch, we decided to study TLS on a single NUMA node; check the attached graph. On a single node the performance drop is much smaller, which confirms that the TLS approach in general has some overhead on multi-NUMA systems. (A wild guess: cross-node movement of the 4 KiB TLS pages.) |
| Comment by Marko Mäkelä [ 2021-03-12 ] |
|
Thank you, krunalbauskar! The simple change of PR#1766 will be in 10.6. Most notably, it will double the size of srv_stats. There are some unused counters in srv_stats and some duplicated ones (the various srv_stats.n_system_rows_ counters duplicating srv_stats.n_rows_). I am going to fix those separately, to reduce the size of srv_stats. |
| Comment by Marko Mäkelä [ 2022-05-11 ] |
|
I was thinking about this once more. We could distribute the counter among buf_pool.page_hash.array, one counter per cache line. Especially on ARMv8, we are currently wasting half of those 128-byte cache lines, ever since
To assess the feasibility of this, I tested the impact of removing buf_pool.stat.n_page_gets altogether. In a quick oltp_read_only benchmark on my system (dual Intel Xeon E5-2630 v4), the difference appeared to be smaller than the noise margin. In fact, most of the time, the code was slower after I removed the counter. krunalbauskar, could you test this again? |
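The distribution idea could look roughly like this sketch, where the otherwise wasted tail of each cache-line-sized page_hash cell holds a shard of n_page_gets, so a lookup that already owns the cell's cache line can bump the shard essentially for free. The cell layout and all names are illustrative assumptions, not the actual buf_pool code:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cache-line-sized page_hash cell: the bucket pointer and
// a counter shard share one line, so a lookup that touches the bucket
// has already paid for the line that the increment writes to.
struct alignas(64) hash_cell
{
  void *hash_chain = nullptr;              // stand-in for the hash bucket
  std::atomic<uint64_t> n_page_gets{0};    // counter shard, same line
};

struct page_hash_sketch
{
  std::vector<hash_cell> cells;

  explicit page_hash_sketch(size_t n) : cells(n) {}

  // A page lookup visits one cell anyway; incrementing the shard there
  // adds no extra cache-line traffic beyond the lookup itself.
  void record_get(size_t fold)
  {
    cells[fold % cells.size()].n_page_gets
        .fetch_add(1, std::memory_order_relaxed);
  }

  // Rare reader: sum the shards of every cell.
  uint64_t n_page_gets() const
  {
    uint64_t sum = 0;
    for (const hash_cell &c : cells)
      sum += c.n_page_gets.load(std::memory_order_relaxed);
    return sum;
  }
};
```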
| Comment by Marko Mäkelä [ 2022-05-16 ] |
|
For the record: both krunalbauskar and I observed a slight regression after removing buf_pool.stat.n_page_gets. In other words, it appears to serve as a useful ‘throttle’ that prevents contention elsewhere. I think that it is still worth an experiment to distribute this counter in buf_pool.page_hash.array, but I am afraid that it could face a similar fate. I filed |