[MDEV-21212] buf_page_get_gen -> buf_pool->stat.n_page_gets++ is a cpu waste (0.5-1%) Created: 2019-12-04  Updated: 2022-05-16  Resolved: 2021-03-12

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 5.5, 10.5
Fix Version/s: 10.6.0

Type: Bug Priority: Critical
Reporter: Daniel Black Assignee: Eugene Kosov (Inactive)
Resolution: Fixed Votes: 0
Labels: affects-tests, performance

Attachments: PNG File baseline (#64f89b1f), pr#1766 and tls (#1654bfee4) (single numa).png     PNG File baseline (#f386fdd7), pr#1766 and tls (#1654bfee4).png     PNG File baseline (#f386fdd7), pr#7166 and tls (#1654bfee4) (single numa).png     File my.cnf    
Issue Links:
Blocks
blocks MDEV-24329 TSAN: Data race in srv_mon_default_on... Open
blocks MDEV-24330 Data race in buf_LRU_get_free_block a... Open
blocks MDEV-24332 TSAN: Data race in os_file_pread from... Open
Relates
relates to MDEV-142 Check if Innodb_buffer_pool_read_requ... Open
relates to MDEV-8932 innodb buffer pool hit rate is less t... Closed
relates to MDEV-28537 Unused or useless InnoDB counters num... Closed
relates to MDEV-28539 Some InnoDB counters are duplicating ... Closed
relates to MDEV-28540 Deprecate and ignore the parameter in... Closed
relates to MDEV-28542 Useless INSERT BUFFER AND ADAPTIVE HA... Closed
relates to MDEV-15058 Remove multiple InnoDB buffer pool in... Closed
relates to MDEV-23249 __builtin_readcyclecounter in include... Closed
relates to MDEV-24328 Data race in buf_page_get_low at buf/... Closed
relates to MDEV-25281 Switch to use non-atomic (vs atomic) ... Closed

 Description   

Even on a 20 cpu Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz this single increment shows up significantly in a performance profile:

In a TPC-C test (10.5.0-523879dd51c3bf2ad23295852304531d856cc869)

perf report -g --no-children -i mariadb-10.5.0-event_warmup.g.perf

+    5.30%  mysqld   mysqld               [.] rec_get_offsets_func                                                                                                                                                                                                                                                 ▒
-    4.36%  mysqld   mysqld               [.] buf_page_get_gen                                                                                                                                                                                                                                                     ▒
   + 2.23% 0                                                                                                                                                                                                                                                                                                       ▒
     0.82% buf_page_get_gen                                                                                                                                                                                                                                                                                        ▒
   - 0.79% 0x1                                                                                                                                                                                                                                                                                                     ▒
      - 0.75% ha_innobase::records_in_range                                                                                                                                                                                                                                                                        ▒
           buf_page_get_gen                                                                                                                                                                                                                                                                                        ▒
+    3.46%  mysqld   mysqld               [.] l_find                                                                                                                                                                                                                                                               ▒
+    3.18%  mysqld   mysqld               [.] MYSQLparse                                                                                                                                                                                                                                                           ▒
+    2.68%  mysqld   mysqld               [.] cmp_dtuple_rec_with_match_low                                                                                                                                                                                                                                        ▒
+    2.41%  mysqld   mysqld               [.] page_cur_search_with_match           

As the second highest CPU consuming function, the increment shows up in ~16% of that time.

annotate on bug_page_get_gen

       │      _Z16buf_page_get_gen9page_id_tmmP11buf_block_tmPKcjP5mtr_tP7dberr_tb():                                                                                                                                                                                                                              ▒
       │      ↓ je     c311db <buf_page_get_gen(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned int, mtr_t*, dberr_t*, bool)+0xbb>                                                                                                                                     ▒
       │                      *err = DB_SUCCESS;                                                                                                                                                                                                                                                                   ▒
       │        movl   $0x0,(%rsi)                                                                                                                                                                                                                                                                                 ▒
       │      #endif /* UNIV_DEBUG */                                                                                                                                                                                                                                                                              ▒
       │                                                                                                                                                                                                                                                                                                           ▒
       │              ut_ad(!mtr || !ibuf_inside(mtr)                                                                                                                                                                                                                                                              ▒
       │                    || ibuf_page_low(page_id, zip_size, FALSE, file, line, NULL));                                                                                                                                                                                                                         ▒
       │                                                                                                                                                                                                                                                                                                           ▒
       │              buf_pool->stat.n_page_gets++;                                                                                                                                                                                                                                                                ▒
  1.08 │  bb:   mov    0x10(%rsp),%rbx                                                                                                                                                                                                                                                                             ▒
       │        add    0x60(%rsp),%ecx                                                                                                                                                                                                                                                                             ▒
       │      ut_hash_ulint():                                                                                                                                                                                                                                                                                     ▒
       │              ulint    table_size)    /*!< in: hash table size */                                                                                                                                                                                                                                          ▒
       │      {                                                                                                                                                                                                                                                                                                    ▒
       │              ut_ad(table_size);                                                                                                                                                                                                                                                                           ▒
       │              key = key ^ UT_HASH_RANDOM_MASK2;                                                                                                                                                                                                                                                            ▒
       │                                                                                                                                                                                                                                                                                                           ▒
       │              return(key % table_size);                                                                                                                                                                                                                                                                    ▒
       │        xor    %edx,%edx                                                                                                                                                                                                                                                                                   ▒
       │      _Z16buf_page_get_gen9page_id_tmmP11buf_block_tmPKcjP5mtr_tP7dberr_tb():                                                                                                                                                                                                                              ▒
       │              ulint           retries = 0;                                                                                                                                                                                                                                                                 ▒
       │        movq   $0x0,0x38(%rsp)                                                                                                                                                                                                                                                                             ▒
       │              buf_pool->stat.n_page_gets++;                                                                                                                                                                                                                                                                ▒
 16.52 │        addq   $0x1,0x188(%rbx)                                                                                                                                                                                                                                                                            ▒
       │      buf_page_hash_lock_get():                                                                                                                                                                                                                                                                            ▒
       │      /** Get appropriate page_hash_lock. */                                                                                                                                                                                                                                                               ▒
       │      UNIV_INLINE                                                                                                                                                                                                                                                                                          ▒
       │      rw_lock_t*                                                                                                                                                                                                                                                                                           ▒
       │      buf_page_hash_lock_get(const buf_pool_t* buf_pool, const page_id_t& page_id)                                                                                                                                                                                                                         ▒
       │      {                                                                                                                                                                                                                                                                                                    ▒
       │              return hash_get_lock(buf_pool->page_hash, page_id.fold());                                                                                                                                                                                                                                   ▒
  0.36 │        mov    0xb0(%rbx),%rsi                                                                                                                                                                                                                                                                             ▒
       │      ut_hash_ulint():                                                                                                                                                                                                                                                                                     ▒
       │              key = key ^ UT_HASH_RANDOM_MASK2;                                                                                                                                                                                                                                                            ▒
       │        mov    %ecx,%eax                                                                                                                                                                                                                                                                                   ▒
       │        movzbl 0x18(%rsp),%ecx                                                                                                                                                                                                                                                                             ▒
       │        xor    $0x62946a4f,%eax                                                                                                                                                                                                                                                                            ▒
       │        mov    %rax,0x8(%rsp)                                                                                                                                                                                                                                                                              ▒
       │              return(key % table_size);                                                                                                                                                                                                                                                                    ▒
  6.28 │        divq   0x8(%rsi)                                                                                                                                                                                                                                                                                   ▒
       │      _Z17ut_2pow_remainderImET_S0_S0_():                                                                                                                                                                                                                                                                  ▒
       │      /*******************************************

Recommend removing statistic counter. I'm not sure a DBA could make a decision based on this statistic.



 Comments   
Comment by Marko Mäkelä [ 2019-12-04 ]

Thank you, danblack! The counter is being exposed via the following interfaces:

show status like 'Innodb_buffer_pool_read_requests';
SELECT variable_value FROM information_schema.global_status
WHERE variable_name = 'innodb_buffer_pool_read_requests';
set global innodb_monitor_enable = buffer_pool_read_requests;
select * from information_schema.innodb_metrics where name='buffer_pool_read_requests';

It turns out that some of the buffer_pool_ counters are being updated via a data structure that was aimed for scalability, srv_stats, mostly using srv_stats_t::ulint_ctr_64_t that is partitioned for the purpose of improving scalability. Looking at srv_export_innodb_status(), we have references to both a local buf_pool_stat_t stat (9 counters) and a global srv_stats_t srv_stats (47 counters).

I tend to agree that it could make sense to remove this particular counter and possibly some other buffer_pool_ counters or members of buf_pool_stat_t. I would appreciate it if you could make some experiments.

One thing worth trying could be to move this counter to srv_stats_t. It might improve scalability, but it could also make things worse, because each srv_stats_t::ulint_ctr_64_t occupies 64 cache lines.

Comment by Valerii Kravchuk [ 2019-12-04 ]

It seems both community user and our optimizer team cared about innodb_buffer_pool_read_requests at some stages.

I can imagine optimizer cost model that takes "logical page reads" into account, not just number of rows to read.

I also suspect that some existing tools (like PMM or MONyog) may have this value monitored and presented on their graphs.

Comment by Daniel Black [ 2019-12-05 ]

Experiment 1: using ib_counter_t for _n_page_get_

https://github.com/grooverdan/mariadb-server/commit/761cd643568eec2a595a3ae28dc09ec9cca8e72c

perf report -g --no-children -i mariadb-10.5.0-sys-new-count-run_2.g.perf

Samples: 83K of event 'cycles:ppp', Event count (approx.): 330459601556601
  Overhead  Command  Shared Object        Symbol
+    6.51%  mysqld   mysqld               [.] buf_page_get_gen
+    5.14%  mysqld   mysqld               [.] rec_get_offsets_func
+    3.81%  mysqld   mysqld               [.] MYSQLparse
+    2.67%  mysqld   mysqld               [.] cmp_dtuple_rec_with_match_low

annotate on buf_page_get_gen

       │      _Z16buf_page_get_gen9page_id_tmmP11buf_block_tmPKcjP5mtr_tP7dberr_tb():                                                                                                                                                                                                                              ▒
  0.13 │        mov    %rcx,0x10(%rsp)                                                                                                                                                                                                                                                                             ▒
       │        mov    0x290(%rsp),%rcx                                                                                                                                                                                                                                                                            ▒
  0.13 │        mov    %rax,0x58(%rsp)                                                                                                                                                                                                                                                                             ▒
  0.13 │        mov    0x298(%rsp),%eax                                                                                                                                                                                                                                                                            ▒
  0.38 │        mov    %r8,0x18(%rsp)                                                                                                                                                                                                                                                                              ▒
  0.13 │        mov    %r9,0x70(%rsp)                                                                                                                                                                                                                                                                              ▒
  0.50 │        mov    %rcx,0x78(%rsp)                                                                                                                                                                                                                                                                             ▒
       │        mov    %eax,0x8c(%rsp)                                                                                                                                                                                                                                                                             ▒
       │        mov    %fs:0x28,%rax                                                                                                                                                                                                                                                                               ▒
       │        mov    %rax,0x238(%rsp)                                                                                                                                                                                                                                                                            ▒
       │        xor    %eax,%eax                                                                                                                                                                                                                                                                                   ▒
       │        mov    %rdi,%rax                                                                                                                                                                                                                                                                                   ▒
       │        shr    $0x20,%rax                                                                                                                                                                                                                                                                                  ▒
  0.13 │        mov    %rax,0x50(%rsp)                                                                                                                                                                                                                                                                             ▒
  0.13 │        mov    %eax,0x28(%rsp)                                                                                                                                                                                                                                                                             ▒
       │      _Z12buf_pool_get9page_id_t():                                                                                                                                                                                                                                                                        ▒
       │              ulint           ignored_page_no = page_id.page_no() >> 6;                                                                                                                                                                                                                                    ▒
  0.13 │        shr    $0x6,%eax                                                                                                                                                                                                                                                                                   ▒
       │      _Z16buf_page_get_gen9page_id_tmmP11buf_block_tmPKcjP5mtr_tP7dberr_tb():                                                                                                                                                                                                                              ▒
       │        add    %ebx,%eax                                                                                                                                                                                                                                                                                   ▒
  0.13 │        mov    %rax,0x68(%rsp)                                                                                                                                                                                                                                                                             ▒
       │      _Z12buf_pool_get9page_id_t():                                                                                                                                                                                                                                                                        ▒
       │              ulint           i = id.fold() % srv_buf_pool_instances;                                                                                                                                                                                                                                      ▒
  1.00 │        divq   (%rsi)                                                                                                                                                                                                                                                                                      ▒
       │                                                                                                                                                                                                                                                                                                           ▒
       │              return(&buf_pool_ptr[i]);                                                                                                                                                                                                                                                                    ▒
  0.13 │        imul   $0x2580,%rdx,%rdx                                                                                                                                                                                                                                                                           ▒
  0.13 │        add    buf_pool_ptr,%rdx                                                                                                                                                                                                                                                                           ▒
       │      _Z16buf_page_get_gen9page_id_tmmP11buf_block_tmPKcjP5mtr_tP7dberr_tb():                                                                                                                                                                                                                              ▒
       │              ut_ad(!allow_ibuf_merge                                                                                                                                                                                                                                                                      ▒
       │                    || mode == BUF_GET                                                                                                                                                                                                                                                                     ▒
       │                    || mode == BUF_GET_IF_IN_POOL                                                                                                                                                                                                                                                          ▒
       │                    || mode == BUF_GET_IF_IN_POOL_OR_WATCH);                                                                                                                                                                                                                                               ▒
       │                                                                                                                                                                                                                                                                                                           ▒
       │              if (err) {                                                                                                                                                                                                                                                                                   ▒
       │        test   %rcx,%rcx                                                                                                                                                                                                                                                                                   ▒
       │      _Z12buf_pool_get9page_id_t():                                                                                                                                                                                                                                                                        ▒
       │        mov    %rdx,0x8(%rsp)                                                                                                                                                                                                                                                                              ▒
       │      _Z16buf_page_get_gen9page_id_tmmP11buf_block_tmPKcjP5mtr_tP7dberr_tb():                                                                                                                                                                                                                              ▒
       │      ↓ je     c19866 <buf_page_get_gen(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned int, mtr_t*, dberr_t*, bool)+0xb6>                                                                                                                                     ▒
       │                      *err = DB_SUCCESS;                                                                                                                                                                                                                                                                   ▒
       │        movl   $0x0,(%rcx)                                                                                                                                                                                                                                                                                 ▒
       │      _Z7__rdtscv():                                                                                                                                                                                                                                                                                       ▒
       │      /* rdtsc */                                                                                                                                                                                                                                                                                          ▒
       │      extern __inline unsigned long long                                                                                                                                                                                                                                                                   ▒
       │      __attribute__((__gnu_inline__, __always_inline__, __artificial__))                                                                                                                                                                                                                                   ▒
       │      __rdtsc (void)                                                                                                                                                                                                                                                                                       ▒
       │      {                                                                                                                                                                                                                                                                                                    ▒
       │        return __builtin_ia32_rdtsc ();                                                                                                                                                                                                                                                                    ▒
  0.88 │  b6:   rdtsc                                                                                                                                                                                                                                                                                              ▒
       │        shl    $0x20,%rdx                                                                                                                                                                                                                                                                                  ▒
       │      get_rnd_value():                                                                                                                                                                                                                                                                                     ▒
       │      static inline size_t                                                                                                                                                                                                                                                                                 ▒
       │      get_rnd_value()                                                                                                                                                                                                                                                                                      ▒
       │      {                                                                                                                                                                                                                                                                                                    ▒
       │              size_t c = static_cast<size_t>(my_timer_cycles());                                                                                                                                                                                                                                           ▒
       │                                                                                                                                                                                                                                                                                                           ▒
       │              if (c != 0) {                                                                                                                                                                                                                                                                                ▒
       │        or     %rdx,%rax                                                                                                                                                                                                                                                                                   ▒
       │      ↓ jne    c19876 <buf_page_get_gen(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned int, mtr_t*, dberr_t*, bool)+0xc6>                                                                                                                                     ▒
       │              }                                                                                                                                                                                                                                                                                            ▒
       │                                                                                                                                                                                                                                                                                                           ▒
       │              /* We may go here if my_timer_cycles() returns 0,                                                                                                                                                                                                                                            ▒
       │              so we have to have the plan B for the counter. */                                                                                                                                                                                                                                            ▒
       │      #if !defined(_WIN32)                                                                                                                                                                                                                                                                                 ▒
       │              return (size_t)os_thread_get_curr_id();                                                                                                                                                                                                                                                      ▒
       │      → callq  os_thread_get_curr_id()                                                                                                                                                                                                                                                                     ▒
       │      _Z16buf_page_get_gen9page_id_tmmP11buf_block_tmPKcjP5mtr_tP7dberr_tb():                                                                                                                                                                                                                              ▒
       │            }                                                                                                                                                                                                                                                                                              ▒
       │                                                                                                                                                                                                                                                                                                           ▒
       │            _GLIBCXX_ALWAYS_INLINE __int_type                                                                                                                                                                                                                                                              ▒
       │            fetch_add(__int_type __i,                                                                                                                                                                                                                                                                      ▒
       │                      memory_order __m = memory_order_seq_cst) noexcept                                                                                                                                                                                                                                    ▒
       │            { return __atomic_fetch_add(&_M_i, __i, __m); }                                                                                                                                                                                                                                                ▒
       │  c6:   mov    0x8(%rsp),%rsi                                                                                                                                                                                                                                                                              ▒
       │                                                                                                                                                                                                                                                                                                           ▒
       │              /** Add to the counter.                                                                                                                                                                                                                                                                      ▒
       │              @param[in]      index   a reasonably thread-unique identifier                                                                                                                                                                                                                                ▒
       │              @param[in]      n       amount to be added */                                                                                                                                                                                                                                                ▒
       │              void add(size_t index, Type n) {                                                                                                                                                                                                                                                             ▒
       │                      index = index % N;                                                                                                                                                                                                                                                                   ▒
       │        and    $0x3f,%eax                                                                                                                                                                                                                                                                                  ▒
  0.38 │        shl    $0x6,%rax                                                                                                                                                                                                                                                                                   ▒
       │        lea    0x1c0(%rsi,%rax,1),%rax                                                                                                                                                                                                                                                                     ▒
 15.79 │        lock   addq   $0x1,(%rax)                                                                                                                                                                                                                                                                          ▒
       │        mov    0x50(%rsp),%eax                                                                                                                                                                                                                                                                             ▒
       │      buf_page_hash_lock_get():                                                                                                                                                                                                                                                                            ▒

Adding still has a significant conflict and intersting this is a relaxed store which still has a `lock` word.

Looking at L1-dcache-load-misses as opposed to frequency based samples

perf record -o ${name}_L1-dcache-load-misses -e L1-dcache-load-misses  -p $pid -- sleep 2

perf report -i mariadb-10.5.0-sys-new-count-run_2_L1-dcache-load-misses

Samples: 182K of event 'L1-dcache-load-misses', Event count (approx.): 2186919945
Overhead  Command  Shared Object        Symbol
   6.39%  mysqld   mysqld               [.] MYSQLparse
   4.99%  mysqld   mysqld               [.] buf_page_get_gen
   3.42%  mysqld   mysqld               [.] rec_get_offsets_func
   1.59%  mysqld   mysqld               [.] page_cur_search_with_match
   1.43%  mysqld   mysqld               [.] btr_cur_search_to_nth_level_func

Annotate on buf_page_get_gen from perf report -i mariadb-10.5.0-sys-new-count-run_2_L1-dcache-load-misses

       │      _Z7__rdtscv():                                                                                                                                                                                                                                                                                       ▒
       │      /* rdtsc */                                                                                                                                                                                                                                                                                          ▒
       │      extern __inline unsigned long long                                                                                                                                                                                                                                                                   ▒
       │      __attribute__((__gnu_inline__, __always_inline__, __artificial__))                                                                                                                                                                                                                                   ▒
       │      __rdtsc (void)                                                                                                                                                                                                                                                                                       ▒
       │      {                                                                                                                                                                                                                                                                                                    ▒
       │        return __builtin_ia32_rdtsc ();                                                                                                                                                                                                                                                                    ▒
       │  b6:   rdtsc                                                                                                                                                                                                                                                                                              ▒
  0.40 │        shl    $0x20,%rdx                                                                                                                                                                                                                                                                                  ▒
       │      get_rnd_value():                                                                                                                                                                                                                                                                                     ▒
       │      static inline size_t                                                                                                                                                                                                                                                                                 ▒
       │      get_rnd_value()                                                                                                                                                                                                                                                                                      ▒
       │      {                                                                                                                                                                                                                                                                                                    ▒
       │              size_t c = static_cast<size_t>(my_timer_cycles());                                                                                                                                                                                                                                           ▒
       │                                                                                                                                                                                                                                                                                                           ▒
       │              if (c != 0) {                                                                                                                                                                                                                                                                                ▒
       │        or     %rdx,%rax                                                                                                                                                                                                                                                                                   ▒
  0.04 │      ↓ jne    c19876 <buf_page_get_gen(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned int, mtr_t*, dberr_t*, bool)+0xc6>                                                                                                                                     ▒
       │              }                                                                                                                                                                                                                                                                                            ▒
       │                                                                                                                                                                                                                                                                                                           ▒
       │              /* We may go here if my_timer_cycles() returns 0,                                                                                                                                                                                                                                            ▒
       │              so we have to have the plan B for the counter. */                                                                                                                                                                                                                                            ▒
       │      #if !defined(_WIN32)                                                                                                                                                                                                                                                                                 ▒
       │              return (size_t)os_thread_get_curr_id();                                                                                                                                                                                                                                                      ▒
       │      → callq  os_thread_get_curr_id()                                                                                                                                                                                                                                                                     ▒
       │      _Z16buf_page_get_gen9page_id_tmmP11buf_block_tmPKcjP5mtr_tP7dberr_tb():                                                                                                                                                                                                                              ▒
       │            }                                                                                                                                                                                                                                                                                              ▒
       │                                                                                                                                                                                                                                                                                                           ▒
       │            _GLIBCXX_ALWAYS_INLINE __int_type                                                                                                                                                                                                                                                              ▒
       │            fetch_add(__int_type __i,                                                                                                                                                                                                                                                                      ▒
       │                      memory_order __m = memory_order_seq_cst) noexcept                                                                                                                                                                                                                                    ▒
       │            { return __atomic_fetch_add(&_M_i, __i, __m); }                                                                                                                                                                                                                                                ▒
       │  c6:   mov    0x8(%rsp),%rsi                                                                                                                                                                                                                                                                              ▒
       │                                                                                                                                                                                                                                                                                                           ▒
       │              /** Add to the counter.                                                                                                                                                                                                                                                                      ▒
       │              @param[in]      index   a reasonably thread-unique identifier                                                                                                                                                                                                                                ▒
       │              @param[in]      n       amount to be added */                                                                                                                                                                                                                                                ▒
       │              void add(size_t index, Type n) {                                                                                                                                                                                                                                                             ▒
       │                      index = index % N;                                                                                                                                                                                                                                                                   ▒
       │        and    $0x3f,%eax                                                                                                                                                                                                                                                                                  ▒
       │        shl    $0x6,%rax                                                                                                                                                                                                                                                                                   ▒
  0.08 │        lea    0x1c0(%rsi,%rax,1),%rax                                                                                                                                                                                                                                                                     ▒
  0.02 │        lock   addq   $0x1,(%rax)                                                                                                                                                                                                                                                                          ▒
 18.70 │        mov    0x50(%rsp),%eax                                                                                                                                                                                                                                                                             ▒
       │      buf_page_hash_lock_get():                                                                                                                                                                                                                                                                            ▒
       │      /** Get appropriate page_hash_loc

I can't tell if the 18% is a minor off by one instruction accounting error or a cache miss actuall next fetch.

With 650 connections active resulting in 1098651.18 tpm (compared to 1109887.63 tpm before change) its quite possible that rdtsc is getting the the same value on multiple cores/threads. thread_local might be an option however as a C++ 11 feature which currently is only used in tokudb/Windows.

valerii I welcome any statement that could show its useful. I haven't looked at the optimizer. I assume that monitoring tool can work without crashing if a value is removed, but yes its a consideration and should be in release notes.

Comment by Marko Mäkelä [ 2020-05-06 ]

We actually do C++11 in MariaDB Server 10.4 and later. In 10.5, we use even more features, including thread_local.

If I remember correctly, also wlad mentioned in the past that the counter could be useful. Myself, I am always happy to remove unnecessary code.

danblack, did you instantiate ib_counter_t for 64 slots?

I wonder if it might make sense to reformat srv_stats so that instead of it containing a number fields one cache line apart from each other, we would basically have a number of copies of cache-line-aligned srv_stats? In that way, the total memory footprint should be smaller (less memory wasted on cache line alignment), and if one CPU is updating many counters, those counters could reside in the same cache line, which could reduce cache line traffic. But, it could also lead to false sharing if the get_rnd_value() leads multiple threads to updating the same copy of the srv_stats.

One more raw idea is to have thread_local srv_stats, possibly using std::atomic, and then have the reader traverse all the thread_local copies for reporting. The reporting should be rather seldom compared to the updates.

Comment by Daniel Black [ 2020-05-08 ]

64 slot ib_counter_t was probably Sunny looking at the logs. It seems to have been copied into MyRocksDB too.

I'm all for reducing wasted space of cache size alignment. So sure, either multiple copies or thread local.

If multiple copies, can the `thd_[s|g]et_ha_data` be used to maintain a fixed index of the copy per THD rather than `get_rnd_value` which grossly on non-x86 uses `os_thread_get_curr_id`. A `THD->query_id % X` to round robin the allocation optionally extending `srv_stats` could also contain an `inuse_count` to prevent conflicts for a limited number of tries if it can be tied to a THD.

Is user global/session THD stats something easily extended? maybe the varying storage engines installed makes this less possible. Are engine session stats useful on their own rather than just being a deconfliction strategy?

I've been assuming `srv_stats` is largely SQL THD incremented rather than innodb/reader/writer threads but if there's a difference need to consider the splitting.

Comment by Marko Mäkelä [ 2020-05-13 ]

I see that in MySQL 8.0.20, yet another sharding mechanism was introduced, instead of using the pre-existing

typedef ib_counter_t<ulint, 64> ulint_ctr_64_t;

danblack, I would love to see how a C++11 thread_local replacement of ib_counter_t and friends would work. In 10.5, the keyword thread_local is already being used.

Comment by Marko Mäkelä [ 2020-06-08 ]

I can pretty much confirm this on a 64-client read-only workload (on a machine with 2 NUMA nodes and 2×16 threads). The perf record output is highlighting the following in buf_page_get_low() (which buf_page_get_gen() was renamed to in MDEV-21572):

10.5 d3681335b18b82cf1cfb090ce1938de2ee6b21fc

  8.94 │        addq   $0x1,buf_pool+0x150

An even worse hog was highlighted in ha_innobase::index_read():

10.5 d3681335b18b82cf1cfb090ce1938de2ee6b21fc

 21.93 │       lock   addq   $0x1,(%rax,%rdx,1)

This is one of the two srv_stats calls:

	switch (ret) {
	case DB_SUCCESS:
		error = 0;
		table->status = 0;
		if (m_prebuilt->table->is_system_db) {
			srv_stats.n_system_rows_read.add(
				thd_get_thread_id(m_prebuilt->trx->mysql_thd), 1);
		} else {
			srv_stats.n_rows_read.add(
				thd_get_thread_id(m_prebuilt->trx->mysql_thd), 1);
		}
		break;

For comparison, the CPU time that buf_page_get_low() is spending in the following instructions is justified:

10.5 d3681335b18b82cf1cfb090ce1938de2ee6b21fc

14.01 │        mov    0x0(%r13),%eax
11.80 │        lock   cmpxchg %edx,0x0(%r13)

Those instructions are part of

rw_lock_t* hash_lock = buf_pool.page_hash_lock<false>(fold);

which could be alleviated by increasing srv_n_page_hash_locks from the default value 16.
The debug (or UNIV_PERF_DEBUG) parameter innodb_page_hash_locks would allow tuning that.

Comment by Roel Van de Paar [ 2020-12-02 ]

Set prio to critical as many of the MONITOR_* affect-tests bugs seen during TSAN will be fixed as part of this bug.
MDEV-24330 MDEV-24332 MDEV-24329

Comment by Marko Mäkelä [ 2020-12-04 ]

I think that we should minimize the use of srv_stats and try to use normal counters in those cases where a mutex-protected critical section already exists. For example, MDEV-23399 replaced srv_stats.buf_pool_flushed with a normal variable buf_flush_page_count that is protected by buf_pool.mutex, and MDEV-24350 will something similar with buf_dblwr.mutex.

All use of srv_stats must be carefully reviewed, and anything that can be replaced or removed should be removed. That data structure is using 1 or 64 cache lines per variable, which is pretty intensive for the memory bus.

Comment by Krunal Bauskar [ 2021-03-06 ]

Just to keep it posted. We tried the distributed-counter patch against 10.6 and here are the numbers with 4 numa machine that reported heavy contention.

baseline patched (counter) %
ro-uni 802363 985406 23
ro-zip 812435 1036265 28

Comment by Krunal Bauskar [ 2021-03-08 ]

There was a suggestion (from Marko) to evaluate this patch against 10.5

10.5 baseline patched (10.5) %
ps-uni 1054519 939498 -11
ro-uni 804907 988114 23
ps-zip 1048564 953964 -9
ro-zip 853427 1037203 22

Unfortunately, the said patch reports regression with 10.5 (for point-select only) so I would not recommend the said patch for 10.5 and naturally to the older version.

Same patch was re-evaluated with 10.6 using the head version (#b930a190)

4 numa (arm)
baseline patched (counter) %
ps-uni 1485252 1588903 7
ro-uni 804535 1022524 27
ps-zip 1541601 1615774 5
ro-zip 855737 1006159 18

Said patch continues to report improvement for point-select and read-only in the range of 5-7% for ps and 18-27% for read-only.

Given the said information patch is recommended for 10.6 onwards.

Comment by Krunal Bauskar [ 2021-03-08 ]

as requested x86 results. x86 is a 2 numa machine and as observed with even 2 numa arm it doesn't show a major improvement but no regression too. (infact 2% improvement).

2 numa (x86)
10.6 baseline patched %
ps-uni 996988 994899 0
ro-uni 591207 601881 2
ps-zip 1043517 1042304 0
ro-zip 629508 641010 2

Comment by Krunal Bauskar [ 2021-03-11 ]

We re-evaluated the latest 10.6, pr#1766 approach and tls approach. The attached graph has the needed details.

Comment by Krunal Bauskar [ 2021-03-12 ]

Given the multi-numa regression of tls patch we decided to study tls using single numa. Check the attached graph. On single numa performance drop is pretty less that confirms that tls approach in general on multi-numa has some overhead. (Wild guess could be cross tls page movement of 4KB).

Comment by Marko Mäkelä [ 2021-03-12 ]

Thank you, krunalbauskar! The simple change of PR#1766 will be in 10.6. Most notably, this will double the size of srv_stats.

There are some unused counters in srv_stats and some duplicated ones (various srv_stats.n_system_rows_ duplicating srv_stats.n_rows_). I am going to fix those separately, to reduce sizeof srv_stats.

Comment by Marko Mäkelä [ 2022-05-11 ]

I was thinking about this once more. We could distribute the counter among buf_pool.page_hash.array, one counter per cache line. Especially on ARMv8, we are currently wasting half of those 128-byte cache lines, ever since MDEV-26609. On ISAs with 64-byte cache line size, we would have to sacrifice one pointer in each cache line for a counter.

To assess the feasibility of this, I tested the impact of removing the buf_pool.stat.n_page_gets altogether. In a quick oltp_read_only benchmark on my system (dual Intel Xeon E5-2630 v4), the difference appeared to be smaller than the noise margin. In fact, for most of the time, the code was slower after I removed the counter. krunalbauskar, could you test this again?

Comment by Marko Mäkelä [ 2022-05-16 ]

For the record: Both me and krunalbauskar observed a slight regression after removing buf_pool.stat.n_page_gets. In other words, it appears to serve as a useful ‘throttle’ to prevent contention elsewhere. I think that it is still worth an experiment to distribute this counter in buf_pool.page_hash.array, but I am afraid that it could face a similar fate.

I filed MDEV-28539 and MDEV-28540 for removing some other srv_stats counters.

Generated at Thu Feb 08 09:05:26 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.