Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-21212

buf_page_get_gen -> buf_pool->stat.n_page_gets++ is a cpu waste (0.5-1%)

Details

    Description

      Even on a 20 cpu Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz this single increment shows up significantly in a performance profile:

      In a TPC-C test (10.5.0-523879dd51c3bf2ad23295852304531d856cc869)

      perf report -g --no-children -i mariadb-10.5.0-event_warmup.g.perf

      +    5.30%  mysqld   mysqld               [.] rec_get_offsets_func                                                                                                                                                                                                                                                 ▒
      -    4.36%  mysqld   mysqld               [.] buf_page_get_gen                                                                                                                                                                                                                                                     ▒
         + 2.23% 0                                                                                                                                                                                                                                                                                                       ▒
           0.82% buf_page_get_gen                                                                                                                                                                                                                                                                                        ▒
         - 0.79% 0x1                                                                                                                                                                                                                                                                                                     ▒
            - 0.75% ha_innobase::records_in_range                                                                                                                                                                                                                                                                        ▒
                 buf_page_get_gen                                                                                                                                                                                                                                                                                        ▒
      +    3.46%  mysqld   mysqld               [.] l_find                                                                                                                                                                                                                                                               ▒
      +    3.18%  mysqld   mysqld               [.] MYSQLparse                                                                                                                                                                                                                                                           ▒
      +    2.68%  mysqld   mysqld               [.] cmp_dtuple_rec_with_match_low                                                                                                                                                                                                                                        ▒
      +    2.41%  mysqld   mysqld               [.] page_cur_search_with_match           
      

      As the second highest CPU consuming function, the increment shows up in ~16% of that time.

      annotate on bug_page_get_gen

             │      _Z16buf_page_get_gen9page_id_tmmP11buf_block_tmPKcjP5mtr_tP7dberr_tb():                                                                                                                                                                                                                              ▒
             │      ↓ je     c311db <buf_page_get_gen(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned int, mtr_t*, dberr_t*, bool)+0xbb>                                                                                                                                     ▒
             │                      *err = DB_SUCCESS;                                                                                                                                                                                                                                                                   ▒
             │        movl   $0x0,(%rsi)                                                                                                                                                                                                                                                                                 ▒
             │      #endif /* UNIV_DEBUG */                                                                                                                                                                                                                                                                              ▒
             │                                                                                                                                                                                                                                                                                                           ▒
             │              ut_ad(!mtr || !ibuf_inside(mtr)                                                                                                                                                                                                                                                              ▒
             │                    || ibuf_page_low(page_id, zip_size, FALSE, file, line, NULL));                                                                                                                                                                                                                         ▒
             │                                                                                                                                                                                                                                                                                                           ▒
             │              buf_pool->stat.n_page_gets++;                                                                                                                                                                                                                                                                ▒
        1.08 │  bb:   mov    0x10(%rsp),%rbx                                                                                                                                                                                                                                                                             ▒
             │        add    0x60(%rsp),%ecx                                                                                                                                                                                                                                                                             ▒
             │      ut_hash_ulint():                                                                                                                                                                                                                                                                                     ▒
             │              ulint    table_size)    /*!< in: hash table size */                                                                                                                                                                                                                                          ▒
             │      {                                                                                                                                                                                                                                                                                                    ▒
             │              ut_ad(table_size);                                                                                                                                                                                                                                                                           ▒
             │              key = key ^ UT_HASH_RANDOM_MASK2;                                                                                                                                                                                                                                                            ▒
             │                                                                                                                                                                                                                                                                                                           ▒
             │              return(key % table_size);                                                                                                                                                                                                                                                                    ▒
             │        xor    %edx,%edx                                                                                                                                                                                                                                                                                   ▒
             │      _Z16buf_page_get_gen9page_id_tmmP11buf_block_tmPKcjP5mtr_tP7dberr_tb():                                                                                                                                                                                                                              ▒
             │              ulint           retries = 0;                                                                                                                                                                                                                                                                 ▒
             │        movq   $0x0,0x38(%rsp)                                                                                                                                                                                                                                                                             ▒
             │              buf_pool->stat.n_page_gets++;                                                                                                                                                                                                                                                                ▒
       16.52 │        addq   $0x1,0x188(%rbx)                                                                                                                                                                                                                                                                            ▒
             │      buf_page_hash_lock_get():                                                                                                                                                                                                                                                                            ▒
             │      /** Get appropriate page_hash_lock. */                                                                                                                                                                                                                                                               ▒
             │      UNIV_INLINE                                                                                                                                                                                                                                                                                          ▒
             │      rw_lock_t*                                                                                                                                                                                                                                                                                           ▒
             │      buf_page_hash_lock_get(const buf_pool_t* buf_pool, const page_id_t& page_id)                                                                                                                                                                                                                         ▒
             │      {                                                                                                                                                                                                                                                                                                    ▒
             │              return hash_get_lock(buf_pool->page_hash, page_id.fold());                                                                                                                                                                                                                                   ▒
        0.36 │        mov    0xb0(%rbx),%rsi                                                                                                                                                                                                                                                                             ▒
             │      ut_hash_ulint():                                                                                                                                                                                                                                                                                     ▒
             │              key = key ^ UT_HASH_RANDOM_MASK2;                                                                                                                                                                                                                                                            ▒
             │        mov    %ecx,%eax                                                                                                                                                                                                                                                                                   ▒
             │        movzbl 0x18(%rsp),%ecx                                                                                                                                                                                                                                                                             ▒
             │        xor    $0x62946a4f,%eax                                                                                                                                                                                                                                                                            ▒
             │        mov    %rax,0x8(%rsp)                                                                                                                                                                                                                                                                              ▒
             │              return(key % table_size);                                                                                                                                                                                                                                                                    ▒
        6.28 │        divq   0x8(%rsi)                                                                                                                                                                                                                                                                                   ▒
             │      _Z17ut_2pow_remainderImET_S0_S0_():                                                                                                                                                                                                                                                                  ▒
             │      /*******************************************
      

      Recommend removing statistic counter. I'm not sure a DBA could make a decision based on this statistic.

      Attachments

        Issue Links

          Activity

            Thank you, krunalbauskar! The simple change of PR#1766 will be in 10.6. Most notably, this will double the size of srv_stats.

            There are some unused counters in srv_stats and some duplicated ones (various srv_stats.n_system_rows_ duplicating srv_stats.n_rows_). I am going to fix those separately, to reduce sizeof srv_stats.

            marko Marko Mäkelä added a comment - Thank you, krunalbauskar ! The simple change of PR#1766 will be in 10.6. Most notably, this will double the size of srv_stats . There are some unused counters in srv_stats and some duplicated ones (various srv_stats.n_system_rows_ duplicating srv_stats.n_rows_ ). I am going to fix those separately, to reduce sizeof srv_stats .

            I was thinking about this once more. We could distribute the counter among buf_pool.page_hash.array, one counter per cache line. Especially on ARMv8, we are currently wasting half of those 128-byte cache lines, ever since MDEV-26609. On ISAs with 64-byte cache line size, we would have to sacrifice one pointer in each cache line for a counter.

            To assess the feasibility of this, I tested the impact of removing the buf_pool.stat.n_page_gets altogether. In a quick oltp_read_only benchmark on my system (dual Intel Xeon E5-2630 v4), the difference appeared to be smaller than the noise margin. In fact, for most of the time, the code was slower after I removed the counter. krunalbauskar, could you test this again?

            marko Marko Mäkelä added a comment - I was thinking about this once more. We could distribute the counter among buf_pool.page_hash.array , one counter per cache line. Especially on ARMv8, we are currently wasting half of those 128-byte cache lines, ever since MDEV-26609 . On ISAs with 64-byte cache line size, we would have to sacrifice one pointer in each cache line for a counter. To assess the feasibility of this, I tested the impact of removing the buf_pool.stat.n_page_gets altogether. In a quick oltp_read_only benchmark on my system (dual Intel Xeon E5-2630 v4), the difference appeared to be smaller than the noise margin. In fact, for most of the time, the code was slower after I removed the counter. krunalbauskar , could you test this again?

            For the record: Both me and krunalbauskar observed a slight regression after removing buf_pool.stat.n_page_gets. In other words, it appears to serve as a useful ‘throttle’ to prevent contention elsewhere. I think that it is still worth an experiment to distribute this counter in buf_pool.page_hash.array, but I am afraid that it could face a similar fate.

            I filed MDEV-28539 and MDEV-28540 for removing some other srv_stats counters.

            marko Marko Mäkelä added a comment - For the record: Both me and krunalbauskar observed a slight regression after removing buf_pool.stat.n_page_gets . In other words, it appears to serve as a useful ‘throttle’ to prevent contention elsewhere. I think that it is still worth an experiment to distribute this counter in buf_pool.page_hash.array , but I am afraid that it could face a similar fate. I filed MDEV-28539 and MDEV-28540 for removing some other srv_stats counters.

            In MDEV-31558, monty removed the incrementing of this counter in buf_page_optimistic_get(), because it is not an ‘actual’ buf_pool.page_hash lookup but part of ‘optimistically restoring’ a cursor to a page, which is something that happens frequently until we might implement MDEV-16232. The statement-level instrumentation of MDEV-31558 turns out to have exposed MDEV-32286, which is either a performance bug related to accessing secondary indexes, or another proof that implmenting MDEV-17598 could be useful.

            Sharding the counter inside buf_pool.page_hash could turn read-only operations into writes and therefore ruin a main benefit of hardware memory transactions (MDEV-26769).

            For the record, we had compared https://github.com/MariaDB/server/pull/1766/ to a thread_local based implementation (and a small fixup). This could be worth revisiting. I just checked that frequently executing gettid() or the x86 rdrand instruction are not viable options; the former is a system call (not implemented as a vsyscall in linux-vdso.so.1), and rdseed ought to be an order of magnitude slower than rdtsc a.k.a. my_timer_cycles(). I wonder if we could simply hack something x86 or System V AMD64 ABI specific based on the fs register.

            marko Marko Mäkelä added a comment - In MDEV-31558 , monty removed the incrementing of this counter in buf_page_optimistic_get() , because it is not an ‘actual’ buf_pool.page_hash lookup but part of ‘optimistically restoring’ a cursor to a page, which is something that happens frequently until we might implement MDEV-16232 . The statement-level instrumentation of MDEV-31558 turns out to have exposed MDEV-32286 , which is either a performance bug related to accessing secondary indexes, or another proof that implmenting MDEV-17598 could be useful. Sharding the counter inside buf_pool.page_hash could turn read-only operations into writes and therefore ruin a main benefit of hardware memory transactions ( MDEV-26769 ). For the record, we had compared https://github.com/MariaDB/server/pull/1766/ to a thread_local based implementation (and a small fixup ). This could be worth revisiting. I just checked that frequently executing gettid() or the x86 rdrand instruction are not viable options; the former is a system call (not implemented as a vsyscall in linux-vdso.so.1 ), and rdseed ought to be an order of magnitude slower than rdtsc a.k.a. my_timer_cycles() . I wonder if we could simply hack something x86 or System V AMD64 ABI specific based on the fs register.

            I see that there is also a function sched_getcpu() that on my Linux system would appear to be implemented in linux-vdso.so.1. I single-stepped through that function, and it would seem to avoid executing the syscall fallback, at least on the second execution.

            I filed MDEV-34296 and MDEV-34297 related to this analysis. I do not think that it is feasible to completely replace buf_pool.stat.n_page_gets with a bunch of THD::handler_stats, because a THD object is not available in all threads that may execute InnoDB code.

            marko Marko Mäkelä added a comment - I see that there is also a function sched_getcpu() that on my Linux system would appear to be implemented in linux-vdso.so.1 . I single-stepped through that function, and it would seem to avoid executing the syscall fallback, at least on the second execution. I filed MDEV-34296 and MDEV-34297 related to this analysis. I do not think that it is feasible to completely replace buf_pool.stat.n_page_gets with a bunch of THD::handler_stats , because a THD object is not available in all threads that may execute InnoDB code.

            People

              kevg Eugene Kosov (Inactive)
              danblack Daniel Black
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.