MariaDB Server / MDEV-37147

ARMv8 -moutline-atomics is suboptimal

Details

    • Related to performance

    Description

      I thought that I would check how the atomic memory access operations in the executables that we distribute are actually implemented. I built MariaDB Server 10.11 in a Debian 12 environment:

      cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo /source
      gdb sql/mariadbd
      

      disassemble mtr_t::finish_writer<false>
      

      It turns out that code generated with -moutline-atomics, which was the way to enable Large System Extensions (LSE), checks the availability of the CPU feature on every single call. In addition, each simple instruction, such as ldadd for std::atomic::fetch_add(), is replaced with a call to a library function.

      Here is an excerpt of the above code:

         0x0000000000cab42c <+76>:	cmp	x2, #0x0
         0x0000000000cab430 <+80>:	csel	x27, x27, x1, eq	// eq = none
         0x0000000000cab434 <+84>:	ubfx	x22, x0, #4, #1
         0x0000000000cab438 <+88>:	mov	x1, x23
         0x0000000000cab43c <+92>:	mov	x0, x20
         0x0000000000cab440 <+96>:	bl	0xee86d0 <__aarch64_ldadd8_relax>
         0x0000000000cab444 <+100>:	and	x0, x0, #0x3ffffffff
         0x0000000000cab448 <+104>:	cmp	x19, x0
         0x0000000000cab44c <+108>:	b.ls	0xcab570 <_ZN5mtr_t13finish_writerILb0EEESt4pairImNS_16page_flush_aheadEEPS_m+400>  // b.plast
      

      If I go ahead and compile the code with

      cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_{C,CXX}_FLAGS='-march=armv8.1-a' /source
      

      then the function call will be replaced with the bare instruction, bringing it closer to what we are running on AMD64:

       
         0x0000000000ca807c <+76>:    cmp x2, #0x0
         0x0000000000ca8080 <+80>:    csel    x27, x27, x1, eq    // eq = none
         0x0000000000ca8084 <+84>:    ubfx    x23, x0, #4, #1
         0x0000000000ca8088 <+88>:    ldadd   x20, x0, [x21]
         0x0000000000ca808c <+92>:    and x0, x0, #0x3ffffffff
         0x0000000000ca8090 <+96>:    cmp x19, x0
         0x0000000000ca8094 <+100>:   b.ls    0xca81b8 <_ZN5mtr_t13finish_writerILb0EEESt4pairImNS_16page_flush_aheadEEPS_m+392>  // b.plast
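
      The difference can probably be reproduced outside the server with a reduced example. The following is only a sketch: counter and reserve_bytes are hypothetical names, not MariaDB code. The assumption is GCC 10 or later targeting AArch64 Linux, where -moutline-atomics is on by default for baseline ARMv8.0-A builds.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical shared counter, standing in for a field that is updated
// with std::atomic::fetch_add() in the server.
std::atomic<uint64_t> counter{0};

// Returns the value that the counter held before the addition,
// which is also what the ldadd instruction leaves in its result register.
uint64_t reserve_bytes(uint64_t len)
{
  // Relaxed ordering corresponds to the "_relax" suffix of the helper.
  return counter.fetch_add(len, std::memory_order_relaxed);
}
```

      Compiling this with g++ -O2 -S and then again with -march=armv8.1-a added should show a bl to __aarch64_ldadd8_relax in the first case and an inline ldadd in the second.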
      

      This is 8 bytes shorter in the caller. The library function seems to detect the availability of the CPU feature on each and every call: it loads a flag byte at +4 and +8 and tests it at +12. We have the LSE version at +16 and the non-LSE fallback, a load-exclusive/store-exclusive retry loop, at +28.

      Dump of assembler code for function __aarch64_ldadd8_relax:
         0x0000000000ee86d0 <+0>: bti c
         0x0000000000ee86d4 <+4>: adrp    x16, 0x2138000 <_ZN4ShowL17user_stats_fieldsE+1080>
         0x0000000000ee86d8 <+8>: ldrb    w16, [x16, #3369]
         0x0000000000ee86dc <+12>:    cbz w16, 0xee86e8 <__aarch64_ldadd8_relax+24>
         0x0000000000ee86e0 <+16>:    ldadd   x0, x0, [x1]
         0x0000000000ee86e4 <+20>:    ret
         0x0000000000ee86e8 <+24>:    mov x16, x0
         0x0000000000ee86ec <+28>:    ldxr    x0, [x1]
         0x0000000000ee86f0 <+32>:    add x17, x0, x16
         0x0000000000ee86f4 <+36>:    stxr    w15, x17, [x1]
         0x0000000000ee86f8 <+40>:    cbnz    w15, 0xee86ec <__aarch64_ldadd8_relax+28>
         0x0000000000ee86fc <+44>:    ret
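
      The control flow of the helper can be modelled in portable C++. This is a sketch for illustration, not the libgcc source: have_lse_model and ldadd8_relax_model are hypothetical names, and the ldxr/stxr pair is approximated with compare_exchange_weak(), which may likewise fail spuriously and be retried.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the flag byte that the helper loads at +4 and +8.
bool have_lse_model = false;

// Same argument order as the helper: x0 = value to add, x1 = address.
uint64_t ldadd8_relax_model(uint64_t val, std::atomic<uint64_t> *ptr)
{
  // +12: the flag is tested on every call.
  if (have_lse_model)
    // +16: stands in for the single ldadd instruction.
    return ptr->fetch_add(val, std::memory_order_relaxed);

  // +28..+40: retry loop; repeat until the conditional store succeeds.
  uint64_t old = ptr->load(std::memory_order_relaxed);
  while (!ptr->compare_exchange_weak(old, old + val,
                                     std::memory_order_relaxed,
                                     std::memory_order_relaxed))
  {
    // On failure, old has been refreshed with the current value.
  }
  return old;
}
```

      Both paths return the previous value; the cost being discussed is the extra call, load and branch in front of the one instruction that does the work.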
      

      I think that it would be better to instantiate the mtr_t::finish_writer template, as well as some other functions, for multiple ISA targets. This particular function is already invoked via the function pointer mtr_t::finisher.
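
      A minimal sketch of that idea follows. It is not the MariaDB implementation, and add_generic, add_lse, cpu_has_lse and add_dispatch are hypothetical names. It assumes that GCC on AArch64 accepts __attribute__((target("+lse"))) and that Linux reports LSE through getauxval(AT_HWCAP) and HWCAP_ATOMICS. On other platforms the sketch falls back to the generic version.

```cpp
#include <atomic>
#include <cstdint>

#if defined(__linux__) && defined(__aarch64__) && \
    defined(__GNUC__) && !defined(__clang__)
# include <sys/auxv.h>    // getauxval(), AT_HWCAP
# ifndef HWCAP_ATOMICS
#  include <asm/hwcap.h>  // HWCAP_ATOMICS
# endif
# define SKETCH_HAVE_LSE_TARGET 1
# define SKETCH_TARGET_LSE __attribute__((target("+lse")))
#else
# define SKETCH_TARGET_LSE /* no per-function ISA selection here */
#endif

// Baseline version: may use the outlined helper on ARMv8.0-A builds.
static uint64_t add_generic(std::atomic<uint64_t> &v, uint64_t n)
{
  return v.fetch_add(n, std::memory_order_relaxed);
}

// Same source, compiled for a target that includes LSE (assumption:
// the compiler then emits ldadd inline instead of calling the helper).
SKETCH_TARGET_LSE
static uint64_t add_lse(std::atomic<uint64_t> &v, uint64_t n)
{
  return v.fetch_add(n, std::memory_order_relaxed);
}

// Detect the CPU feature; this is meant to run once, not per call.
static bool cpu_has_lse()
{
#ifdef SKETCH_HAVE_LSE_TARGET
  return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
  return false;
#endif
}

using add_fn = uint64_t (*)(std::atomic<uint64_t> &, uint64_t);

// Chosen once during static initialization, then called indirectly,
// in the spirit of the existing mtr_t::finisher function pointer.
static const add_fn add_dispatch = cpu_has_lse() ? add_lse : add_generic;
```

      The design choice is to pay for feature detection once, when the pointer is assigned, so that the body of the selected function contains no per-call check.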

      In a non-scientific experiment (a single-threaded test on persistent storage), I observed a 2% performance improvement. The environment that I used for the analysis is not suitable for performance testing.

            People

              marko Marko Mäkelä