Details
- Bug
- Status: Open
- Major
- Resolution: Unresolved
- 10.11, 11.4, 11.8
- ARMv8.1-A
- Related to performance
Description
I thought that I would check how the atomic memory access operations in the executables that we distribute are actually implemented. I built MariaDB Server 10.11 in a Debian 12 environment:
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo /source
gdb sql/mariadbd
disassemble mtr_t::finish_writer<false>
It turns out that -moutline-atomics, which was the way to enable the Large System Extensions (LSE), actually checks the availability of the CPU feature on every single function call. On top of that, each simple instruction, such as ldadd for std::atomic::fetch_add(), is replaced with a call to a library function.
Here is an excerpt of the above code:
0x0000000000cab42c <+76>: cmp x2, #0x0
0x0000000000cab430 <+80>: csel x27, x27, x1, eq // eq = none
0x0000000000cab434 <+84>: ubfx x22, x0, #4, #1
0x0000000000cab438 <+88>: mov x1, x23
0x0000000000cab43c <+92>: mov x0, x20
0x0000000000cab440 <+96>: bl 0xee86d0 <__aarch64_ldadd8_relax>
0x0000000000cab444 <+100>: and x0, x0, #0x3ffffffff
0x0000000000cab448 <+104>: cmp x19, x0
0x0000000000cab44c <+108>: b.ls 0xcab570 <_ZN5mtr_t13finish_writerILb0EEESt4pairImNS_16page_flush_aheadEEPS_m+400> // b.plast
If I go ahead and compile the code with
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_{C,CXX}_FLAGS='-march=armv8.1-a' /source
then the function call will be replaced with the bare instruction, bringing it closer to what we are running on AMD64:
0x0000000000ca807c <+76>: cmp x2, #0x0
0x0000000000ca8080 <+80>: csel x27, x27, x1, eq // eq = none
0x0000000000ca8084 <+84>: ubfx x23, x0, #4, #1
0x0000000000ca8088 <+88>: ldadd x20, x0, [x21]
0x0000000000ca808c <+92>: and x0, x0, #0x3ffffffff
0x0000000000ca8090 <+96>: cmp x19, x0
0x0000000000ca8094 <+100>: b.ls 0xca81b8 <_ZN5mtr_t13finish_writerILb0EEESt4pairImNS_16page_flush_aheadEEPS_m+392> // b.plast
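This is 8 bytes shorter in the caller: the two mov instructions that set up the call arguments are gone, and the bl is replaced by the ldadd itself. The library function tests a flag byte for the availability of the CPU feature on each and every call (the ldrb at +8 and the cbz at +12). The LSE version is at +16, and the fallback, a load-exclusive/store-exclusive (ldxr/stxr) retry loop, starts at +28.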
Dump of assembler code for function __aarch64_ldadd8_relax:
0x0000000000ee86d0 <+0>: bti c
0x0000000000ee86d4 <+4>: adrp x16, 0x2138000 <_ZN4ShowL17user_stats_fieldsE+1080>
0x0000000000ee86d8 <+8>: ldrb w16, [x16, #3369]
0x0000000000ee86dc <+12>: cbz w16, 0xee86e8 <__aarch64_ldadd8_relax+24>
0x0000000000ee86e0 <+16>: ldadd x0, x0, [x1]
0x0000000000ee86e4 <+20>: ret
0x0000000000ee86e8 <+24>: mov x16, x0
0x0000000000ee86ec <+28>: ldxr x0, [x1]
0x0000000000ee86f0 <+32>: add x17, x0, x16
0x0000000000ee86f4 <+36>: stxr w15, x17, [x1]
0x0000000000ee86f8 <+40>: cbnz w15, 0xee86ec <__aarch64_ldadd8_relax+28>
0x0000000000ee86fc <+44>: ret
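To make the control flow of the helper easier to follow, here is a rough portable C++ model of what the disassembly of __aarch64_ldadd8_relax appears to do. It is only a sketch and not the libgcc source: the names have_lse_flag, outlined_ldadd8_relax and demo are made up for illustration, and compare_exchange_weak merely stands in for the ldxr/stxr pair, which has no direct C++ equivalent.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the flag byte that the helper loads at +8.
// Here it is a plain global so that both paths can be exercised.
static bool have_lse_flag = false;

// Model of the helper: adds `add` to `target` and returns the old value.
// The argument order mirrors the registers in the dump (x0 = value, x1 = address).
std::uint64_t outlined_ldadd8_relax(std::uint64_t add,
                                    std::atomic<std::uint64_t> &target)
{
  // +8..+12: the flag is loaded and tested on every call.
  if (have_lse_flag)
    // +16: the single-instruction path (ldadd on an LSE-capable CPU).
    return target.fetch_add(add, std::memory_order_relaxed);

  // +28..+40: retry loop. On failure compare_exchange_weak reloads `old`,
  // much like the loop branching back to the ldxr.
  std::uint64_t old = target.load(std::memory_order_relaxed);
  while (!target.compare_exchange_weak(old, old + add,
                                       std::memory_order_relaxed,
                                       std::memory_order_relaxed))
  {
  }
  return old;
}

// Usage: both paths return the previous value and apply the addition.
void demo()
{
  std::atomic<std::uint64_t> v{10};
  have_lse_flag = false;
  assert(outlined_ldadd8_relax(5, v) == 10);
  have_lse_flag = true;
  assert(outlined_ldadd8_relax(7, v) == 15);
  assert(v.load() == 22);
}
```

The point of the model is the shape of the fast path: even when LSE is available, every atomic operation pays for a call, a flag load and a conditional branch before reaching the one instruction that does the work.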
I think that it would be better to instantiate the mtr_t::finish_writer template, as well as some other functions, for multiple ISA targets. This particular function is already invoked via the function pointer mtr_t::finisher.
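A minimal sketch of that idea, assuming GCC on Linux: compile one copy of a function for the baseline ISA and one with LSE enabled, then pick between them once at startup through a function pointer. All names here (counter, reserve_generic, reserve_lse, reserve) are hypothetical and this is not the MariaDB code. The per-function target("+lse") attribute and the HWCAP_ATOMICS check are my assumptions about how it could be done; whether the compiler then emits a bare ldadd in the LSE copy should be verified in the disassembly.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

#if defined(__aarch64__) && defined(__linux__)
# include <sys/auxv.h>  // getauxval()
# include <asm/hwcap.h> // HWCAP_ATOMICS
// Ask the kernel whether the CPU implements the LSE atomics.
static bool cpu_has_lse() { return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0; }
#else
// Other platforms: always take the generic path in this sketch.
static bool cpu_has_lse() { return false; }
#endif

#if defined(__aarch64__) && defined(__GNUC__) && !defined(__clang__)
// Assumption: GCC accepts a per-function ISA feature modifier like this.
# define TARGET_LSE __attribute__((target("+lse")))
#else
# define TARGET_LSE
#endif

// Hypothetical shared counter, standing in for something like an LSN.
static std::atomic<std::uint64_t> counter{0};

// Baseline copy: built with whatever the default flags are.
static std::uint64_t reserve_generic(std::uint64_t n)
{
  return counter.fetch_add(n, std::memory_order_relaxed);
}

// Same body, but compiled with LSE enabled for this function only.
TARGET_LSE static std::uint64_t reserve_lse(std::uint64_t n)
{
  return counter.fetch_add(n, std::memory_order_relaxed);
}

using reserve_fn = std::uint64_t (*)(std::uint64_t);

// The CPU check runs once, here, instead of inside every atomic operation.
static const reserve_fn reserve = cpu_has_lse() ? reserve_lse : reserve_generic;

// Usage: callers go through the pointer and do not care which copy runs.
void demo_dispatch()
{
  const std::uint64_t before = reserve(5);
  assert(reserve(3) == before + 5);
}
```

Because the caller already pays for an indirect call, selecting the variant at that level moves the feature check out of the hot path entirely.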
In a non-scientific experiment (a single-threaded test on persistent storage), I observed a 2% performance improvement. The environment that I used for the analysis is not suitable for performance testing.
Issue Links
- relates to MDEV-21923 LSN allocation is a bottleneck (Closed)