The integer load/store macros defined in include/my_byteorder.h have different implementations for x86[_64] (see include/byte_order_generic_x86.h and include/byte_order_generic_x86_64.h) and all other architectures (include/byte_order_generic.h).
This is unfortunate, because it discriminates against little-endian architectures other than x86, in particular ARM64.
The distinction should really be between big-endian and little-endian architectures. This is how it is currently implemented in MySQL 8.0, where most legacy *int*korr() and int*store() macros have been replaced with memcpy()-based inline functions; see https://github.com/mysql/mysql-server/commit/536ea313a6a71f9ed87f14d95e03e04e40ff5605
The rationale for using memcpy() looks a little inconclusive to me, but it works reasonably well: the compiler is usually smart enough to convert memcpy() into the most efficient implementation on all little-endian architectures.
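To illustrate the difference, here is a minimal sketch of the two styles of 4-byte load. The function names are made up for this example and are not taken from the MariaDB or MySQL sources. The byte-by-byte version is correct on any host. The memcpy() version returns the same value only on little-endian hosts, where compilers typically reduce it to a single load instruction.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name: portable byte-by-byte load of a little-endian
   32-bit value. Correct on any host, but not always compiled into a
   single load instruction. */
static inline uint32_t load4_bytewise(const unsigned char *p)
{
  return (uint32_t) p[0] |
         ((uint32_t) p[1] << 8) |
         ((uint32_t) p[2] << 16) |
         ((uint32_t) p[3] << 24);
}

/* Hypothetical name: memcpy()-based load. Safe for unaligned pointers.
   Matches load4_bytewise() only on little-endian hosts. */
static inline uint32_t load4_memcpy(const unsigned char *p)
{
  uint32_t v;
  memcpy(&v, p, sizeof v);
  return v;
}
```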
This is a request to:
- Use the same optimized implementations of the load/store macros on all little-endian architectures, including ARM64, as are used on x86;
- Optimize the current x86 implementations even further for a few macros.
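As a rough sketch of the first item, the fast path could be selected by byte order rather than by CPU family. Everything named example_* or EXAMPLE_* below is hypothetical and not the actual content of my_byteorder.h. The sketch also assumes a compiler that predefines __BYTE_ORDER__ (GCC and Clang do); on other compilers it simply falls back to the portable path.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical macro: detect a little-endian host at compile time.
   Assumes the GCC/Clang predefined __BYTE_ORDER__ macros. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define EXAMPLE_LITTLE_ENDIAN 1
#else
#define EXAMPLE_LITTLE_ENDIAN 0
#endif

/* Store a 32-bit value in little-endian byte order. */
static inline void example_int4store(unsigned char *p, uint32_t v)
{
#if EXAMPLE_LITTLE_ENDIAN
  memcpy(p, &v, sizeof v);           /* same fast path on x86 and ARM64 */
#else
  p[0] = (unsigned char) v;          /* portable fallback */
  p[1] = (unsigned char) (v >> 8);
  p[2] = (unsigned char) (v >> 16);
  p[3] = (unsigned char) (v >> 24);
#endif
}

/* Load a 32-bit value stored in little-endian byte order. */
static inline uint32_t example_uint4korr(const unsigned char *p)
{
#if EXAMPLE_LITTLE_ENDIAN
  uint32_t v;
  memcpy(&v, p, sizeof v);           /* same fast path on x86 and ARM64 */
  return v;
#else
  return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
         ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
#endif
}
```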
I'm not yet sure if I'm allowed to share benchmark results and contribute code. But I'm attaching a Godbolt link as a testcase that demonstrates the optimization opportunities. Just build it and run it on any available ARM64 and x86 machines: