[MDEV-23633] MY_RELAX_CPU performs unnecessary compare-and-swap on ARM Created: 2020-08-31  Updated: 2021-01-20  Resolved: 2020-09-04

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.2, 10.3, 10.4, 10.5
Fix Version/s: 10.4.16, 10.5.7

Type: Bug Priority: Major
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: ARM, ARMv8, performance
Environment:

ARMv8


Attachments: PNG File MDB-10.4.14-ARM-testing.png    
Issue Links:
Blocks
blocks MDEV-23635 Add notional delay (using existing ut... Open
Relates
relates to MDEV-14374 UT_DELAY code : Removing hardware bar... Closed
relates to MDEV-24630 MY_RELAX_CPU assembly instruction upg... Closed

 Description   

This is basically a re-request to fix MDEV-14374 as it was described. On target architectures where nothing special is available (not IA-32, AMD64, POWER), the macro UT_RELAX_CPU() would perform a dummy compare-and-swap operation.

According to krunalbauskar, this extra operation is visible in some benchmarks.



 Comments   
Comment by Krunal Bauskar [ 2020-08-31 ]

Marko,

On ARM as quoted on channel as per benchmarking switching to simple barrier (vs CAS) helps improve performance.
For other architecture, we should benchmark it and then decide.

Comment by Marko Mäkelä [ 2020-09-01 ]

krunalbauskar, thank you. I think that I would make it conditional to minimize the impact on other architectures:

diff --git a/storage/innobase/include/ut0ut.h b/storage/innobase/include/ut0ut.h
index a19f3db188d..72a9754c326 100644
--- a/storage/innobase/include/ut0ut.h
+++ b/storage/innobase/include/ut0ut.h
@@ -71,6 +71,9 @@ Created 1/20/1994 Heikki Tuuri
 #elif defined(__powerpc__) && defined __GLIBC__
 # include <sys/platform/ppc.h>
 # define UT_RELAX_CPU() __ppc_get_timebase()
+#elif defined __GNUC__ && (defined __arm__ || defined __aarch64__)
+/* Mainly, prevent the compiler from optimizing away delay loops */
+# define UT_RELAX_CPU() __asm__ __volatile__ ("":::"memory")
 #else
 # define UT_RELAX_CPU() do { \
      volatile int32    volatile_var; \

Comment by Krunal Bauskar [ 2020-09-01 ]
  • We tried evaluating the patch against 10.2 and we see mixed results with +ve and -ve (especially with increasing scalability). Also, the proposed patch is along with other optimization to use ut_delay as part of the latch acquisition process. The wholistic patch ensures no regression for read-only and gain of 5-12% for update/non-update use-cases (check MDEV-23635 for results using the wholistic patch). Perf profile for 10.2 and 10.4 should be quite different too.
  • Given this I would recommend that we should re-consider applying to the version+ (10.4+ onwards) that started reporting issues.
Comment by Marko Mäkelä [ 2020-09-01 ]

The equivalent patch for 10.4 would be as follows. It should apply to 10.3 as well, but krunalbauskar is right, the performance impact could be a mixed bag, and we’d better limit us to newer versions. Besides, within the scope of MDEV-14374 on 10.3, this alternative had been considered and rejected (on a different ARMv8 implementation):

diff --git a/include/my_cpu.h b/include/my_cpu.h
index b7d7008a8e3..0b51d3ef90f 100644
--- a/include/my_cpu.h
+++ b/include/my_cpu.h
@@ -53,6 +53,7 @@
 #ifdef _WIN32
 #elif defined HAVE_PAUSE_INSTRUCTION
 #elif defined(_ARCH_PWR8)
+#elif defined __GNUC__ && (defined __arm__ || defined __aarch64__)
 #else
 # include "my_atomic.h"
 #endif
@@ -80,6 +81,9 @@ static inline void MY_RELAX_CPU(void)
 #endif
 #elif defined(_ARCH_PWR8)
   __ppc_get_timebase();
+#elif defined __GNUC__ && (defined __arm__ || defined __aarch64__)
+  /* Mainly, prevent the compiler from optimizing away delay loops */
+  __asm__ __volatile__ ("":::"memory")
 #else
   int32 var, oldval = 0;
   my_atomic_cas32_strong_explicit(&var, &oldval, 1, MY_MEMORY_ORDER_RELAXED,

Comment by Krunal Bauskar [ 2020-09-04 ]
  • We tested the said patch with MDB-10.4.14 and observed a gain up to 12% for update workload. No regression observed with other workload. Graph for the same has been uploaded.
  • We tested the said patch with MDB-10.5.5 but given 10.5.5 has different performance profile patch fails to prove the value there (already MDB-10.5.5 is reporting TPS that is 2.5x times higher than MDB-10.4.14 so perf hot-spots are completely different though we are expecting more changes to 10.5.5 flushing so will keep 10.5.5 aspect open for now).
Generated at Thu Feb 08 09:23:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.