Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-23633

MY_RELAX_CPU performs unnecessary compare-and-swap on ARM

Details

    Description

      This is basically a re-request to fix MDEV-14374 as it was described. On target architectures where nothing special is available (not IA-32, AMD64, POWER), the macro UT_RELAX_CPU() would perform a dummy compare-and-swap operation.

      According to krunalbauskar, this extra operation is visible in some benchmarks.

      Attachments

        Issue Links

          Activity

            Marko,

            On ARM as quoted on channel as per benchmarking switching to simple barrier (vs CAS) helps improve performance.
            For other architecture, we should benchmark it and then decide.

            krunalbauskar Krunal Bauskar added a comment - Marko, On ARM as quoted on channel as per benchmarking switching to simple barrier (vs CAS) helps improve performance. For other architecture, we should benchmark it and then decide.

            krunalbauskar, thank you. I think that I would make it conditional to minimize the impact on other architectures:

            diff --git a/storage/innobase/include/ut0ut.h b/storage/innobase/include/ut0ut.h
            index a19f3db188d..72a9754c326 100644
            --- a/storage/innobase/include/ut0ut.h
            +++ b/storage/innobase/include/ut0ut.h
            @@ -71,6 +71,9 @@ Created 1/20/1994 Heikki Tuuri
             #elif defined(__powerpc__) && defined __GLIBC__
             # include <sys/platform/ppc.h>
             # define UT_RELAX_CPU() __ppc_get_timebase()
            +#elif defined __GNUC__ && (defined __arm__ || defined __aarch64__)
            +/* Mainly, prevent the compiler from optimizing away delay loops */
            +# define UT_RELAX_CPU() __asm__ __volatile__ ("":::"memory")
             #else
             # define UT_RELAX_CPU() do { \
                  volatile int32    volatile_var; \
            

            marko Marko Mäkelä added a comment - krunalbauskar , thank you. I think that I would make it conditional to minimize the impact on other architectures: diff --git a/storage/innobase/include/ut0ut.h b/storage/innobase/include/ut0ut.h index a19f3db188d..72a9754c326 100644 --- a/storage/innobase/include/ut0ut.h +++ b/storage/innobase/include/ut0ut.h @@ -71,6 +71,9 @@ Created 1/20/1994 Heikki Tuuri #elif defined(__powerpc__) && defined __GLIBC__ # include <sys/platform/ppc.h> # define UT_RELAX_CPU() __ppc_get_timebase() +#elif defined __GNUC__ && (defined __arm__ || defined __aarch64__) +/* Mainly, prevent the compiler from optimizing away delay loops */ +# define UT_RELAX_CPU() __asm__ __volatile__ ("":::"memory") #else # define UT_RELAX_CPU() do { \ volatile int32 volatile_var; \
            • We tried evaluating the patch against 10.2 and we see mixed results with +ve and -ve (especially with increasing scalability). Also, the proposed patch is along with other optimization to use ut_delay as part of the latch acquisition process. The wholistic patch ensures no regression for read-only and gain of 5-12% for update/non-update use-cases (check MDEV-23635 for results using the wholistic patch). Perf profile for 10.2 and 10.4 should be quite different too.
            • Given this I would recommend that we should re-consider applying to the version+ (10.4+ onwards) that started reporting issues.
            krunalbauskar Krunal Bauskar added a comment - We tried evaluating the patch against 10.2 and we see mixed results with +ve and -ve (especially with increasing scalability). Also, the proposed patch is along with other optimization to use ut_delay as part of the latch acquisition process. The wholistic patch ensures no regression for read-only and gain of 5-12% for update/non-update use-cases (check MDEV-23635 for results using the wholistic patch). Perf profile for 10.2 and 10.4 should be quite different too. Given this I would recommend that we should re-consider applying to the version+ (10.4+ onwards) that started reporting issues.

            The equivalent patch for 10.4 would be as follows. It should apply to 10.3 as well, but krunalbauskar is right, the performance impact could be a mixed bag, and we’d better limit us to newer versions. Besides, within the scope of MDEV-14374 on 10.3, this alternative had been considered and rejected (on a different ARMv8 implementation):

            diff --git a/include/my_cpu.h b/include/my_cpu.h
            index b7d7008a8e3..0b51d3ef90f 100644
            --- a/include/my_cpu.h
            +++ b/include/my_cpu.h
            @@ -53,6 +53,7 @@
             #ifdef _WIN32
             #elif defined HAVE_PAUSE_INSTRUCTION
             #elif defined(_ARCH_PWR8)
            +#elif defined __GNUC__ && (defined __arm__ || defined __aarch64__)
             #else
             # include "my_atomic.h"
             #endif
            @@ -80,6 +81,9 @@ static inline void MY_RELAX_CPU(void)
             #endif
             #elif defined(_ARCH_PWR8)
               __ppc_get_timebase();
            +#elif defined __GNUC__ && (defined __arm__ || defined __aarch64__)
            +  /* Mainly, prevent the compiler from optimizing away delay loops */
            +  __asm__ __volatile__ ("":::"memory")
             #else
               int32 var, oldval = 0;
               my_atomic_cas32_strong_explicit(&var, &oldval, 1, MY_MEMORY_ORDER_RELAXED,
            

            marko Marko Mäkelä added a comment - The equivalent patch for 10.4 would be as follows. It should apply to 10.3 as well, but krunalbauskar is right, the performance impact could be a mixed bag, and we’d better limit us to newer versions. Besides, within the scope of MDEV-14374 on 10.3, this alternative had been considered and rejected (on a different ARMv8 implementation): diff --git a/include/my_cpu.h b/include/my_cpu.h index b7d7008a8e3..0b51d3ef90f 100644 --- a/include/my_cpu.h +++ b/include/my_cpu.h @@ -53,6 +53,7 @@ #ifdef _WIN32 #elif defined HAVE_PAUSE_INSTRUCTION #elif defined(_ARCH_PWR8) +#elif defined __GNUC__ && (defined __arm__ || defined __aarch64__) #else # include "my_atomic.h" #endif @@ -80,6 +81,9 @@ static inline void MY_RELAX_CPU(void) #endif #elif defined(_ARCH_PWR8) __ppc_get_timebase(); +#elif defined __GNUC__ && (defined __arm__ || defined __aarch64__) + /* Mainly, prevent the compiler from optimizing away delay loops */ + __asm__ __volatile__ ("":::"memory") #else int32 var, oldval = 0; my_atomic_cas32_strong_explicit(&var, &oldval, 1, MY_MEMORY_ORDER_RELAXED,
            • We tested the said patch with MDB-10.4.14 and observed a gain up to 12% for update workload. No regression observed with other workload. Graph for the same has been uploaded.
            • We tested the said patch with MDB-10.5.5 but given 10.5.5 has different performance profile patch fails to prove the value there (already MDB-10.5.5 is reporting TPS that is 2.5x times higher than MDB-10.4.14 so perf hot-spots are completely different though we are expecting more changes to 10.5.5 flushing so will keep 10.5.5 aspect open for now).
            krunalbauskar Krunal Bauskar added a comment - We tested the said patch with MDB-10.4.14 and observed a gain up to 12% for update workload. No regression observed with other workload. Graph for the same has been uploaded. We tested the said patch with MDB-10.5.5 but given 10.5.5 has different performance profile patch fails to prove the value there (already MDB-10.5.5 is reporting TPS that is 2.5x times higher than MDB-10.4.14 so perf hot-spots are completely different though we are expecting more changes to 10.5.5 flushing so will keep 10.5.5 aspect open for now).

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.