[MDEV-14374] UT_DELAY code : Removing hardware barrier for arm64 bit platform Created: 2017-11-13  Updated: 2020-09-04  Resolved: 2017-11-28

Status: Closed
Project: MariaDB Server
Component/s: Server
Affects Version/s: 10.3.2
Fix Version/s: 10.3.3

Type: Bug Priority: Major
Reporter: Sandeep sethia Assignee: Sergey Vojtovich
Resolution: Fixed Votes: 0
Labels: innodb
Environment:

Ubuntu 16.02 on Arm 64 bit platform


Issue Links:
Relates
relates to MDEV-23633 MY_RELAX_CPU performs unnecessary com... Closed
Epic Link: arm64 optimization

 Description   

I would like to propose a change to the existing spin loop code in the function ut_delay(). Currently, if threads are not able to get the lock, they spin for a while before going to sleep.
The code is given below:

for (i = 0; i < delay * 50; i++) {
        UT_RELAX_CPU();         /* hardware barrier */
        UT_COMPILER_BARRIER();  /* compiler barrier */
}

With the default innodb_spin_wait_delay=6, the loop executes 300 times, each iteration issuing a hardware barrier followed by a compiler barrier. I can understand the compiler barrier: it prevents the loop from being optimized away and gives a small delay.

# define UT_RELAX_CPU() do { \
        volatile int32 volatile_var; \
        int32 oldval= 0; \
        my_atomic_cas32(&volatile_var, &oldval, 1); \
   } while (0)
#endif

#if defined(__GNUC__)
# define UT_COMPILER_BARRIER() __asm__ __volatile__ ("":::"memory")
#endif

As ARM does not have a PAUSE instruction, it is not a wise idea to also loop on a hardware barrier to increase the delay. We can get the same delay if we spin around 200 times instead of the default 6. Currently I don't see any performance bottleneck with the existing code, but if there is a good improvement in the contention code in the future, the hardware barrier could cause unnecessary delay (barriers are supposed to be slow).

In MySQL/Percona only a compiler barrier is in place instead of a hardware barrier. I did some testing using sysbench update_index and update_non_index, and I see that the performance
of the default delay=6 with hardware barrier + compiler barrier is almost identical to delay=200 with only the compiler barrier.

Note: if we use only the compiler barrier with the default delay=6, I see a huge scalability issue from 64 threads onwards. It could be slowness of our contention code or the missing equivalent of the PAUSE instruction on the ARM platform.

As the same performance can be achieved with a configuration change, I would like to propose removing the hardware barrier from the spin loop code.

Please share your thoughts, and I request that you test the same.

I see a lot of discussion in the community on this issue already, but for PowerPC.



 Comments   
Comment by Sergey Vojtovich [ 2017-11-13 ]

I wonder what if we change

my_atomic_cas32(&volatile_var, &oldval, 1);

to

my_atomic_cas32_explicit(&volatile_var, &oldval, 1, MY_MEMORY_ORDER_RELAXED, MY_MEMORY_ORDER_RELAXED);

This should remove memory barriers, while preserving more or less long delay.

Comment by Sandeep sethia [ 2017-11-13 ]

Thanks for the quick response. Will update the numbers asap.

Comment by Sandeep sethia [ 2017-11-15 ]

I tried the approach you suggested and see that the performance is similar to what we observed using delay=200.
I used the function
I used the function

my_atomic_cas32_strong_explicit(&volatile_var, &oldval, 1, MY_MEMORY_ORDER_RELAXED, MY_MEMORY_ORDER_RELAXED);

This removes the memory barrier but preserves the long delay.

I submitted the patch on GitHub. Do let me know if you need any further details.

https://github.com/MariaDB/server/pull/494

Please review and let me know your comments.

Comment by Sandeep sethia [ 2017-11-21 ]

Sergey,

my_atomic_cas32_strong_explicit(&volatile_var, &oldval, 1, MY_MEMORY_ORDER_RELAXED, MY_MEMORY_ORDER_RELAXED);

Does the above function adjust the delay if the contention code improves? The configuration innodb_spin_wait_delay can be increased/decreased to get the optimum delay (currently I see 200 for ARM), but does that hold good for the above function as well?

Comment by Sergey Vojtovich [ 2017-11-22 ]

I was thinking about something like https://en.wikipedia.org/wiki/Test_and_test-and-set, which doesn't require any extra function calls. We just need to perform a certain number of spins before going to sleep. But it has to be benchmarked on different platforms first.

Comment by Sandeep sethia [ 2017-11-23 ]

Test-and-set is not scalable on the ARM platform. I benchmarked it, but the performance is not good compared to the compare-and-swap that is used in MariaDB.

Comment by Sergey Vojtovich [ 2017-11-23 ]

That's strange. CAS is by all means supposed to be more expensive than TAS. E.g. TAS could internally be implemented as CAS, but not vice versa.

Anyway, in MariaDB we can't switch to TAS, because we encode "waiters" flag into "lock_word". What I meant was something like this:

 do {
    while (lock_word != UNLOCKED) skip;       // spin until lock seems free
 } while (!CAS(lock_word, UNLOCKED, LOCKED)); // actual atomic locking

Comment by Sergey Vojtovich [ 2017-11-23 ]

But anyway, since you say performance is the same without memory barriers I can merge your PR.

Comment by Sandeep sethia [ 2017-11-24 ]

Will this be part of 10.3.3? When will the changes be reflected?

Comment by Sergey Vojtovich [ 2017-11-24 ]

ssethia, the patch is ready. It has to be benchmarked on different platforms before it gets pushed. So far I couldn't immediately see scalability regressions on a 2-socket/20-core Intel Broadwell.

Patch is in bb-10.3-svoj and is not that big (should be easily backportable to 10.2): https://github.com/MariaDB/server/commit/952b57d82ad78eea2a3013796a7652464262106a

It would be nice if you or axel could benchmark this on ARM. Different workloads (mostly read-write), different thread counts and compare to non-patched results.

danblack, same on Power8.

The idea of the patch is to never call ut_delay (instead spin on loading m_lock_word) and never call RNG.

Comment by Sergey Vojtovich [ 2017-11-24 ]

Forgot to mention that this patch only fixes issue in InnoDB mutex implementation. RW-locks need separate patch.

Comment by Sandeep sethia [ 2017-11-24 ]

Sergey,

I need to understand the patch. Are you doing a dynamic delay for each thread based on lock availability?

UT_RELAX_CPU() is still being called in the while loop.

This has the hardware barrier, which brings in the required delay that we already have for ARM.

I am not clear about the difference between the current code and this new patch.

Comment by Sandeep sethia [ 2017-11-24 ]

I think what's really needed is to measure the distribution of the number of CPU cycles the mutexes actually need to be delayed.

Comment by Sergey Vojtovich [ 2017-11-24 ]

ssethia, good catch! I missed updating UT_RELAX_CPU(). Pushed this addition to bb-10.3-svoj: https://github.com/MariaDB/server/commit/fabc3685d5cfa620dda56c237ee66cd1e65e113c

The delay should not be needed anymore: instead of updating a dummy variable, we load m_lock_word in a loop.

Comment by Sandeep sethia [ 2017-11-24 ]

Sergey,

I tried both patches and they do not work well on the ARM platform. I see a tps drop of 40% for 64, 128 and 256 threads in the sysbench oltp_update_index.lua test. I tested with 20 tables and 100000 records in each table.

I can share more details, but it is the same setup that I described in my email about the tps drop issue.

Comment by Sergey Vojtovich [ 2017-11-24 ]

ssethia, thanks for looking into this. I guess it may happen because the InnoDB rwlocks need a similar change. Now I have access to ARM hardware and can do some testing myself.

Comment by Sandeep sethia [ 2017-11-27 ]

Can you push my changes for the barrier, since your patch may require a lot of testing?

Comment by Sergey Vojtovich [ 2017-11-27 ]

ssethia, two things:
1. Either you or I should modify this patch as Marko suggested in the PR. If you take it, please also mention in a PR comment that you contribute this patch under the terms of the BSD-new license.
2. Could you confirm it didn't cause performance regression with default innodb-spin-wait-delay (which is 6)?

Comment by Sergey Vojtovich [ 2017-11-28 ]

Verified that it didn't cause regressions on ARM with default settings.
Applied Marko's version of the patch.

Generated at Thu Feb 08 08:13:03 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.