[MDEV-14374] UT_DELAY code : Removing hardware barrier for arm64 bit platform Created: 2017-11-13 Updated: 2020-09-04 Resolved: 2017-11-28 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Server |
| Affects Version/s: | 10.3.2 |
| Fix Version/s: | 10.3.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Sandeep sethia | Assignee: | Sergey Vojtovich |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | innodb |
| Environment: | Ubuntu 16.02 on Arm 64-bit platform |
| Epic Link: | arm64 optimization |
| Description |
|
I would like to propose a change to the existing spin loop code in the function ut_delay(). Currently, if threads cannot acquire the lock, they spin for a while before going to sleep.

With the default innodb_spin_wait_delay=6, the code loops 300 times over a hardware barrier followed by a compiler barrier. The compiler barrier makes sense: it prevents the function from being optimized away and provides a small delay.

Since ARM has no PAUSE instruction, it is not a good idea to also loop on a hardware barrier just to increase the delay. We can get the same delay by spinning around 200 times instead of the default 6. I currently see no performance bottleneck with the existing code, but if the contention code improves in the future, the hardware barrier could cause unnecessary delay (barriers are supposed to be slow). MySQL/Percona use only a compiler barrier instead of a hardware barrier. I did some testing using sysbench update_index and update_non_index and observed the performance.

Note: if we use only a compiler barrier with the default delay=6, I see a huge scalability issue from 64 threads onwards. This could be slowness of our contention code, or the missing equivalent of the PAUSE instruction on the ARM platform.

Since the same performance can be achieved with a configuration change, I would like to propose removing the hardware barrier from the spin loop code. Please share your thoughts, and I request that you test the same. I see a lot of discussion in the community on this issue already, but for PowerPC. |
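The distinction the description draws (compiler barrier vs. hardware barrier in a spin delay) can be sketched as follows. This is an illustrative delay loop, not the actual InnoDB ut_delay(); the function name and iteration counts are assumptions for illustration only.

```cpp
#include <cstdint>

// Illustrative busy-wait delay using only a compiler barrier.
// The barrier forbids the compiler from optimizing the loop away,
// but emits no instruction. On x86 a PAUSE instruction would normally
// sit inside the loop; ARM (at the time of this report) has no
// equivalent, so the reporter proposes compensating with a much larger
// spin count (around 200 instead of the default 6) rather than
// looping over a hardware memory barrier.
static uint64_t delay_compiler_barrier_only(uint64_t iterations)
{
  uint64_t spins = 0;
  for (uint64_t i = 0; i < iterations; i++)
  {
    // Compiler barrier: prevents reordering/elimination, costs no cycles itself.
    asm volatile("" ::: "memory");
    spins++;
  }
  return spins;
}
```

The design point is that the delay should come from the loop iterations themselves, not from the latency of a barrier instruction whose cost varies across platforms.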
| Comments |
| Comment by Sergey Vojtovich [ 2017-11-13 ] | |||
|
I wonder what if we change
to
This should remove the memory barriers while preserving a more or less equally long delay. | |||
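Based on the follow-up comment, the suggested change was presumably to a relaxed-ordering CAS. A sketch of the idea, using GCC `__atomic` builtins in place of the MariaDB `my_atomic_*` wrappers (that mapping is an assumption made here for self-containment):

```cpp
#include <cstdint>

// Sketch only: approximates
//   my_atomic_cas32_strong_explicit(&volatile_var, &oldval, 1,
//                                   MY_MEMORY_ORDER_RELAXED,
//                                   MY_MEMORY_ORDER_RELAXED);
// A CAS with relaxed ordering on both success and failure still performs
// the atomic read-modify-write (so the spin body cannot be optimized away
// and still costs real cycles), but emits no memory-barrier semantics.
static int32_t volatile_var = 0;

static void relax_cpu_no_barrier()
{
  int32_t oldval = 0;
  __atomic_compare_exchange_n(&volatile_var, &oldval, 1,
                              /* weak= */ false,
                              __ATOMIC_RELAXED, __ATOMIC_RELAXED);
}
```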
| Comment by Sandeep sethia [ 2017-11-13 ] | |||
|
Thanks for the quick response. Will update the numbers asap. | |||
| Comment by Sandeep sethia [ 2017-11-15 ] | |||
|
I tried the approach you suggested and see that the performance is similar to what we observed using sleep=200: my_atomic_cas32_strong_explicit(&volatile_var, &oldval, 1, MY_MEMORY_ORDER_RELAXED, MY_MEMORY_ORDER_RELAXED);. This removes the memory barrier but preserves the long delay. I submitted the patch via GitHub: https://github.com/MariaDB/server/pull/494. Please review and let me know your comments. Do let me know if you need any further details. | |||
| Comment by Sandeep sethia [ 2017-11-21 ] | |||
|
Sergey, regarding my_atomic_cas32_strong_explicit(&volatile_var, &oldval, 1, MY_MEMORY_ORDER_RELAXED, MY_MEMORY_ORDER_RELAXED): does this function adjust the delay if the contention code improves? The innodb_sleep configuration can be increased/decreased to get the optimum delay (right now I see 200 for ARM), but does that hold good for the above function as well? | |||
| Comment by Sergey Vojtovich [ 2017-11-22 ] | |||
|
I was thinking about something like https://en.wikipedia.org/wiki/Test_and_test-and-set, which doesn't require any extra function calls. We just need to perform a certain number of spins before going to sleep. But it has to be benchmarked on different platforms first. | |||
| Comment by Sandeep sethia [ 2017-11-23 ] | |||
|
Test-and-set is not scalable on the ARM platform. I benchmarked it, but the performance is not as good as the compare-and-swap used in MariaDB. | |||
| Comment by Sergey Vojtovich [ 2017-11-23 ] | |||
|
That's strange. CAS is by all means supposed to be more expensive than TAS. E.g. TAS could internally be implemented as CAS, but not vice versa. Anyway, in MariaDB we can't switch to TAS, because we encode a "waiters" flag into "lock_word". What I meant was something like this:
| |||
| Comment by Sergey Vojtovich [ 2017-11-23 ] | |||
|
But anyway, since you say performance is the same without memory barriers, I can merge your PR. | |||
| Comment by Sandeep sethia [ 2017-11-24 ] | |||
|
Will this be part of 10.3.3? When will the changes be reflected? | |||
| Comment by Sergey Vojtovich [ 2017-11-24 ] | |||
|
ssethia, the patch is ready, but it has to be benchmarked on different platforms before it gets pushed. So far I couldn't see any immediate scalability regressions on a 2-socket/20-core Intel Broadwell. The patch is in bb-10.3-svoj and is not that big (should be easily backportable to 10.2): https://github.com/MariaDB/server/commit/952b57d82ad78eea2a3013796a7652464262106a It would be nice if you or axel could benchmark this on ARM: different workloads (mostly read-write), different thread counts, compared against non-patched results. danblack, same on Power8. The idea of the patch is to never call ut_delay (instead spin on loading m_lock_word) and never call the RNG. | |||
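The idea described in this comment (spin on loading m_lock_word, no ut_delay, no RNG) might look roughly like the following. This is a sketch under stated assumptions, not the actual commit 952b57d: the struct name, method names, and spin budget are all hypothetical.

```cpp
#include <atomic>
#include <cstdint>

// Sketch of the patch idea: the mutex enter path spins by loading the
// lock word directly until it looks free (or the spin budget runs out),
// then attempts a single CAS. No call to ut_delay(), no RNG-derived
// delay count.
struct simple_mutex
{
  std::atomic<int32_t> m_lock_word{0};

  // Returns true if the mutex was acquired within max_spins iterations.
  bool try_spin_enter(uint32_t max_spins)
  {
    for (uint32_t spins = 0; spins < max_spins; spins++)
    {
      if (m_lock_word.load(std::memory_order_relaxed) != 0)
        continue;                 // still held: keep loading, no CAS, no delay
      int32_t expected = 0;
      if (m_lock_word.compare_exchange_strong(expected, 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed))
        return true;              // acquired
    }
    return false;                 // caller would now go to sleep and wait
  }

  void exit() { m_lock_word.store(0, std::memory_order_release); }
};
```

As with TTAS, the atomic read-modify-write is only attempted when the load suggests it can succeed, so a contended lock word is mostly read from the local cache rather than bounced between cores.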
| Comment by Sergey Vojtovich [ 2017-11-24 ] | |||
|
Forgot to mention that this patch only fixes issue in InnoDB mutex implementation. RW-locks need separate patch. | |||
| Comment by Sandeep sethia [ 2017-11-24 ] | |||
|
Sergey, I need to understand the patch. Are you doing a dynamic delay for each thread based on lock availability? UT_RELAX_CPU() is still being called in the while loop. That has the hardware barrier which brings in the required delay, which we already have for ARM. I am not clear about the difference between the current code and this new patch. | |||
| Comment by Sandeep sethia [ 2017-11-24 ] | |||
|
I think what's really needed is to measure the distribution of the number of CPU cycles the mutexes actually need to be delayed. | |||
| Comment by Sergey Vojtovich [ 2017-11-24 ] | |||
|
ssethia, good catch! I forgot to update UT_RELAX_CPU(). Pushed this addition to bb-10.3-svoj: https://github.com/MariaDB/server/commit/fabc3685d5cfa620dda56c237ee66cd1e65e113c The delay should not be needed anymore: instead of updating a dummy variable, we load m_lock_word in a loop. | |||
| Comment by Sandeep sethia [ 2017-11-24 ] | |||
|
Sergey, I tried both patches and they do not work well on the ARM platform. I see TPS drop by 40% for 64, 128 and 256 threads in the sysbench oltp_update_index.lua test. I tested with 20 tables with 100000 records in each table. I can share more details, but it is the same setup I described in the email I sent you about the TPS drop issue. | |||
| Comment by Sergey Vojtovich [ 2017-11-24 ] | |||
|
ssethia, thanks for looking into this. I guess it may happen because the InnoDB rwlocks need a similar change. I now have access to ARM hardware and can do some testing myself. | |||
| Comment by Sandeep sethia [ 2017-11-27 ] | |||
|
Can you push my changes for the barrier, since your patch may require a lot of testing? | |||
| Comment by Sergey Vojtovich [ 2017-11-27 ] | |||
|
ssethia, two things: | |||
| Comment by Sergey Vojtovich [ 2017-11-28 ] | |||
|
Verified that it didn't cause regressions on ARM with default settings. |