Details
Type: Task
Status: Closed
Priority: Minor
Resolution: Fixed
Description
MariaDB developers:
Here's a simple performance improvement I found in MariaDB (v5.5.31) while analyzing sysbench on my 4-node system.
It improves the sysbench oltp test by 3% to 17%, depending on the number of threads specified (and I'm sure there's some noise).
The patch is attached to this message. It reduces the memory accesses to the "spins" and "rng_state" fields of the my_pthread_fastmutex_t struct.
typedef struct st_my_pthread_fastmutex_t
{
  pthread_mutex_t mutex;
  uint spins;
  uint rng_state;
} my_pthread_fastmutex_t;
As I'm sure you know, the mutex in that struct is very hot. Since it's accessed by CPUs on all nodes, a lot of time is wasted tugging the cacheline back and forth between NUMA nodes.
I noticed the code repeatedly accesses the "spins" and "rng_state" fields while looping to get the mutex. Since those fields reside in the same cacheline as the mutex, and since they are accessed from all CPUs on all NUMA nodes, they were making the mutex slower (because they increased the cache-to-cache contention between nodes).
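The layout claim can be checked with a small sketch. It assumes a 64-byte cache line and a struct that starts on a line boundary (neither is stated in the report), and it spells "uint" as "unsigned int" so it compiles on its own. The helper names line_of and fields_share_line_with_mutex are made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines, common on x86-64 but not universal. */
#define ASSUMED_CACHE_LINE 64

/* Same field order as the struct quoted in the report. */
typedef struct st_my_pthread_fastmutex_t
{
  pthread_mutex_t mutex;
  unsigned int spins;
  unsigned int rng_state;
} my_pthread_fastmutex_t;

/* The two counters are laid out after the mutex, never before it. */
static_assert(offsetof(my_pthread_fastmutex_t, spins) >= sizeof(pthread_mutex_t),
              "spins is placed after the mutex");

/* Cache line index of a byte offset, assuming the struct is line-aligned. */
static size_t line_of(size_t offset)
{
  return offset / ASSUMED_CACHE_LINE;
}

/* Returns 1 if both counters start in the same line as the start of the mutex. */
int fields_share_line_with_mutex(void)
{
  size_t mutex_line = line_of(offsetof(my_pthread_fastmutex_t, mutex));
  return line_of(offsetof(my_pthread_fastmutex_t, spins)) == mutex_line &&
         line_of(offsetof(my_pthread_fastmutex_t, rng_state)) == mutex_line;
}
```

On x86-64 Linux with glibc, pthread_mutex_t is 40 bytes, so the two fields would sit at offsets 40 and 44 and the function would return 1. Other platforms can differ.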
My change is simply to keep the values for "spins" and "rng_state" in local variables (a register) as long as possible and only update their values in memory when necessary. I didn't change anything in the algorithm.
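As a rough, standalone sketch of that idea (not the attached patch itself), the lock loop below reads "spins" once, advances a local copy of "rng_state", and writes the copy back only when the function returns. FASTMUTEX_DELAY, its value, the fastmutex_t name, and the body of mutex_delay are stand-ins assumed for illustration; the RNG constants are the ones in the original code.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Stand-in for MY_PTHREAD_FASTMUTEX_DELAY; the value is an assumption. */
#define FASTMUTEX_DELAY 4u

typedef struct
{
  pthread_mutex_t mutex;
  unsigned int spins;
  unsigned int rng_state;
} fastmutex_t;

/* RNG step working on the caller's local copy instead of the shared struct. */
static double park_rng(unsigned int *rng_state)
{
  *rng_state = (unsigned int)(((unsigned long long)*rng_state * 279470273U) % 4294967291U);
  return *rng_state / 2147483647.0;
}

/* Hypothetical stand-in for mutex_delay(): burn a few iterations. */
static void mutex_delay(unsigned int delayloops)
{
  volatile unsigned int j;
  for (j = 0; j < delayloops * 50; j++)
  {
  }
}

int fastmutex_lock(fastmutex_t *mp)
{
  int res;
  unsigned int i;
  unsigned int maxdelay = FASTMUTEX_DELAY;
  assert(mp != NULL);
  const unsigned int spin_cnt = mp->spins;  /* read the shared field once */
  unsigned int rng_state = mp->rng_state;   /* work on a local copy */

  for (i = 0; i < spin_cnt; i++)
  {
    res = pthread_mutex_trylock(&mp->mutex);
    if (res == 0)
    {
      mp->rng_state = rng_state;            /* write back once, on exit */
      return 0;
    }
    if (res != EBUSY)
    {
      mp->rng_state = rng_state;
      return res;
    }
    mutex_delay(maxdelay);
    maxdelay += (unsigned int)(park_rng(&rng_state) * FASTMUTEX_DELAY + 1);
  }

  mp->rng_state = rng_state;
  return pthread_mutex_lock(&mp->mutex);
}
```

While a thread is spinning, it touches only the mutex word in the shared cache line. The two counters live in registers or on the stack until the function returns.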
The rest of this msg shows the improvement in sysbench transaction values for different thread counts.
Let me know if you have any questions. Since I'm not on the mailing list, please cc me on any reply.
Joe Mario
# sysbench --test=oltp --num-threads=12 --max-requests=1000000 --max-time=100 run

Thread cnt | 5.5.31-MariaDB transactions | 5.5.31-MariaDB-Modified transactions | Speedup |
---|---|---|---|
12 | 572694 (5726.83 per sec.) | 589543 (5895.34 per sec.) | 2.94% |
12 | 564215 (5642.05 per sec.) | 582254 (5822.43 per sec.) | 3.20% |
12 | 565231 (5652.21 per sec.) | 583228 (5832.19 per sec.) | 3.18% |
20 | 507300 (5072.82 per sec.) | 580229 (5802.09 per sec.) | 14.38% |
20 | 509373 (5093.60 per sec.) | 585629 (5856.09 per sec.) | 14.97% |
20 | 497711 (4976.89 per sec.) | 583506 (5834.94 per sec.) | 17.24% |
30 | 369979 (3699.66 per sec.) | 410698 (4106.74 per sec.) | 11.01% |
30 | 372194 (3721.70 per sec.) | 412884 (4128.65 per sec.) | 10.93% |
40 | 366285 (3662.60 per sec.) | 401050 (4010.23 per sec.) | 9.49% |
40 | 369626 (3696.02 per sec.) | 401913 (4018.88 per sec.) | 8.74% |
50 | 357529 (3574.99 per sec.) | 389759 (3897.25 per sec.) | 9.01% |
50 | 357116 (3570.83 per sec.) | 387115 (3870.80 per sec.) | 8.40% |
60 | 335427 (3353.88 per sec.) | 375134 (3750.91 per sec.) | 11.84% |
60 | 334128 (3340.90 per sec.) | 359116 (3590.78 per sec.) | 7.48% |
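
The speedup percentages appear to be the ratio of modified to baseline transaction counts, minus one. A minimal sketch of that arithmetic follows; the function name speedup_percent is made up for illustration.

```c
#include <assert.h>

/* Percent speedup of the modified build over the baseline,
   computed from total transaction counts of two equal-length runs. */
double speedup_percent(long base, long modified)
{
  assert(base > 0);
  return ((double)modified / (double)base - 1.0) * 100.0;
}
```

For example, speedup_percent(497711, 583506) comes out at about 17.24, the largest figure reported for 20 threads.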
I've attached the patch, since it got mangled when I tried to insert it here.
Joe
Attachments
Activity
Field | Original Value | New Value |
---|---|---|
Attachment | maria_perf.patch [ 23800 ] | |
Description |
Text as in the Description above, except that it read "The patch is appended at the end of this message" and ended with the patch inline:
diff -up BUILD/mariadb-5.5.31.bk/mysys/thr_mutex.c BUILD/mariadb-5.5.31/mysys/thr_mutex.c
--- BUILD/mariadb-5.5.31.bk/mysys/thr_mutex.c 2013-05-21 18:09:52.000000000 -0400
+++ BUILD/mariadb-5.5.31/mysys/thr_mutex.c 2013-09-25 15:59:44.554171774 -0400
@@ -886,10 +886,10 @@ int my_pthread_fastmutex_init(my_pthread
   Commun. ACM, October 1988, Volume 31, No 10, pages 1192-1201.
 */

-static double park_rng(my_pthread_fastmutex_t *mp)
+static double park_rng(uint *rng_state)
 {
-  mp->rng_state= ((my_ulonglong)mp->rng_state * 279470273U) % 4294967291U;
-  return (mp->rng_state / 2147483647.0);
+  *rng_state= ((my_ulonglong)*rng_state * 279470273U) % 4294967291U;
+  return (*rng_state / 2147483647.0);
 }

 int my_pthread_fastmutex_lock(my_pthread_fastmutex_t *mp)
@@ -897,20 +897,27 @@ int my_pthread_fastmutex_lock(my_pthread
   int res;
   uint i;
   uint maxdelay= MY_PTHREAD_FASTMUTEX_DELAY;
+  const uint spin_cnt = mp->spins;
+  uint rng_state = mp->rng_state;

-  for (i= 0; i < mp->spins; i++)
+  for (i= 0; i < spin_cnt; i++)
   {
     res= pthread_mutex_trylock(&mp->mutex);

-    if (res == 0)
+    if (res == 0) {
+      mp->rng_state = rng_state;
       return 0;
-
-    if (res != EBUSY)
+    }
+    if (res != EBUSY) {
+      mp->rng_state = rng_state;
       return res;
+    }

     mutex_delay(maxdelay);
-    maxdelay += park_rng(mp) * MY_PTHREAD_FASTMUTEX_DELAY + 1;
+    maxdelay += park_rng(&rng_state) * MY_PTHREAD_FASTMUTEX_DELAY + 1;
   }
+
+  mp->rng_state = rng_state;
   return pthread_mutex_lock(&mp->mutex);
 }
 |
Text as in the Description above. |
Assignee | Sergey Vojtovich [ svoj ] |
Attachment | mdev5081.patch [ 23803 ] |
Attachment | mdev5081.pdf [ 23810 ] |
Attachment | reply_to_sergey.txt [ 23811 ] |
Comment |
[ Hi Sergey: See the attached file for my answers to your questions. Attaching it seems to preserve cleaner formatting than inline text. Let me know if you still have questions. Joe ] |
Attachment | base_vs_joechanges.txt [ 23901 ] |
Labels | MariaDB_5.5 |
Fix Version/s | 10.0.9 [ 14400 ] |
Priority | Trivial [ 5 ] | Major [ 3 ] |
Priority | Major [ 3 ] | Minor [ 4 ] |
Assignee | Sergey Vojtovich [ svoj ] | Sergei Golubchik [ serg ] |
Fix Version/s | 10.0.10 [ 14500 ] | |
Fix Version/s | 10.0.9 [ 14400 ] |
Fix Version/s | 10.0.11 [ 15200 ] | |
Fix Version/s | 10.0.10 [ 14500 ] |
Assignee | Sergei Golubchik [ serg ] | Sergey Vojtovich [ svoj ] |
Fix Version/s | 5.5.38 [ 15400 ] | |
Resolution | Fixed [ 1 ] | |
Status | Open [ 1 ] | Closed [ 6 ] |
Workflow | defaullt [ 29201 ] | MariaDB v2 [ 43925 ] |
Workflow | MariaDB v2 [ 43925 ] | MariaDB v3 [ 64202 ] |
Workflow | MariaDB v3 [ 64202 ] | MariaDB v4 [ 132199 ] |
MariaDB developers:
This is a simple change that got up to a 17% speedup with sysbench on my 4 node system.
How do I find out if someone will pick this up to run with it?
Thanks,
Joe Mario