MariaDB Server / MDEV-5081

Simple performance improvement for MariaDB

Details

    • Type: Task
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 5.5.38, 10.0.11

    Description

      MariaDB developers:
      Here's a simple performance improvement I found in MariaDB (v5.5.31) while analyzing sysbench on my 4-node system.
      It improves the sysbench oltp test by 3% to 17%, depending on the number of threads specified (and I'm sure there's some noise).

      The patch is attached to this message. It reduces the memory accesses to the "spins" and "rng_state" fields of the my_pthread_fastmutex_t struct.

      typedef struct st_my_pthread_fastmutex_t
      {
        pthread_mutex_t mutex;
        uint spins;
        uint rng_state;
      } my_pthread_fastmutex_t;

      As I'm sure you know, the mutex in that struct is very hot. Since it's accessed by CPUs on all nodes, a lot of time is wasted tugging the cacheline back and forth between NUMA nodes.

      I noticed the code repeatedly accesses the "spins" and "rng_state" fields while looping to acquire the mutex. Since those fields reside in the same cacheline as the mutex, and since their accesses come from CPUs on all NUMA nodes, they were contributing to making the mutex slower (because they increased the cache-to-cache contention between nodes).

      My change simply keeps the values of "spins" and "rng_state" in local variables (registers) as long as possible and only updates them in memory when necessary. I didn't change anything in the algorithm.
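
      For readers without the attachment, here is a cleaned-up view of the two affected functions after the change, reconstructed from the attached maria_perf.patch (the attachment remains the authoritative copy). The local copies of spins and rng_state keep the spin loop from repeatedly touching the contended cacheline:

      static double park_rng(uint *rng_state)
      {
        *rng_state= ((my_ulonglong)*rng_state * 279470273U) % 4294967291U;
        return (*rng_state / 2147483647.0);
      }

      int my_pthread_fastmutex_lock(my_pthread_fastmutex_t *mp)
      {
        int res;
        uint i;
        uint maxdelay= MY_PTHREAD_FASTMUTEX_DELAY;
        const uint spin_cnt = mp->spins;   /* read the hot cacheline once ...    */
        uint rng_state = mp->rng_state;    /* ... and spin on local copies only  */

        for (i= 0; i < spin_cnt; i++)
        {
          res= pthread_mutex_trylock(&mp->mutex);

          if (res == 0) {
            mp->rng_state = rng_state;     /* write back only on the way out */
            return 0;
          }
          if (res != EBUSY) {
            mp->rng_state = rng_state;
            return res;
          }

          mutex_delay(maxdelay);
          maxdelay += park_rng(&rng_state) * MY_PTHREAD_FASTMUTEX_DELAY + 1;
        }

        mp->rng_state = rng_state;
        return pthread_mutex_lock(&mp->mutex);
      }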

      The rest of this message shows the improvement in sysbench transaction numbers for different thread counts.

      Let me know if you have any questions. Since I'm not on the mailing list, please cc me on any reply.

      Joe Mario

      # sysbench --test=oltp --num-threads=12 --max-requests=1000000 --max-time=100 run

      Columns: 5.5.31-MariaDB (baseline), then 5.5.31-MariaDB-Modified, then the speedup.
      Thread cnt:12
      transactions: 572694 (5726.83 per sec.) 589543 (5895.34 per sec.) 2.94% speedup.
      transactions: 564215 (5642.05 per sec.) 582254 (5822.43 per sec.) 3.20% speedup.
      transactions: 565231 (5652.21 per sec.) 583228 (5832.19 per sec.) 3.18% speedup.

      Thread cnt:20
      transactions: 507300 (5072.82 per sec.) 580229 (5802.09 per sec.) 14.38% speedup.
      transactions: 509373 (5093.60 per sec.) 585629 (5856.09 per sec.) 14.97% speedup.
      transactions: 497711 (4976.89 per sec.) 583506 (5834.94 per sec.) 17.24% speedup.

      Thread cnt:30
      transactions: 369979 (3699.66 per sec.) 410698 (4106.74 per sec.) 11.01% speedup.
      transactions: 372194 (3721.70 per sec.) 412884 (4128.65 per sec.) 10.93% speedup.

      Thread cnt:40
      transactions: 366285 (3662.60 per sec.) 401050 (4010.23 per sec.) 9.49% speedup.
      transactions: 369626 (3696.02 per sec.) 401913 (4018.88 per sec.) 8.74% speedup.

      Thread cnt:50
      transactions: 357529 (3574.99 per sec.) 389759 (3897.25 per sec.) 9.01% speedup.
      transactions: 357116 (3570.83 per sec.) 387115 (3870.80 per sec.) 8.40% speedup.

      Thread cnt:60
      transactions: 335427 (3353.88 per sec.) 375134 (3750.91 per sec.) 11.84% speedup.
      transactions: 334128 (3340.90 per sec.) 359116 (3590.78 per sec.) 7.48% speedup.
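
      (For reference, the speedup column is just the relative difference of the two rates: for the first 12-thread run, (5895.34 - 5726.83) / 5726.83 ≈ 0.0294, i.e. the reported 2.94%.)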

      I've attached the patch, since it got mangled when I tried to insert it here.

      Joe

      Attachments

        1. maria_perf.patch
          2 kB
        2. mdev5081.patch
          2 kB
        3. mdev5081.pdf
          16 kB
        4. reply_to_sergey.txt
          5 kB
        5. base_vs_joechanges.txt
          4 kB

        Activity

          JoeMario Joe Mario created issue -
          JoeMario Joe Mario made changes -
          Attachment: maria_perf.patch [ 23800 ]
          Description: edited to reference the attached maria_perf.patch instead of the inlined patch, which had been mangled by the issue tracker. The originally inlined diff:

diff -up BUILD/mariadb-5.5.31.bk/mysys/thr_mutex.c BUILD/mariadb-5.5.31/mysys/thr_mutex.c
--- BUILD/mariadb-5.5.31.bk/mysys/thr_mutex.c 2013-05-21 18:09:52.000000000 -0400
+++ BUILD/mariadb-5.5.31/mysys/thr_mutex.c 2013-09-25 15:59:44.554171774 -0400
@@ -886,10 +886,10 @@ int my_pthread_fastmutex_init(my_pthread
   Commun. ACM, October 1988, Volume 31, No 10, pages 1192-1201.
 */
 
-static double park_rng(my_pthread_fastmutex_t *mp)
+static double park_rng(uint *rng_state)
 {
-  mp->rng_state= ((my_ulonglong)mp->rng_state * 279470273U) % 4294967291U;
-  return (mp->rng_state / 2147483647.0);
+  *rng_state= ((my_ulonglong)*rng_state * 279470273U) % 4294967291U;
+  return (*rng_state / 2147483647.0);
 }
 
 int my_pthread_fastmutex_lock(my_pthread_fastmutex_t *mp)
@@ -897,20 +897,27 @@ int my_pthread_fastmutex_lock(my_pthread
   int res;
   uint i;
   uint maxdelay= MY_PTHREAD_FASTMUTEX_DELAY;
+  const uint spin_cnt = mp->spins;
+  uint rng_state = mp->rng_state;
 
-  for (i= 0; i < mp->spins; i++)
+  for (i= 0; i < spin_cnt; i++)
   {
     res= pthread_mutex_trylock(&mp->mutex);
 
-    if (res == 0)
+    if (res == 0) {
+      mp->rng_state = rng_state;
       return 0;
-
-    if (res != EBUSY)
+    }
+    if (res != EBUSY) {
+      mp->rng_state = rng_state;
       return res;
+    }
 
     mutex_delay(maxdelay);
-    maxdelay += park_rng(mp) * MY_PTHREAD_FASTMUTEX_DELAY + 1;
+    maxdelay += park_rng(&rng_state) * MY_PTHREAD_FASTMUTEX_DELAY + 1;
   }
+
+  mp->rng_state = rng_state;
   return pthread_mutex_lock(&mp->mutex);
 }

          JoeMario Joe Mario added a comment -

          MariaDB developers:
          This is a simple change that got up to a 17% speedup with sysbench on my 4 node system.
          How do I find out if someone will pick this up to run with it?

          Thanks,
          Joe Mario


          thatsafunnyname Peter (Stig) Edwards added a comment -

          Hello Joe,
          I am just a passer-by, but I suspect that some subscribers to the maria-developers mailing list:
          https://lists.launchpad.net/maria-developers
          would be interested in the change. Maybe try posting there "Review MDEV-5081 my_pthread_fastmutex_lock patch, sysbench oltp gains for many threads on NUMA".
          Thanks and good luck.
          svoj Sergey Vojtovich made changes -
          Assignee Sergey Vojtovich [ svoj ]

          thatsafunnyname Peter (Stig) Edwards added a comment -

          I was wondering what was being used to analyze sysbench, and I found these resources helpful:
          http://developerblog.redhat.com/2013/08/27/numa-hurt-app-perf/
          http://developerblog.redhat.com/2013/05/31/dive-deeper-in-numa-systems/
          Thank you Joe (and Don).
          svoj Sergey Vojtovich made changes -
          Attachment mdev5081.patch [ 23803 ]

          svoj Sergey Vojtovich added a comment -

          Joe, thanks for your contribution! I reviewed your analysis and your patch and came up with one small extension: I believe there is not much sense in randomizing the timeout value, so I removed rng_state entirely.

          Spin locks on NUMA might not be that good an idea. I'd like to do some extra benchmarking with fast mutexes disabled.
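
          For context, a sketch of what the lock loop looks like once the randomized increment is dropped. This is based on the comments in this issue, not on the attached mdev5081.patch itself, which remains the authoritative version:

          int my_pthread_fastmutex_lock(my_pthread_fastmutex_t *mp)
          {
            int res;
            uint i;
            uint maxdelay= MY_PTHREAD_FASTMUTEX_DELAY;
            const uint spin_cnt = mp->spins;   /* still read the shared field only once */

            for (i= 0; i < spin_cnt; i++)
            {
              res= pthread_mutex_trylock(&mp->mutex);
              if (res == 0)
                return 0;
              if (res != EBUSY)
                return res;
              mutex_delay(maxdelay);
              maxdelay += MY_PTHREAD_FASTMUTEX_DELAY;   /* fixed step instead of park_rng(); rng_state is gone */
            }
            return pthread_mutex_lock(&mp->mutex);
          }
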
          JoeMario Joe Mario added a comment -

          Hi Peter and Sergey:
          Glad to be of help.

          Sergey:
          Do you want me to try your patch on sysbench on our 4-node server? Or did you already try it?

          Joe


          svoj Sergey Vojtovich added a comment -

          Hi Joe,

          we will try it next week. If you have resources and want to test my patch - feel free to do so. Your feedback will be valuable.
          JoeMario Joe Mario added a comment -

          Hi Sergey:
          I did a build with your patch and compared it with my patch. This was on our 4-node server, with the database located on an SSD drive. I used the same sysbench command that I posted in my opening post.

          Here's what I found:
          12 Threads: Avg: 3.7% speedup over 3 runs
          20 Threads: Avg: 15.5% slowdown over 3 runs
          30 Threads: Avg: 15.5% slowdown over 2 runs
          40 Threads: Avg: 10.6% slowdown over 2 runs
          50 Threads: Avg: 4.3% slowdown over 1 run
          60 Threads: Avg: 1.2% slowdown over 1 run

          So something about that random backoff does help at the higher thread counts.
          However, here's some more info. I see you deleted the "now unused" rng_state field from the struct. In the many iterations I did leading up to my patch, I noticed that when I tried to change the size of the struct (either by adding padding or moving things around), I too got a speedup at the lower thread counts and slowdowns at the higher thread counts.

          I'm not a regular database user, and I don't know if 12 threads is the sweet spot you're aiming to speed up. I suspect scalability at higher thread counts is important to you - but please confirm.

          If you want, I can try again putting that hot mutex in its own aligned cacheline and pad it out so that nothing else conflicts with it. I can try it both with and without the spins field in it. I had mixed results earlier, but never with the rng_state deleted.
          Joe

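          For illustration, one hypothetical way to do what Joe describes above (give the hot mutex its own padded, aligned cacheline). This is not from any attached patch; it assumes 64-byte cachelines and GCC-style attributes:

          /* Hypothetical sketch only: the mutex gets a full 64-byte line to
             itself, so spins (and neighbouring allocations) never share a
             cacheline with it.  Requires sizeof(pthread_mutex_t) < 64. */
          typedef struct st_my_pthread_fastmutex_t
          {
            pthread_mutex_t mutex;
            char pad[64 - sizeof(pthread_mutex_t)];  /* fill the rest of the mutex line */
            uint spins;                              /* colder fields start a new line  */
            uint rng_state;
          } __attribute__((aligned(64))) my_pthread_fastmutex_t;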

          svoj Sergey Vojtovich added a comment -

          Hi Joe,

          we're aiming to provide sane performance in all cases, and according to the numbers your patch wins.

          On my 64-bit host, structure sizes are not affected by the removal of rng_state (due to 8-byte alignment, I guess):
          sizeof(mysql_mutex_t)= 56, sizeof(my_pthread_fastmutex_t)= 48, sizeof(pthread_mutex_t)= 40

          Did the structure size change on your 4-node server?

          Previously the maximum possible increment for maxdelay was MY_PTHREAD_FASTMUTEX_DELAY (which is 4). I used this maximum value. Probably we should use a lower number and increment maxdelay by 1 instead?

          Also I had a look at the pthread_mutex_lock() source and found out that it spins if the mutex type is PTHREAD_MUTEX_ADAPTIVE_NP (which is the case with fast mutexes). In other words, we have double spinning.
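
          The "double spinning" above comes from the fast mutex being initialized with the adaptive type, so the pthread_mutex_lock() fallback spins again in userspace before sleeping. A generic glibc illustration of that initialization (an assumption-labelled sketch, not the exact MariaDB my_pthread_fastmutex_init() code):

          #define _GNU_SOURCE             /* PTHREAD_MUTEX_ADAPTIVE_NP is a glibc extension */
          #include <pthread.h>

          static int init_adaptive_mutex(pthread_mutex_t *m)
          {
            pthread_mutexattr_t attr;
            int res;

            if ((res= pthread_mutexattr_init(&attr)))
              return res;
            /* For this type, glibc's pthread_mutex_lock() spins briefly in
               userspace before falling back to a futex wait. */
            pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
            res= pthread_mutex_init(m, &attr);
            pthread_mutexattr_destroy(&attr);
            return res;
          }
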
          JoeMario Joe Mario added a comment -

          Hi Sergey:
          Let me redo my testing that I did yesterday. I want to double check my steps as I may have made a mistake. Give me a day or two to post back with any updated results.
          Joe

          JoeMario Joe Mario added a comment -

          Hi Sergey:
          It's going to be a while before I get back to this. I did a couple of runs - enough to know that your change is within the noise level. But I didn't test it as thoroughly as I usually do, since I was pulled away by other interruptions. So go ahead and use your enhanced version.

          In my last post, when I said I made a mistake, I realized I grabbed the last rpmbuild in the shell history, and it wasn't using the spec file I thought it used. Sorry about the confusion.

          And I agree with you about the struct size after rng_state is deleted.

          If I get a chance to get back and do more testing with this, I will.
          Joe

          svoj Sergey Vojtovich made changes -
          Attachment mdev5081.pdf [ 23810 ]

          svoj Sergey Vojtovich added a comment -

          Hi Joe,

          thanks a lot for testing it. We did some tests on our side (results attached), but they don't look as good as yours. I'm currently working to understand the difference. Could you share your test system details and how you built MariaDB?

          Thanks,
          Sergey
          JoeMario Joe Mario added a comment -

          Hi Sergey:
          See my reply in the attached file (which seems to preserve cleaner formatting). Let me know if I didn't answer your questions.

          Joe

          JoeMario Joe Mario made changes -
          Attachment reply_to_sergey.txt [ 23811 ]
          JoeMario Joe Mario added a comment -

          Hi Sergey:
          I did another rerun over the weekend.

          I first ran my test script (the one I previously posted) three times against the original v5.5.31 unmodified MariaDB. Then I ran it another three times against the version with my proposed changes.

          While I wasn't able to reproduce my original speedup numbers, there is a positive effect as the thread count and contention increases. See the attached file for the results.

          I also did two other runs (which I didn't attach).
          First I compared your suggested changes against the changes I made. The difference was in the noise.

          Second, I added padding to the fast_mutex struct, to make sure nothing else was adding additional contention to the mutex cacheline:
          typedef struct st_my_pthread_fastmutex_t
          {
            pthread_mutex_t mutex;
            uint spins;
            uint rng_state;
          + // Pad out to a cacheline
          + uint pad[4];
          } my_pthread_fastmutex_t;

          The results were inconclusive. It hurt 0 to 2% at 12 threads, but helped up to 10% at 20 threads, and then it was noisy at higher thread counts.
          I mention these results just as an FYI.

          Hopefully my input in this whole thread will help speed up MariaDB in some way.
          Joe

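          (With the sizes Sergey quoted above - sizeof(pthread_mutex_t) = 40 on his 64-bit host - the padded struct works out to 40 + 4 + 4 + 4*4 = 64 bytes, i.e. exactly one typical cacheline. Whether it really occupies a single line still depends on each my_pthread_fastmutex_t being allocated on a 64-byte boundary, which the plain declaration above does not guarantee.)
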
          JoeMario Joe Mario made changes -
          Attachment base_vs_joechanges.txt [ 23901 ]

          svoj Sergey Vojtovich added a comment -

          Hi Joe,

          your input is definitely valuable. We haven't done much with regard to scalability on NUMA yet, but there seems to be some low-hanging fruit around.
          I plan to do some additional tests on our side this week and will hopefully try the perf tool you were suggesting.

          Thanks,
          Sergey
          serg Sergei Golubchik made changes -
          Labels MariaDB_5.5

          svoj Sergey Vojtovich added a comment -

          Last week I did some tests on a 4-CPU (64-core) Sandy Bridge host. Unfortunately I wasn't able to reproduce the performance improvement. It is probably due to hardware differences.
          serg Sergei Golubchik made changes -
          Fix Version/s 10.0.9 [ 14400 ]
          serg Sergei Golubchik made changes -
          Priority: Trivial [ 5 ] → Major [ 3 ]
          serg Sergei Golubchik made changes -
          Priority: Major [ 3 ] → Minor [ 4 ]

          svoj Sergey Vojtovich added a comment -

          I was able to reproduce the reported problem. Fast mutexes showed worse throughput than the other mutex types. Benchmark results are available here:
          http://svoj-db.blogspot.ru/2014/02/mariadb-mutexes-scalability.html

          It looks like this problem has already been raised a few times in MySQL circles:
          http://bugs.mysql.com/bug.php?id=58766
          http://bugs.mysql.com/bug.php?id=38941
          http://dev.mysql.com/worklog/task/?id=4601

          Given the above, fast mutexes are unlikely to ever scale better than normal mutexes. We agreed to disable fast mutexes in our release build configuration.
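
          (For anyone rebuilding to compare: as far as I know the relevant compile-time switch is the WITH_FAST_MUTEXES CMake option, which enables the MY_PTHREAD_FASTMUTEX code path, so "disable fast mutexes in the release build configuration" means the release builds simply stop turning that option on rather than changing the mutex code itself.)
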
          svoj Sergey Vojtovich made changes -
          Assignee: Sergey Vojtovich [ svoj ] → Sergei Golubchik [ serg ]

          svoj Sergey Vojtovich added a comment -

          Sergei, please review the fix for this bug.
          JoeMario Joe Mario added a comment -

          Hi Sergey and Sergei:
          The patches to add the cacheline tugging detection to the perf tool (perf c2c) were recently submitted upstream. See http://lwn.net/Articles/585195/.
          They are still in review with some cleanup to be added, but it's moving forward.

          If I get a chance, I'll take the version of MariaDB that's part of RHEL, run "perf c2c" on it during a sysbench run, and will post the output here so you can see what the tool is showing.

          Joe

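          (The c2c work was later merged into mainline perf; the usual workflow there is to capture with "perf c2c record" and then inspect the contended cachelines and the loads/stores hitting them with "perf c2c report".)
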
          serg Sergei Golubchik made changes -
          Fix Version/s: 10.0.10 [ 14500 ] (replacing 10.0.9 [ 14400 ])
          serg Sergei Golubchik made changes -
          Fix Version/s: 10.0.11 [ 15200 ] (replacing 10.0.10 [ 14500 ])

          serg Sergei Golubchik added a comment -

          ok to push
          serg Sergei Golubchik made changes -
          Assignee: Sergei Golubchik [ serg ] → Sergey Vojtovich [ svoj ]

          svoj Sergey Vojtovich added a comment -

          Fixed in 5.5.38:

          revno: 4174
          revision-id: svoj@mariadb.org-20140228114602-nyj6i2fejiywnhbx
          parent: monty@mariadb.org-20140503161217-ac6ec1uoq5sdg40o
          committer: Sergey Vojtovich <svoj@mariadb.org>
          branch nick: 5.5-mdev5081
          timestamp: Fri 2014-02-28 15:46:02 +0400
          message:
            MDEV-5081 - Simple performance improvement for MariaDB

            Currently fast mutexes have lower throuput compared to normal mutexes.
            Remove them from release build configuration.

          Joe, thanks for the c2c tool link. I will try to make use of it in further benchmarks.
          svoj Sergey Vojtovich made changes -
          Fix Version/s 5.5.38 [ 15400 ]
          Resolution Fixed [ 1 ]
          Status: Open [ 1 ] → Closed [ 6 ]
          serg Sergei Golubchik made changes -
          Workflow: defaullt [ 29201 ] → MariaDB v2 [ 43925 ]
          ratzpo Rasmus Johansson (Inactive) made changes -
          Workflow: MariaDB v2 [ 43925 ] → MariaDB v3 [ 64202 ]
          serg Sergei Golubchik made changes -
          Workflow: MariaDB v3 [ 64202 ] → MariaDB v4 [ 132199 ]

          People

            Assignee: svoj Sergey Vojtovich
            Reporter: JoeMario Joe Mario
            Votes: 2
            Watchers: 12

