[MDEV-5081] Simple performance improvement for MariaDB Created: 2013-09-30  Updated: 2014-05-06  Resolved: 2014-05-06

Status: Closed
Project: MariaDB Server
Component/s: None
Fix Version/s: 5.5.38, 10.0.11

Type: Task Priority: Minor
Reporter: Joe Mario Assignee: Sergey Vojtovich
Resolution: Fixed Votes: 2
Labels: None

Attachments: Text File base_vs_joechanges.txt     File maria_perf.patch     Text File mdev5081.patch     PDF File mdev5081.pdf     Text File reply_to_sergey.txt    

 Description   

MariaDB developers:
Here's a simple performance improvement I found in MariaDB (v5.5.31) while analyzing sysbench on my 4-node system.
It improves the sysbench oltp test by 3% to 17%, depending on the number of threads specified (and I'm sure there's some noise).

The patch is attached to this message. It reduces the memory accesses to the "spins" and "rng_state" fields of the my_pthread_fast_mutex_t struct.

typedef struct st_my_pthread_fastmutex_t
{
  pthread_mutex_t mutex;
  uint spins;
  uint rng_state;
} my_pthread_fastmutex_t;

As I'm sure you know, the mutex in that struct is very hot. Since it's accessed by CPUs on all nodes, a lot of time is wasted tugging the cacheline back and forth between NUMA nodes.

I noticed the code repeatedly accesses the "spins" and "rng_state" fields while looping to acquire the mutex. Since those fields reside in the same cacheline as the mutex, and since their accesses come from all CPUs on all NUMA nodes, they were contributing to making the mutex slower (because they increased the cache-to-cache contention between nodes).

My change simply keeps the values of "spins" and "rng_state" in local variables (i.e., registers) for as long as possible and only writes them back to memory when necessary. I didn't change anything in the algorithm.
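Since the patch itself is attached rather than inlined, here is a minimal sketch of the technique it describes. This is not the actual patch: the function name `fastmutex_lock_sketch`, the helper `mutex_delay`, and the backoff constants are illustrative stand-ins for MariaDB's `my_pthread_fastmutex_lock` internals. The key point is that the spin loop reads the hot fields once into locals and writes them back only on exit, so spinning threads never dirty the mutex's cacheline:

```c
#include <errno.h>
#include <pthread.h>

typedef struct st_my_pthread_fastmutex_t
{
  pthread_mutex_t mutex;
  unsigned int spins;
  unsigned int rng_state;
} my_pthread_fastmutex_t;

/* Park-Miller-style PRNG step on a caller-supplied local copy, so the
   spin loop never touches the shared rng_state field. */
static unsigned int park_rng_local(unsigned int *rng)
{
  *rng = (unsigned int)(((unsigned long long)*rng * 279470273U) % 4294967291U);
  return *rng;
}

/* Short busy-wait between lock attempts. */
static void mutex_delay(unsigned int delayloops)
{
  volatile unsigned int i;
  for (i = 0; i < delayloops; i++) {}
}

int fastmutex_lock_sketch(my_pthread_fastmutex_t *mp)
{
  /* Read the hot fields once into locals; the loop below works only on
     registers/stack, not the contended cacheline holding the mutex. */
  unsigned int spins_local = mp->spins;
  unsigned int rng_local   = mp->rng_state;
  unsigned int maxdelay    = 4;   /* stand-in for MY_PTHREAD_FASTMUTEX_DELAY */
  unsigned int i;
  int res;

  for (i = 0; i < spins_local; i++)
  {
    res = pthread_mutex_trylock(&mp->mutex);
    if (res != EBUSY)
    {
      mp->rng_state = rng_local;  /* single write-back on exit */
      return res;                 /* 0 on success, or a real error */
    }
    mutex_delay(maxdelay);
    maxdelay += park_rng_local(&rng_local) % 4 + 1;  /* randomized backoff */
  }
  mp->rng_state = rng_local;
  return pthread_mutex_lock(&mp->mutex);  /* give up spinning, block */
}
```

The algorithm is unchanged from the description above; only the memory traffic pattern differs, since the shared fields are written once per lock call instead of once per spin iteration.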

The rest of this message shows the improvement in sysbench transaction throughput for different thread counts.

Let me know if you have any questions. Since I'm not on the mailing list, please cc me on any reply.

Joe Mario

  1. sysbench --test=oltp --num-threads=12 --max-requests=1000000 --max-time=100 run

5.5.31-MariaDB 5.5.31-MariaDB-Modified
-------------- -----------------------
Thread cnt:12
transactions: 572694 (5726.83 per sec.) 589543 (5895.34 per sec.) 2.94% speedup.
transactions: 564215 (5642.05 per sec.) 582254 (5822.43 per sec.) 3.20% speedup.
transactions: 565231 (5652.21 per sec.) 583228 (5832.19 per sec.) 3.18% speedup.

Thread cnt:20
transactions: 507300 (5072.82 per sec.) 580229 (5802.09 per sec.) 14.38% speedup.
transactions: 509373 (5093.60 per sec.) 585629 (5856.09 per sec.) 14.97% speedup.
transactions: 497711 (4976.89 per sec.) 583506 (5834.94 per sec.) 17.24% speedup.

Thread cnt:30
transactions: 369979 (3699.66 per sec.) 410698 (4106.74 per sec.) 11.01% speedup.
transactions: 372194 (3721.70 per sec.) 412884 (4128.65 per sec.) 10.93% speedup.

Thread cnt:40
transactions: 366285 (3662.60 per sec.) 401050 (4010.23 per sec.) 9.49% speedup.
transactions: 369626 (3696.02 per sec.) 401913 (4018.88 per sec.) 8.74% speedup.

Thread cnt:50
transactions: 357529 (3574.99 per sec.) 389759 (3897.25 per sec.) 9.01% speedup.
transactions: 357116 (3570.83 per sec.) 387115 (3870.80 per sec.) 8.40% speedup.

Thread cnt:60
transactions: 335427 (3353.88 per sec.) 375134 (3750.91 per sec.) 11.84% speedup.
transactions: 334128 (3340.90 per sec.) 359116 (3590.78 per sec.) 7.48% speedup.

I've attached the patch, since it got mangled when I tried to insert it here.

Joe



 Comments   
Comment by Joe Mario [ 2013-10-03 ]

MariaDB developers:
This is a simple change that got up to a 17% speedup with sysbench on my 4 node system.
How do I find out if someone will pick this up to run with it?

Thanks,
Joe Mario

Comment by Peter (Stig) Edwards [ 2013-10-04 ]

Hello Joe,
I am just a passer-by, but I suspect that some subscribers to the maria-developers mailing list:
https://lists.launchpad.net/maria-developers
would be interested in the change. Maybe try posting there "Review MDEV-5081 my_pthread_fastmutex_lock patch, sysbench oltp gains for many threads on NUMA".
Thanks and good luck.

Comment by Peter (Stig) Edwards [ 2013-10-05 ]

I was wondering what was being used to analyze sysbench, and I found these resources helpful:
http://developerblog.redhat.com/2013/08/27/numa-hurt-app-perf/
http://developerblog.redhat.com/2013/05/31/dive-deeper-in-numa-systems/
Thank you Joe (and Don).

Comment by Sergey Vojtovich [ 2013-10-05 ]

Joe, thanks for your contribution! I reviewed your analysis and your patch and came up with one small extension: I believe there is not much sense in randomizing the timeout value, so I removed rng_state entirely.

Spin locks on NUMA might not be such a good idea. I'd like to do some extra benchmarking with fast mutexes disabled.

Comment by Joe Mario [ 2013-10-06 ]

Hi Peter and Sergey:
Glad to be of help.

Sergey:
Do you want me to try your patch on sysbench on our 4-node server? Or did you already try it?

Joe

Comment by Sergey Vojtovich [ 2013-10-06 ]

Hi Joe,

we will try it next week. If you have the resources and want to test my patch, feel free to do so. Your feedback will be valuable.

Comment by Joe Mario [ 2013-10-07 ]

Hi Sergey:
I did a build with your patch and compared it with my patch. This was on our 4-node server, with the database located on an SSD drive. I used the same sysbench command that I posted in my opening post.

Here's what I found:
12 Threads: Avg: 3.7% speedup over 3 runs
20 Threads: Avg: 15.5% slowdown over 3 runs
30 Threads: Avg: 15.5% slowdown over 2 runs
40 Threads: Avg: 10.6% slowdown over 2 runs
50 Threads: Avg: 4.3% slowdown over 1 run
60 Threads: Avg: 1.2% slowdown over 1 run

So something about that random backoff does help at the higher thread counts.
However, here's some more info. I see you deleted the "now unused" rng_state field from the struct. In the many iterations I did leading up to my patch, I noticed that when I tried to change the size of the struct (either by adding padding or moving things around), I too got a speedup at the lower thread counts and slowdowns at the higher thread counts.

I'm not a regular database user, and I don't know if 12 threads is the sweet spot you're aiming to speed up. I suspect scalability at higher thread counts is important to you, but please confirm.

If you want, I can try again, putting that hot mutex in its own aligned cacheline and padding it out so that nothing else conflicts with it. I can try it both with and without the spins field in it. I had mixed results earlier, but never with rng_state deleted.
Joe

Comment by Sergey Vojtovich [ 2013-10-07 ]

Hi Joe,

we're aiming to provide sane performance in all cases. According to the numbers, your patch wins.

On my 64-bit host, structure sizes are not affected by the removal of rng_state (due to 8-byte alignment, I guess):
sizeof(mysql_mutex_t)= 56, sizeof(my_pthread_fastmutex_t)= 48, sizeof(pthread_mutex_t)= 40

Did structure size change on your 4-node server?
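The alignment effect can be demonstrated with a small standalone check (the type names here are illustrative, not from the MariaDB source). On a typical LP64 Linux host, `pthread_mutex_t` is 40 bytes with 8-byte alignment, so both layouts round up to 48 bytes and dropping one 4-byte `uint` does not shrink the struct:

```c
#include <pthread.h>
#include <stddef.h>

/* Two layouts of the fast-mutex struct, with and without rng_state.
   Because the struct inherits pthread_mutex_t's 8-byte alignment, its
   size is padded to a multiple of 8; removing a single 4-byte field
   is absorbed by that tail padding on LP64 platforms. */
typedef struct
{
  pthread_mutex_t mutex;    /* 40 bytes on x86-64 glibc */
  unsigned int spins;
  unsigned int rng_state;
} fastmutex_with_rng_t;     /* 40 + 4 + 4 = 48 */

typedef struct
{
  pthread_mutex_t mutex;
  unsigned int spins;
} fastmutex_without_rng_t;  /* 40 + 4, padded up to 48 */

size_t fastmutex_size_with(void)    { return sizeof(fastmutex_with_rng_t); }
size_t fastmutex_size_without(void) { return sizeof(fastmutex_without_rng_t); }
```

On a 32-bit build (where `pthread_mutex_t` is smaller and 4-byte aligned) the sizes could differ, which is presumably why the question about the 4-node server's struct size is worth asking.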

Previously, the maximum possible increment for maxdelay was MY_PTHREAD_FASTMUTEX_DELAY (which is 4). I used this maximum value. Perhaps we should use a lower number and increment maxdelay by 1 instead?

Also, I had a look at the pthread_mutex_lock() source and found out that it spins if the mutex type is PTHREAD_MUTEX_ADAPTIVE_NP (which is the case with fast mutexes). In other words, we have double spinning.
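For context, PTHREAD_MUTEX_ADAPTIVE_NP is a glibc extension under which pthread_mutex_lock() itself spins briefly in user space before falling back to a kernel wait. A minimal sketch of how such a mutex is set up (the helper name `init_adaptive_mutex` is illustrative, not MariaDB code) makes the double-spinning point concrete: the fast-mutex wrapper spins around a lock call that already spins internally.

```c
#define _GNU_SOURCE   /* PTHREAD_MUTEX_ADAPTIVE_NP is a glibc extension */
#include <pthread.h>

/* Initialize a mutex of type PTHREAD_MUTEX_ADAPTIVE_NP. Locking it
   via pthread_mutex_lock() already performs a short user-space spin
   before sleeping, so an outer spin loop (the "fast mutex") means the
   contended path spins twice. */
int init_adaptive_mutex(pthread_mutex_t *m)
{
  pthread_mutexattr_t attr;
  int res;

  if ((res = pthread_mutexattr_init(&attr)) != 0)
    return res;
  if ((res = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP)) == 0)
    res = pthread_mutex_init(m, &attr);
  pthread_mutexattr_destroy(&attr);
  return res;
}
```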

Comment by Joe Mario [ 2013-10-07 ]

Hi Sergey:
Let me redo the testing I did yesterday. I want to double-check my steps, as I may have made a mistake. Give me a day or two to post back with any updated results.
Joe

Comment by Joe Mario [ 2013-10-10 ]

Hi Sergey:
It's going to be a while before I get back to this. I did a couple of runs - enough to know that your change is within the noise level. But I didn't test it as thoroughly as I usually do, due to being pulled away with other interrupts. So go ahead and use your enhanced version.

In my last post, when I said I made a mistake, I realized I grabbed the last rpmbuild in the shell history, and it wasn't using the spec file I thought it used. Sorry about the confusion.

And I agree with you about the struct size after rng_state is deleted.

If I get a chance to get back and do more testing with this, I will.
Joe

Comment by Sergey Vojtovich [ 2013-10-11 ]

Hi Joe,

thanks a lot for testing it. We did some tests on our side (results attached), but they don't look as good as yours. I'm currently working to understand the difference. Could you share your test system details and how you built MariaDB?

Thanks,
Sergey

Comment by Joe Mario [ 2013-10-11 ]

Hi Sergey:
See my reply in the attached file (which seems to preserve cleaner formatting). Let me know if I didn't answer your questions.

Joe

Comment by Joe Mario [ 2013-10-14 ]

Hi Sergey:
I did another rerun over the weekend.

I first ran my test script (the one I previously posted) three times against the original v5.5.31 unmodified MariaDB. Then I ran it another three times against the version with my proposed changes.

While I wasn't able to reproduce my original speedup numbers, there is a positive effect as the thread count and contention increases. See the attached file for the results.

I also did two other runs (which I didn't attach).
First I compared your suggested changes against the changes I made. The difference was in the noise.

Second, I added padding to the fast_mutex struct, to make sure nothing else was adding additional contention to the mutex cacheline:

typedef struct st_my_pthread_fastmutex_t
{
  pthread_mutex_t mutex;
  uint spins;
  uint rng_state;
+ // Pad out to a cacheline
+ uint pad[4];
} my_pthread_fastmutex_t;

The results were inconclusive. It hurt by 0 to 2% at 12 threads, helped by up to 10% at 20 threads, and was noisy at higher thread counts.
I mention these results just as an fyi.
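The stronger variant mentioned earlier, giving the hot mutex an aligned cacheline entirely to itself rather than just padding the struct's tail, could look like the following. This is a hypothetical layout, not a tested patch; the type name and the 64-byte line size are assumptions (64 bytes is typical for x86):

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical layout giving the mutex a 64-byte cacheline of its
   own: the aligned attribute forces spins onto the next cacheline
   boundary, so stores to spins/rng_state can never cause false
   sharing with the mutex itself. */
typedef struct
{
  pthread_mutex_t mutex;
  /* compiler inserts padding here up to the next 64-byte boundary */
  unsigned int spins __attribute__((aligned(64)));
  unsigned int rng_state;
} fastmutex_padded_t;
```

The trade-off is size: this struct occupies two full cachelines instead of one, which multiplies the memory footprint for workloads with many mutexes.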

Hopefully my input in this whole thread will help speed up MariaDB in some way.
Joe

Comment by Sergey Vojtovich [ 2013-10-14 ]

Hi Joe,

your input is definitely valuable. We haven't done much wrt scalability on NUMA yet, but there seems to be some low-hanging fruit around.
I plan to do some additional tests on our side this week and will hopefully try the perf tool you suggested.

Thanks,
Sergey

Comment by Sergey Vojtovich [ 2013-10-21 ]

Last week I did some tests on a 4 CPU (64 cores) Sandy Bridge host. Unfortunately, I wasn't able to reproduce the performance improvement. It is probably due to hardware differences.

Comment by Sergey Vojtovich [ 2014-02-28 ]

I was able to reproduce the reported problem. Fast mutexes showed the worst throughput compared to other mutex types. Benchmark results are available here:
http://svoj-db.blogspot.ru/2014/02/mariadb-mutexes-scalability.html

Looks like this problem was already raised a few times in MySQL circles:
http://bugs.mysql.com/bug.php?id=58766
http://bugs.mysql.com/bug.php?id=38941
http://dev.mysql.com/worklog/task/?id=4601

Given the above, fast mutexes are unlikely to ever scale better than normal mutexes. We agreed to disable fast mutexes in our release build configuration.

Comment by Sergey Vojtovich [ 2014-02-28 ]

Sergei, please review fix for this bug.

Comment by Joe Mario [ 2014-02-28 ]

Hi Sergey and Sergei:
The patches to add the cacheline tugging detection to the perf tool (perf c2c) were recently submitted upstream. See http://lwn.net/Articles/585195/.
They are still in review with some cleanup to be added, but it's moving forward.

If I get a chance, I'll take the version of MariaDB that's part of RHEL, run "perf c2c" on it during a sysbench run, and will post the output here so you can see what the tool is showing.

Joe

Comment by Sergei Golubchik [ 2014-05-05 ]

ok to push

Comment by Sergey Vojtovich [ 2014-05-06 ]

Fixed in 5.5.38:

revno: 4174
revision-id: svoj@mariadb.org-20140228114602-nyj6i2fejiywnhbx
parent: monty@mariadb.org-20140503161217-ac6ec1uoq5sdg40o
committer: Sergey Vojtovich <svoj@mariadb.org>
branch nick: 5.5-mdev5081
timestamp: Fri 2014-02-28 15:46:02 +0400
message:
  MDEV-5081 - Simple performance improvement for MariaDB
 
  Currently fast mutexes have lower throughput compared to normal mutexes.
  Remove them from release build configuration.

Joe, thanks for the c2c tool link. I will try to make use of it in further benchmark.

Generated at Thu Feb 08 07:01:30 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.