[MDEV-5081] Simple performance improvement for MariaDB Created: 2013-09-30 Updated: 2014-05-06 Resolved: 2014-05-06 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | None |
| Fix Version/s: | 5.5.38, 10.0.11 |
| Type: | Task | Priority: | Minor |
| Reporter: | Joe Mario | Assignee: | Sergey Vojtovich |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | None | ||
| Attachments: |
|
| Description |
|
MariaDB developers: The patch is attached to this message. It reduces the memory accesses to the "spins" and "rng_state" fields of the my_pthread_fast_mutex_t struct. typedef struct st_my_pthread_fastmutex_t my_pthread_fastmutex_t; As I'm sure you know, the mutex in that struct is very hot. Since it's accessed by cpus on all nodes, a lot of time is wasted tugging the cacheline back-n-forth between numa nodes. I noticed the code is repeatedly accessing the "spins" and "rng_state" fields when looping trying to get the mutex. Since those fields reside in the same cacheline as the mutex, and since their accesses come from all cpus on all numa nodes, they were contributing to making the mutex slower (because they increased the cache-to-cache contention between nodes). My change is simply to keep the values for "spins" and "rng_state" in local variables (a register) as long as possible and only update their values in memory when necessary. I didn't change anything in the algorithm. The rest of this msg shows the improvement in sysbench transaction values for different thread counts. Let me know if you have any questions. Since I'm not on the mailing list, please cc me on any reply. Joe Mario
5.5.31-MariaDB 5.5.31-MariaDB-Modified Thread cnt:20 Thread cnt:30 Thread cnt:40 Thread cnt:50 Thread cnt:60 I've attached the patch, since it got mangled when I tried to insert it here. Joe |
| Comments |
| Comment by Joe Mario [ 2013-10-03 ] | |||||||||||
|
MariaDB developers: Thanks, | |||||||||||
| Comment by Peter (Stig) Edwards [ 2013-10-04 ] | |||||||||||
|
Hello Joe, | |||||||||||
| Comment by Peter (Stig) Edwards [ 2013-10-05 ] | |||||||||||
|
I was wondering what was being used to analyze sysbench, and I found these resources helpful: | |||||||||||
| Comment by Sergey Vojtovich [ 2013-10-05 ] | |||||||||||
|
Joe, thanks for your contribution! I reviewed your analysis and your patch and came up with one small extension: I believe there is no much sense in randomizing timeout value, so I totally removed rng_state. Spin locks on NUMA might be not that good idea. I'd like to do some extra benchmarking with fast mutexes disabled. | |||||||||||
| Comment by Joe Mario [ 2013-10-06 ] | |||||||||||
|
Hi Peter and Sergey: Sergey: Joe | |||||||||||
| Comment by Sergey Vojtovich [ 2013-10-06 ] | |||||||||||
|
Hi Joe, we will try it next week. If you have resources and want to test my patch - feel free to do so. Your feedback will be valuable. | |||||||||||
| Comment by Joe Mario [ 2013-10-07 ] | |||||||||||
|
Hi Sergey: Here's what I found: So something about that random backoff does help at the higher threads counts. I'm not a regular database user, and I don't know if 12 threads is the sweet spot you're aiming to speedup. I suspect scailability at higher thread counts is important to you - but please confirm. If you want, I can try again putting that hot mutex in its own aligned cacheline and pad it out so that nothing else conflicts with it. I can try it both with and without the spins field in it. I had mixed results earlier, but never with the rng_state deleted. | |||||||||||
| Comment by Sergey Vojtovich [ 2013-10-07 ] | |||||||||||
|
Hi Joe, we're aiming to provide sane performance in all cases. According to the numbers your patch wins. On my 64bit host structure sizes are not affected by removal of rng_state (due to 8byte alignment I guess): Did structure size change on your 4-node server? Previously maximum possible increment for maxdelay was MY_PTHREAD_FASTMUTEX_DELAY (which is 4). I used this maximum value. Probably we should use lower number and increment maxdelay by 1 instead? Also I had a look at pthread_mutex_lock() source and found out that it spins if mutex type is PTHREAD_MUTEX_ADAPTIVE_NP (which is the case with fast mutex). In other words we have double spinning. | |||||||||||
| Comment by Joe Mario [ 2013-10-07 ] | |||||||||||
|
Hi Sergey: | |||||||||||
| Comment by Joe Mario [ 2013-10-10 ] | |||||||||||
|
Hi Sergey: In my last post, when I said I made a mistake, I realized I grabbed the last rpmbuild in the shell history, and it wasn't using the spec file I thought it used. Sorry about the confusion. And I agree with you about the struct size after rng_state is deleted. If I get a chance to get back and do more testing with this, I will. | |||||||||||
| Comment by Sergey Vojtovich [ 2013-10-11 ] | |||||||||||
|
Hi Joe, thanks a lot for testing it. We did some tests on our side (results attached), but they don't look as good as yours. I'm currently working towards to understand the difference. Could you share you test system details and how did you build MariaDB? Thanks, | |||||||||||
| Comment by Joe Mario [ 2013-10-11 ] | |||||||||||
|
Hi Sergey: Joe | |||||||||||
| Comment by Joe Mario [ 2013-10-14 ] | |||||||||||
|
Hi Sergey: I first ran my test script (the one I previously posted) three times against the original v5.5.31 unmodified MariaDB. Then I ran it another three times against the version with my proposed changes. While I wasn't able to reproduce my original speedup numbers, there is a positive effect as the thread count and contention increases. See the attached file for the results. I also did two other runs (which I didn't attach). Second, I added padding to the fast_mutex struct, to make sure nothing else was adding additional contention to the mutex cacheline: my_pthread_fastmutex_t; The results were inconclusive. It hurt 0 to 2% at 12 threads, but helped up to 10% at 20 threads, and then it was noisy at higher thread counts. Hopefully my input in this whole thread will help speedup MariaDB in some way. | |||||||||||
| Comment by Sergey Vojtovich [ 2013-10-14 ] | |||||||||||
|
Hi Joe, your input is definitely valuable. We didn't do much wrt scalability on NUMA yet, but there seem to be some low-laying fruits hanging around. Thanks, | |||||||||||
| Comment by Sergey Vojtovich [ 2013-10-21 ] | |||||||||||
|
Last week I did some tests on 4 CPU (64 cores) Sandy Bridge host. Unfortunately I wasn't able to reproduce performance improvement. Probably it is due to hardware difference. | |||||||||||
| Comment by Sergey Vojtovich [ 2014-02-28 ] | |||||||||||
|
I was able to reproduce reported problem. Fast mutexes shown worst throughput compared to other mutex types. Benchmark results are available here: Looks like this problem was already raised a few times in MySQL circles: Said the above fast mutexes will unlikely scale better than normal mutexes ever. We agreed to disable fast mutexes in our release build configuration. | |||||||||||
| Comment by Sergey Vojtovich [ 2014-02-28 ] | |||||||||||
|
Sergei, please review fix for this bug. | |||||||||||
| Comment by Joe Mario [ 2014-02-28 ] | |||||||||||
|
Hi Sergey and Sergei: If I get a chance, I'll take the version of MariaDB that's part of RHEL, run "perf c2c" on it during a sysbench run, and will post the output here so you can see what the tool is showing. Joe | |||||||||||
| Comment by Sergei Golubchik [ 2014-05-05 ] | |||||||||||
|
ok to push | |||||||||||
| Comment by Sergey Vojtovich [ 2014-05-06 ] | |||||||||||
|
Fixed in 5.5.38:
Joe, thanks for the c2c tool link. I will try to make use of it in further benchmark. |