MariaDB Server / MDEV-5081

Simple performance improvement for MariaDB

Details

    • Type: Task
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 5.5.38, 10.0.11

    Description

      MariaDB developers:
      Here's a simple performance improvement I found in MariaDB (v5.5.31) while analyzing sysbench on my 4-node system.
      It improves the sysbench oltp test by 3% to 17%, depending on the number of threads specified (and I'm sure there's some noise).

      The patch is attached to this message. It reduces the memory accesses to the "spins" and "rng_state" fields of the my_pthread_fastmutex_t struct.

      typedef struct st_my_pthread_fastmutex_t
      {
        pthread_mutex_t mutex;
        uint spins;
        uint rng_state;
      } my_pthread_fastmutex_t;

      As I'm sure you know, the mutex in that struct is very hot. Since it's accessed by CPUs on all nodes, a lot of time is wasted tugging the cacheline back and forth between NUMA nodes.

      I noticed the code repeatedly accesses the "spins" and "rng_state" fields while looping to acquire the mutex. Since those fields reside in the same cacheline as the mutex, and since their accesses come from CPUs on all NUMA nodes, they were contributing to making the mutex slower (because they increased the cache-to-cache contention between nodes).

      My change simply keeps the values of "spins" and "rng_state" in local variables (registers) as long as possible and only updates them in memory when necessary. I didn't change anything in the algorithm.
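
      For readers without the attachment, here is a cleaned-up view of the two affected functions after the change, reconstructed from the attached maria_perf.patch (the attachment remains the authoritative copy). The local copies of spins and rng_state keep the spin loop from repeatedly touching the contended cacheline:

      static double park_rng(uint *rng_state)
      {
        *rng_state= ((my_ulonglong)*rng_state * 279470273U) % 4294967291U;
        return (*rng_state / 2147483647.0);
      }

      int my_pthread_fastmutex_lock(my_pthread_fastmutex_t *mp)
      {
        int res;
        uint i;
        uint maxdelay= MY_PTHREAD_FASTMUTEX_DELAY;
        const uint spin_cnt = mp->spins;   /* read the hot cacheline once ...    */
        uint rng_state = mp->rng_state;    /* ... and spin on local copies only  */

        for (i= 0; i < spin_cnt; i++)
        {
          res= pthread_mutex_trylock(&mp->mutex);

          if (res == 0) {
            mp->rng_state = rng_state;     /* write back only on the way out */
            return 0;
          }
          if (res != EBUSY) {
            mp->rng_state = rng_state;
            return res;
          }

          mutex_delay(maxdelay);
          maxdelay += park_rng(&rng_state) * MY_PTHREAD_FASTMUTEX_DELAY + 1;
        }

        mp->rng_state = rng_state;
        return pthread_mutex_lock(&mp->mutex);
      }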

      The rest of this message shows the improvement in sysbench transaction numbers for different thread counts.

      Let me know if you have any questions. Since I'm not on the mailing list, please cc me on any reply.

      Joe Mario

      # sysbench --test=oltp --num-threads=12 --max-requests=1000000 --max-time=100 run

      Columns: 5.5.31-MariaDB (baseline), then 5.5.31-MariaDB-Modified, then the speedup.
      Thread cnt:12
      transactions: 572694 (5726.83 per sec.) 589543 (5895.34 per sec.) 2.94% speedup.
      transactions: 564215 (5642.05 per sec.) 582254 (5822.43 per sec.) 3.20% speedup.
      transactions: 565231 (5652.21 per sec.) 583228 (5832.19 per sec.) 3.18% speedup.

      Thread cnt:20
      transactions: 507300 (5072.82 per sec.) 580229 (5802.09 per sec.) 14.38% speedup.
      transactions: 509373 (5093.60 per sec.) 585629 (5856.09 per sec.) 14.97% speedup.
      transactions: 497711 (4976.89 per sec.) 583506 (5834.94 per sec.) 17.24% speedup.

      Thread cnt:30
      transactions: 369979 (3699.66 per sec.) 410698 (4106.74 per sec.) 11.01% speedup.
      transactions: 372194 (3721.70 per sec.) 412884 (4128.65 per sec.) 10.93% speedup.

      Thread cnt:40
      transactions: 366285 (3662.60 per sec.) 401050 (4010.23 per sec.) 9.49% speedup.
      transactions: 369626 (3696.02 per sec.) 401913 (4018.88 per sec.) 8.74% speedup.

      Thread cnt:50
      transactions: 357529 (3574.99 per sec.) 389759 (3897.25 per sec.) 9.01% speedup.
      transactions: 357116 (3570.83 per sec.) 387115 (3870.80 per sec.) 8.40% speedup.

      Thread cnt:60
      transactions: 335427 (3353.88 per sec.) 375134 (3750.91 per sec.) 11.84% speedup.
      transactions: 334128 (3340.90 per sec.) 359116 (3590.78 per sec.) 7.48% speedup.
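
      (For reference, the speedup column is just the relative difference of the two rates: for the first 12-thread run, (5895.34 - 5726.83) / 5726.83 ≈ 0.0294, i.e. the reported 2.94%.)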

      I've attached the patch, since it got mangled when I tried to insert it here.

      Joe

      Attachments

        1. maria_perf.patch
          2 kB
        2. mdev5081.patch
          2 kB
        3. mdev5081.pdf
          16 kB
        4. reply_to_sergey.txt
          5 kB
        5. base_vs_joechanges.txt
          4 kB

        Activity

          JoeMario Joe Mario created issue -
          JoeMario Joe Mario made changes -
          Attachment: maria_perf.patch [ 23800 ]
          Description: edited to reference the attached maria_perf.patch instead of the inlined patch, which had been mangled by the issue tracker. The originally inlined diff:

diff -up BUILD/mariadb-5.5.31.bk/mysys/thr_mutex.c BUILD/mariadb-5.5.31/mysys/thr_mutex.c
--- BUILD/mariadb-5.5.31.bk/mysys/thr_mutex.c 2013-05-21 18:09:52.000000000 -0400
+++ BUILD/mariadb-5.5.31/mysys/thr_mutex.c 2013-09-25 15:59:44.554171774 -0400
@@ -886,10 +886,10 @@ int my_pthread_fastmutex_init(my_pthread
   Commun. ACM, October 1988, Volume 31, No 10, pages 1192-1201.
 */
 
-static double park_rng(my_pthread_fastmutex_t *mp)
+static double park_rng(uint *rng_state)
 {
-  mp->rng_state= ((my_ulonglong)mp->rng_state * 279470273U) % 4294967291U;
-  return (mp->rng_state / 2147483647.0);
+  *rng_state= ((my_ulonglong)*rng_state * 279470273U) % 4294967291U;
+  return (*rng_state / 2147483647.0);
 }
 
 int my_pthread_fastmutex_lock(my_pthread_fastmutex_t *mp)
@@ -897,20 +897,27 @@ int my_pthread_fastmutex_lock(my_pthread
   int res;
   uint i;
   uint maxdelay= MY_PTHREAD_FASTMUTEX_DELAY;
+  const uint spin_cnt = mp->spins;
+  uint rng_state = mp->rng_state;
 
-  for (i= 0; i < mp->spins; i++)
+  for (i= 0; i < spin_cnt; i++)
   {
     res= pthread_mutex_trylock(&mp->mutex);
 
-    if (res == 0)
+    if (res == 0) {
+      mp->rng_state = rng_state;
       return 0;
-
-    if (res != EBUSY)
+    }
+    if (res != EBUSY) {
+      mp->rng_state = rng_state;
       return res;
+    }
 
     mutex_delay(maxdelay);
-    maxdelay += park_rng(mp) * MY_PTHREAD_FASTMUTEX_DELAY + 1;
+    maxdelay += park_rng(&rng_state) * MY_PTHREAD_FASTMUTEX_DELAY + 1;
   }
+
+  mp->rng_state = rng_state;
   return pthread_mutex_lock(&mp->mutex);
 }

          JoeMario Joe Mario added a comment -

          MariaDB developers:
          This is a simple change that got up to a 17% speedup with sysbench on my 4 node system.
          How do I find out if someone will pick this up to run with it?

          Thanks,
          Joe Mario


          thatsafunnyname Peter (Stig) Edwards added a comment -

          Hello Joe,
          I am just a passer-by, but I suspect that some subscribers to the maria-developers mailing list:
          https://lists.launchpad.net/maria-developers
          would be interested in the change. Maybe try posting there "Review MDEV-5081 my_pthread_fastmutex_lock patch, sysbench oltp gains for many threads on NUMA".
          Thanks and good luck.
          svoj Sergey Vojtovich made changes -
          Assignee Sergey Vojtovich [ svoj ]

          thatsafunnyname Peter (Stig) Edwards added a comment -

          I was wondering what was being used to analyze sysbench, and I found these resources helpful:
          http://developerblog.redhat.com/2013/08/27/numa-hurt-app-perf/
          http://developerblog.redhat.com/2013/05/31/dive-deeper-in-numa-systems/
          Thank you Joe (and Don).
          svoj Sergey Vojtovich made changes -
          Attachment mdev5081.patch [ 23803 ]

          svoj Sergey Vojtovich added a comment -

          Joe, thanks for your contribution! I reviewed your analysis and your patch and came up with one small extension: I believe there is not much sense in randomizing the timeout value, so I removed rng_state entirely.

          Spin locks on NUMA might not be that good an idea. I'd like to do some extra benchmarking with fast mutexes disabled.
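
          For context, a sketch of what the lock loop looks like once the randomized increment is dropped. This is based on the comments in this issue, not on the attached mdev5081.patch itself, which remains the authoritative version:

          int my_pthread_fastmutex_lock(my_pthread_fastmutex_t *mp)
          {
            int res;
            uint i;
            uint maxdelay= MY_PTHREAD_FASTMUTEX_DELAY;
            const uint spin_cnt = mp->spins;   /* still read the shared field only once */

            for (i= 0; i < spin_cnt; i++)
            {
              res= pthread_mutex_trylock(&mp->mutex);
              if (res == 0)
                return 0;
              if (res != EBUSY)
                return res;
              mutex_delay(maxdelay);
              maxdelay += MY_PTHREAD_FASTMUTEX_DELAY;   /* fixed step instead of park_rng(); rng_state is gone */
            }
            return pthread_mutex_lock(&mp->mutex);
          }
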
          JoeMario Joe Mario added a comment -

          Hi Peter and Sergey:
          Glad to be of help.

          Sergey:
          Do you want me to try your patch on sysbench on our 4-node server? Or did you already try it?

          Joe


          svoj Sergey Vojtovich added a comment -

          Hi Joe,

          we will try it next week. If you have resources and want to test my patch - feel free to do so. Your feedback will be valuable.
          JoeMario Joe Mario added a comment -

          Hi Sergey:
          I did a build with your patch and compared it with my patch. This was on our 4-node server, with the database located on an SSD drive. I used the same sysbench command that I posted in my opening post.

          Here's what I found:
          12 Threads: Avg: 3.7% speedup over 3 runs
          20 Threads: Avg: 15.5% slowdown over 3 runs
          30 Threads: Avg: 15.5% slowdown over 2 runs
          40 Threads: Avg: 10.6% slowdown over 2 runs
          50 Threads: Avg: 4.3% slowdown over 1 run
          60 Threads: Avg: 1.2% slowdown over 1 run

          So something about that random backoff does help at the higher thread counts.
          However, here's some more info. I see you deleted the "now unused" rng_state field from the struct. In the many iterations I did leading up to my patch, I noticed that when I tried to change the size of the struct (either by adding padding or moving things around), I too got a speedup at the lower thread counts and slowdowns at the higher thread counts.

          I'm not a regular database user, and I don't know if 12 threads is the sweet spot you're aiming to speed up. I suspect scalability at higher thread counts is important to you - but please confirm.

          If you want, I can try again putting that hot mutex in its own aligned cacheline and pad it out so that nothing else conflicts with it. I can try it both with and without the spins field in it. I had mixed results earlier, but never with the rng_state deleted.
          Joe

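          For illustration, one hypothetical way to do what Joe describes above (give the hot mutex its own padded, aligned cacheline). This is not from any attached patch; it assumes 64-byte cachelines and GCC-style attributes:

          /* Hypothetical sketch only: the mutex gets a full 64-byte line to
             itself, so spins (and neighbouring allocations) never share a
             cacheline with it.  Requires sizeof(pthread_mutex_t) < 64. */
          typedef struct st_my_pthread_fastmutex_t
          {
            pthread_mutex_t mutex;
            char pad[64 - sizeof(pthread_mutex_t)];  /* fill the rest of the mutex line */
            uint spins;                              /* colder fields start a new line  */
            uint rng_state;
          } __attribute__((aligned(64))) my_pthread_fastmutex_t;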

          svoj Sergey Vojtovich added a comment -

          Hi Joe,

          we're aiming to provide sane performance in all cases, and according to the numbers your patch wins.

          On my 64-bit host, structure sizes are not affected by the removal of rng_state (due to 8-byte alignment, I guess):
          sizeof(mysql_mutex_t)= 56, sizeof(my_pthread_fastmutex_t)= 48, sizeof(pthread_mutex_t)= 40

          Did the structure size change on your 4-node server?

          Previously the maximum possible increment for maxdelay was MY_PTHREAD_FASTMUTEX_DELAY (which is 4). I used this maximum value. Probably we should use a lower number and increment maxdelay by 1 instead?

          Also I had a look at the pthread_mutex_lock() source and found out that it spins if the mutex type is PTHREAD_MUTEX_ADAPTIVE_NP (which is the case with fast mutexes). In other words, we have double spinning.
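
          The "double spinning" above comes from the fast mutex being initialized with the adaptive type, so the pthread_mutex_lock() fallback spins again in userspace before sleeping. A generic glibc illustration of that initialization (an assumption-labelled sketch, not the exact MariaDB my_pthread_fastmutex_init() code):

          #define _GNU_SOURCE             /* PTHREAD_MUTEX_ADAPTIVE_NP is a glibc extension */
          #include <pthread.h>

          static int init_adaptive_mutex(pthread_mutex_t *m)
          {
            pthread_mutexattr_t attr;
            int res;

            if ((res= pthread_mutexattr_init(&attr)))
              return res;
            /* For this type, glibc's pthread_mutex_lock() spins briefly in
               userspace before falling back to a futex wait. */
            pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
            res= pthread_mutex_init(m, &attr);
            pthread_mutexattr_destroy(&attr);
            return res;
          }
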
          JoeMario Joe Mario added a comment -

          Hi Sergey:
          Let me redo my testing that I did yesterday. I want to double check my steps as I may have made a mistake. Give me a day or two to post back with any updated results.
          Joe

          JoeMario Joe Mario added a comment -

          Hi Sergey:
          It's going to be a while before I get back to this. I did a couple of runs - enough to know that your change is within the noise level. But I didn't test it as thoroughly as I usually do, since I was pulled away by other interruptions. So go ahead and use your enhanced version.

          In my last post, when I said I made a mistake, I realized I grabbed the last rpmbuild in the shell history, and it wasn't using the spec file I thought it used. Sorry about the confusion.

          And I agree with you about the struct size after rng_state is deleted.

          If I get a chance to get back and do more testing with this, I will.
          Joe

          svoj Sergey Vojtovich made changes -
          Attachment mdev5081.pdf [ 23810 ]

          svoj Sergey Vojtovich added a comment -

          Hi Joe,

          thanks a lot for testing it. We did some tests on our side (results attached), but they don't look as good as yours. I'm currently working to understand the difference. Could you share your test system details and how you built MariaDB?

          Thanks,
          Sergey
          JoeMario Joe Mario added a comment -

          Hi Sergey:
          See my reply in the attached file (which seems to preserve cleaner formatting). Let me know if I didn't answer your questions.

          Joe

          JoeMario Joe Mario made changes -
          Attachment reply_to_sergey.txt [ 23811 ]
          JoeMario Joe Mario added a comment -

          Hi Sergey:
          I did another rerun over the weekend.

          I first ran my test script (the one I previously posted) three times against the original v5.5.31 unmodified MariaDB. Then I ran it another three times against the version with my proposed changes.

          While I wasn't able to reproduce my original speedup numbers, there is a positive effect as the thread count and contention increases. See the attached file for the results.

          I also did two other runs (which I didn't attach).
          First I compared your suggested changes against the changes I made. The difference was in the noise.

          Second, I added padding to the fast_mutex struct, to make sure nothing else was adding additional contention to the mutex cacheline:
          typedef struct st_my_pthread_fastmutex_t
          {
            pthread_mutex_t mutex;
            uint spins;
            uint rng_state;
          + // Pad out to a cacheline
          + uint pad[4];
          } my_pthread_fastmutex_t;

          The results were inconclusive. It hurt 0 to 2% at 12 threads, but helped up to 10% at 20 threads, and then it was noisy at higher thread counts.
          I mention these results just as an FYI.

          Hopefully my input in this whole thread will help speed up MariaDB in some way.
          Joe

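          (With the sizes Sergey quoted above - sizeof(pthread_mutex_t) = 40 on his 64-bit host - the padded struct works out to 40 + 4 + 4 + 4*4 = 64 bytes, i.e. exactly one typical cacheline. Whether it really occupies a single line still depends on each my_pthread_fastmutex_t being allocated on a 64-byte boundary, which the plain declaration above does not guarantee.)
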
          JoeMario Joe Mario made changes -
          Attachment base_vs_joechanges.txt [ 23901 ]

          svoj Sergey Vojtovich added a comment -

          Hi Joe,

          your input is definitely valuable. We haven't done much with regard to scalability on NUMA yet, but there seems to be some low-hanging fruit around.
          I plan to do some additional tests on our side this week and will hopefully try the perf tool you were suggesting.

          Thanks,
          Sergey
          serg Sergei Golubchik made changes -
          Labels MariaDB_5.5

          svoj Sergey Vojtovich added a comment -

          Last week I did some tests on a 4-CPU (64-core) Sandy Bridge host. Unfortunately I wasn't able to reproduce the performance improvement. It is probably due to hardware differences.
          serg Sergei Golubchik made changes -
          Fix Version/s 10.0.9 [ 14400 ]
          serg Sergei Golubchik made changes -
          Priority: Trivial [ 5 ] → Major [ 3 ]
          serg Sergei Golubchik made changes -
          Priority: Major [ 3 ] → Minor [ 4 ]

          svoj Sergey Vojtovich added a comment -

          I was able to reproduce the reported problem. Fast mutexes showed worse throughput than the other mutex types. Benchmark results are available here:
          http://svoj-db.blogspot.ru/2014/02/mariadb-mutexes-scalability.html

          It looks like this problem has already been raised a few times in MySQL circles:
          http://bugs.mysql.com/bug.php?id=58766
          http://bugs.mysql.com/bug.php?id=38941
          http://dev.mysql.com/worklog/task/?id=4601

          Given the above, fast mutexes are unlikely to ever scale better than normal mutexes. We agreed to disable fast mutexes in our release build configuration.
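
          (For anyone rebuilding to compare: as far as I know the relevant compile-time switch is the WITH_FAST_MUTEXES CMake option, which enables the MY_PTHREAD_FASTMUTEX code path, so "disable fast mutexes in the release build configuration" means the release builds simply stop turning that option on rather than changing the mutex code itself.)
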
          svoj Sergey Vojtovich made changes -
          Assignee: Sergey Vojtovich [ svoj ] → Sergei Golubchik [ serg ]

          svoj Sergey Vojtovich added a comment -

          Sergei, please review the fix for this bug.
          JoeMario Joe Mario added a comment -

          Hi Sergey and Sergei:
          The patches to add the cacheline tugging detection to the perf tool (perf c2c) were recently submitted upstream. See http://lwn.net/Articles/585195/.
          They are still in review with some cleanup to be added, but it's moving forward.

          If I get a chance, I'll take the version of MariaDB that's part of RHEL, run "perf c2c" on it during a sysbench run, and will post the output here so you can see what the tool is showing.

          Joe

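          (The c2c work was later merged into mainline perf; the usual workflow there is to capture with "perf c2c record" and then inspect the contended cachelines and the loads/stores hitting them with "perf c2c report".)
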
          serg Sergei Golubchik made changes -
          Fix Version/s: 10.0.10 [ 14500 ] (replacing 10.0.9 [ 14400 ])
          serg Sergei Golubchik made changes -
          Fix Version/s: 10.0.11 [ 15200 ] (replacing 10.0.10 [ 14500 ])

          serg Sergei Golubchik added a comment -

          ok to push
          serg Sergei Golubchik made changes -
          Assignee: Sergei Golubchik [ serg ] → Sergey Vojtovich [ svoj ]

          svoj Sergey Vojtovich added a comment -

          Fixed in 5.5.38:

          revno: 4174
          revision-id: svoj@mariadb.org-20140228114602-nyj6i2fejiywnhbx
          parent: monty@mariadb.org-20140503161217-ac6ec1uoq5sdg40o
          committer: Sergey Vojtovich <svoj@mariadb.org>
          branch nick: 5.5-mdev5081
          timestamp: Fri 2014-02-28 15:46:02 +0400
          message:
            MDEV-5081 - Simple performance improvement for MariaDB

            Currently fast mutexes have lower throuput compared to normal mutexes.
            Remove them from release build configuration.

          Joe, thanks for the c2c tool link. I will try to make use of it in further benchmarks.
          svoj Sergey Vojtovich made changes -
          Fix Version/s 5.5.38 [ 15400 ]
          Resolution Fixed [ 1 ]
          Status: Open [ 1 ] → Closed [ 6 ]
          serg Sergei Golubchik made changes -
          Workflow: defaullt [ 29201 ] → MariaDB v2 [ 43925 ]
          ratzpo Rasmus Johansson (Inactive) made changes -
          Workflow: MariaDB v2 [ 43925 ] → MariaDB v3 [ 64202 ]
          serg Sergei Golubchik made changes -
          Workflow: MariaDB v3 [ 64202 ] → MariaDB v4 [ 132199 ]

          People

            Assignee: svoj Sergey Vojtovich
            Reporter: JoeMario Joe Mario
            Votes: 2
            Watchers: 12

