  1. MariaDB Server
  2. MDEV-14482

Cache line contention on ut_rnd_ulint_counter()

Details

    Description

      Hi Sergey,

      I see a cache coherency and interconnect traffic issue with the random number generator ut_rnd_ulint_counter(), which is used during backoff in the spin lock.

      I was looking at the MySQL repository and found a patch that addresses this issue:

      https://github.com/mysql/mysql-server/commit/32b184cad3ebd68281fb0076a24a634d2f330aa1
      I see a good benefit from it on the sysbench OLTP write test: around 5-10% on Arm and 3-5% on Intel. In general this looks like a good idea.

      I also see that MySQL backed this change out in 5.20 for performance reasons, since it had introduced a scalability issue (patch dedc8b3d567fbb92ce912f1559fe6a08b2857045).

      In version 8.0 it has been committed again, with some new changes, in the patch https://github.com/mysql/mysql-server/commit/32b184cad3ebd68281fb0076a24a634d2f330aa1

      I am trying to port this to MariaDB, but there are a lot of code differences, so I cannot port it directly and provide a patch to test. For example, my_thread_local.h is not available, and if I try to create it there are merge clashes with my_pthread.h.

      Please let me know your opinion and if required I can share a patch soon with the above issues resolved.
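A minimal sketch of the problem described above, assuming a backoff that draws its random delay from one global counter. All names here (`shared_rnd_counter`, `rnd_shared`, `rnd_thread_local`, `backoff_iterations`, `spin_backoff`) are hypothetical and are not taken from the MariaDB or MySQL sources. The contended variant writes to a single shared variable from every spinning thread; the alternative keeps the generator state in thread-local storage so no cache line is shared.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Contended variant: every spinning thread updates the same counter, so the
// cache line holding it keeps moving between cores.
static std::atomic<std::uint64_t> shared_rnd_counter{65654363};

inline std::uint64_t rnd_shared() {
  // Relaxed ordering is enough for a backoff seed; the cost is the shared write.
  return shared_rnd_counter.fetch_add(1, std::memory_order_relaxed);
}

// Uncontended variant: each thread owns its generator state.
inline std::uint64_t rnd_thread_local() {
  // Seed from the thread id so threads do not start in lockstep (an assumption;
  // any per-thread seed would do).
  thread_local std::uint64_t state =
      std::hash<std::thread::id>{}(std::this_thread::get_id()) | 1;
  // One step of a 64-bit linear congruential generator.
  state = state * 6364136223846793005ULL + 1442695040888963407ULL;
  return state >> 33;
}

// Number of delay iterations in [0, max_delay), drawn without shared state.
inline std::uint32_t backoff_iterations(std::uint32_t max_delay) {
  if (max_delay == 0) return 0;
  return static_cast<std::uint32_t>(rnd_thread_local() % max_delay);
}

// Busy-wait for a random number of iterations before retrying a lock.
inline void spin_backoff(std::uint32_t max_delay) {
  volatile std::uint32_t sink = 0;  // volatile keeps the loop from being removed
  const std::uint32_t n = backoff_iterations(max_delay);
  for (std::uint32_t i = 0; i < n; ++i) sink = i;
}
```

The quality of the random numbers matters little here; the point of the sketch is only that the thread-local variant performs no writes to memory that other threads touch.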

          Activity

            svoj Sergey Vojtovich added a comment - Ok, so we need to be careful with this. I'll do more research on this.
            svoj Sergey Vojtovich added a comment - marko, please review the fix for this bug: https://github.com/MariaDB/server/commit/62eeb1c760196207ac57338237a0d58df925847f

            marko Marko Mäkelä added a comment - Looks OK to me.
            Side note: After this fix, ut_rnd_interval() is really only used in one place (if we ignore the uncovered fault injection fil_tablespace_iterate_failure). It could make sense to move the function to the only file where it is being used.

            ssethia Sandeep sethia added a comment - Sergey,

            If the atomics on Arm/Intel get optimized, or for that matter degrade further, I feel this may still need to be adjusted. I think what's really needed is to measure the distribution of the number of CPU cycles the mutexes actually need to be delayed, because "The argument gives the desired delay in microseconds on 100 MHz Pentium + Visual C++." is clearly not relevant any more.
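One way such a measurement could be approached is sketched below. It is an illustration only: it times a stand-in delay loop with `std::chrono::steady_clock` and reports nanoseconds rather than CPU cycles, and the names `delay_loop`, `sample_delay_ns` and `percentile` are hypothetical, not InnoDB functions.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a spin delay loop.
inline void delay_loop(std::uint32_t iterations) {
  volatile std::uint32_t sink = 0;  // volatile keeps the loop from being removed
  for (std::uint32_t i = 0; i < iterations; ++i) sink = i;
}

// Time `runs` executions of the delay loop; returns sorted samples in ns.
std::vector<std::int64_t> sample_delay_ns(std::uint32_t iterations,
                                          std::size_t runs) {
  using clock = std::chrono::steady_clock;
  std::vector<std::int64_t> samples;
  samples.reserve(runs);
  for (std::size_t r = 0; r < runs; ++r) {
    const auto start = clock::now();
    delay_loop(iterations);
    const auto stop = clock::now();
    samples.push_back(
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
            .count());
  }
  std::sort(samples.begin(), samples.end());
  return samples;
}

// Nearest-rank style percentile over sorted samples; p is in [0, 1].
std::int64_t percentile(const std::vector<std::int64_t>& sorted, double p) {
  if (sorted.empty()) return 0;
  const auto idx = static_cast<std::size_t>(p * (sorted.size() - 1));
  return sorted[idx];
}
```

Comparing, say, `percentile(samples, 0.5)` with `percentile(samples, 0.99)` for a few iteration counts on each target machine would show how far the real delay is from whatever the argument is documented to mean.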

            svoj Sergey Vojtovich added a comment - ssethia, the argument is not relevant indeed. But I'm not completely sure what you're suggesting.

            Scalability is at least a 3-dimensional thing: hardware, number of threads and workload. Usually when we improve benchmark results at some intersection of these three, we inevitably make performance worse at another intersection. The only approach I'm aware of that gives a non-negative result at all intersections is removing cache line contention (mutex, rwlock, false sharing, variables) from the hot path.

            In the scope of this task we removed the RNG variable contention, which is perfectly in line with what I say above.

            If you have any specific suggestion for how exactly we should tune the mutex delay, I'm all for creating another JIRA task and discussing it separately.
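As an illustration of removing false sharing from a hot path, here is a minimal sketch in which each thread updates its own counter, padded to a full cache line. The 64-byte line size is an assumption (some platforms use larger lines), and `kCacheLine`, `kMaxThreads`, `PaddedCounter` and `count_in_parallel` are hypothetical names, not MariaDB code.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes
constexpr unsigned kMaxThreads = 8;     // arbitrary cap for this sketch

// alignas pads each counter to its own cache line, so a write by one thread
// does not invalidate the line holding another thread's counter.
struct alignas(kCacheLine) PaddedCounter {
  std::atomic<std::uint64_t> value{0};
};
static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter should occupy exactly one assumed cache line");

// Each worker increments only its own slot; the slots are summed at the end.
std::uint64_t count_in_parallel(unsigned n_threads, std::uint64_t per_thread) {
  if (n_threads > kMaxThreads) n_threads = kMaxThreads;
  PaddedCounter slots[kMaxThreads];
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < n_threads; ++t) {
    workers.emplace_back([&slots, t, per_thread] {
      for (std::uint64_t i = 0; i < per_thread; ++i)
        slots[t].value.fetch_add(1, std::memory_order_relaxed);
    });
  }
  for (auto& w : workers) w.join();
  std::uint64_t total = 0;
  for (unsigned t = 0; t < n_threads; ++t)
    total += slots[t].value.load(std::memory_order_relaxed);
  return total;
}
```

Without the `alignas`, eight 8-byte counters would fit in one 64-byte line and every increment would contend with the other threads even though no data is logically shared.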

            People

              svoj Sergey Vojtovich
              ssethia Sandeep sethia