MariaDB Server / MDEV-21452

Use condition variables and normal mutexes instead of InnoDB os_event and mutex

Details

    Description

      Investigation suggested by Marko on Zulip after reading http://smalldatum.blogspot.com/2020/01/it-is-all-about-constant-factors.html

      No patches; built straight from the 10.4.10 release tag with cmake -DMUTEXTYPE=$type -DCMAKE_PREFIX_INSTALL=/scratch/mariadb-10.4.10-$type $HOME/mariadb-10.4.10
      Distro: Ubuntu 18.04, distro compiler.

      TPCCRunner test:

      POWER8, altivec supported - 20 core, 8 thread/core

      $ tail  fullrun-master-fstn4-mariadb-10.4.10-futex-28444.txt   fullrun-master-fstn4-mariadb-10.4.10-event-48215.txt  fullrun-master-fstn4-mariadb-10.4.10-sys-60112.txt
      ==> fullrun-master-fstn4-mariadb-10.4.10-futex-28444.txt <==
       
                    timestamp          tpm      avg_rt      max_rt   avg_db_rt   max_db_rt
                      average  2519939.03       50.01         687       50.00         687
       
         All phase Transactions: 100508512
      Warmup phase Transactions: 24910341
         Run phase Transactions: 75598171
       
      Waiting slaves to terminate users.
      All slaves disconnected.
       
      ==> fullrun-master-fstn4-mariadb-10.4.10-event-48215.txt <==
       
                    timestamp          tpm      avg_rt      max_rt   avg_db_rt   max_db_rt
                      average  1944470.28       63.97         782       63.96         782
       
         All phase Transactions: 466885487
      Warmup phase Transactions: 350217270
         Run phase Transactions: 116668217
       
      Waiting slaves to terminate users.
      All slaves disconnected.
       
      ==> fullrun-master-fstn4-mariadb-10.4.10-sys-60112.txt <==
       
                    timestamp          tpm      avg_rt      max_rt   avg_db_rt   max_db_rt
                      average  2412875.70       51.72         846       51.71         846
       
         All phase Transactions: 579124495
      Warmup phase Transactions: 434351953
         Run phase Transactions: 144772542
       
      Waiting slaves to terminate users.
      All slaves disconnected.
      

      Note: although the futex build was run for much less time, innodb_buffer_pool_dump_pct=100 had been carried over from the previous run, and throughput was consistent over the 30-minute window.

      Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz - 22 core, 4 thread/core

      ==> fullrun-master-ka4-mariadb-10.4.10-event-rr-50070.txt <==
       
                    timestamp          tpm      avg_rt      max_rt   avg_db_rt   max_db_rt
                      average  3354020.40       34.35         487       34.32         487
       
         All phase Transactions: 132952152
      Warmup phase Transactions: 32331540
         Run phase Transactions: 100620612
       
      Waiting slaves to terminate users.
      All slaves disconnected.
       
      ==> fullrun-master-ka4-mariadb-10.4.10-futex-rr-63543.txt <==
       
                    timestamp          tpm      avg_rt      max_rt   avg_db_rt   max_db_rt
                      average  3362135.83       33.50         604       33.48         604
       
         All phase Transactions: 131218680
      Warmup phase Transactions: 30354605
         Run phase Transactions: 100864075
       
      Waiting slaves to terminate users.
      All slaves disconnected.
       
      ==> fullrun-master-ka4-mariadb-10.4.10-sys-rr-56865.txt <==
       
                    timestamp          tpm      avg_rt      max_rt   avg_db_rt   max_db_rt
                      average  3363324.87       34.13         996       34.11         996
       
         All phase Transactions: 132642637
      Warmup phase Transactions: 31742891
         Run phase Transactions: 100899746
       
      Waiting slaves to terminate users.
      All slaves disconnected.
      

      Attachments

        1. master.properties.60
          0.8 kB
          Daniel Black
        2. MDEV-21452.ods
          80 kB
          Daniel Black
        3. MDEV-21452.ods
          60 kB
          Axel Schwenke
        4. MDEV-21452-nbl.ods
          66 kB
          Axel Schwenke
        5. my.cnf
          2 kB
          Daniel Black
        6. Screenshot from 2020-03-26 12-08-54.png
          88 kB
          Krunal Bauskar

          Activity

            mleich Matthias Leich added a comment:

            commit 9159383f32d8350dfa91bb62c825c64b1dc091d1 (HEAD, origin/bb-10.6-MDEV-21452)
            behaved well during RQG testing.
            The bad effects that were observed also occur in MariaDB versions without MDEV-21452.

            marko Marko Mäkelä added a comment:

            I implemented special enforcement of innodb_fatal_semaphore_wait_threshold for dict_sys.mutex and lock_sys.mutex. Due to an observed performance regression at high concurrency, I removed the lock_sys.mutex instrumentation and retained only the one on dict_sys.mutex. If pthread_mutex_trylock() fails, then the current thread would compare-and-swap 0 with its current time before waiting in pthread_mutex_lock(). Either the srv_monitor_task() or a subsequent thread that attempts to acquire dict_sys.mutex would then enforce the innodb_fatal_semaphore_wait_threshold and kill the process if necessary.

            While rewriting the test sys_vars.innodb_fatal_semaphore_wait_threshold accordingly, I noticed that not all hangs would be caught, even in the data dictionary cache. For example, if a DDL operation hung while holding both dict_sys.latch and dict_sys.mutex, a subsequent DDL operation would hang while waiting for dict_sys.latch, before even starting the wait for dict_sys.mutex. But DML threads that are trying to open a table would acquire dict_sys.mutex and be subject to the watchdog. Hopefully this type of watchdog testing will be adequate. We could of course add more instrumentation to debug builds.
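
            To illustrate the idea, here is a minimal sketch of the trylock-plus-timestamp scheme described above. The names (watched_mutex, wait_start) and the inline enforcement are made up for illustration; this is not the actual MariaDB code, and it deliberately does not track multiple concurrent waiters.

              #include <pthread.h>
              #include <atomic>
              #include <ctime>
              #include <cstdlib>

              // Hypothetical sketch of the dict_sys.mutex watchdog idea.
              class watched_mutex
              {
                pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
                std::atomic<time_t> wait_start{0};   // 0 = nobody is recorded as waiting

              public:
                void lock(time_t fatal_threshold)
                {
                  if (pthread_mutex_trylock(&mutex) == 0)
                    return;                          // uncontended: no bookkeeping needed

                  // Record the start of the wait, but only if no other waiter did so yet.
                  time_t zero = 0;
                  wait_start.compare_exchange_strong(zero, time(nullptr));

                  // A watchdog (a monitor task, or the next thread that reaches this
                  // point) compares wait_start with the current time and aborts the
                  // process if the threshold is exceeded.
                  if (time_t started = wait_start.load())
                    if (time(nullptr) - started > fatal_threshold)
                      abort();                       // the "kill the process" step

                  pthread_mutex_lock(&mutex);        // normal blocking wait
                  wait_start.store(0);               // simplified: winner resets the clock
                }

                void unlock() { pthread_mutex_unlock(&mutex); }
              };

            Note that in this sketch only the monitor task or a newly arriving waiter can enforce the threshold; a thread already blocked inside pthread_mutex_lock() cannot, which matches the enforcement path described above.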

            marko Marko Mäkelä added a comment:

            The main reason for having the homebrew mutexes was that their built-in spin loops could lead to better performance than the native implementation on contended mutexes.

            Some performance regression was observed for larger thread counts (exceeding the CPU core count) when updating non-indexed columns. I suspect that the culprit is contention on lock_sys.mutex, and I believe that implementing MDEV-20612 will address that.

            Also, log_sys.mutex is known to be a source of contention, but it was already changed to a native mutex in MDEV-23855. MDEV-23855 also removed some contention on fil_system.mutex, but kept it as a homebrew mutex. Contention on these mutexes should be reduced further in MDEV-14425.
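
            For context, a spin-then-block acquisition follows roughly the pattern sketched below. This is illustrative only; the spin count and the PAUSE hint are assumptions for the sketch, not InnoDB's actual tuning.

              #include <pthread.h>
              #if defined __x86_64__ || defined __i386__
              # include <immintrin.h>                 // _mm_pause()
              #endif

              // Illustrative spin-then-block lock: try a bounded busy-wait loop first,
              // and only fall back to the (potentially sleeping) native lock afterwards.
              static void spin_then_lock(pthread_mutex_t *m)
              {
                constexpr unsigned SPIN_ROUNDS = 30;  // arbitrary value for the sketch

                for (unsigned i = 0; i < SPIN_ROUNDS; i++)
                {
                  if (pthread_mutex_trylock(m) == 0)
                    return;                           // won the lock without sleeping
              #if defined __x86_64__ || defined __i386__
                  _mm_pause();                        // be polite to the sibling hyperthread
              #endif
                }

                // Contended for "too long": let the kernel put the thread to sleep.
                pthread_mutex_lock(m);
              }

            Under heavy contention the bounded busy-wait can win the lock without a system call, which is the benefit mentioned above; the downside is the CPU time wasted whenever the spin does not succeed.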

            marko Marko Mäkelä added a comment:

            We observed frequent timeouts and extremely slow execution of the test mariabackup.xb_compressed_encrypted, especially on Microsoft Windows builders. Those machines have 4 processor cores, and they run 4 client/server process pairs in parallel. (Our Linux builders have a lot more processor cores.) The test used to specify innodb_encryption_threads=4. That is, there was one page cleaner thread doing the actual work of writing data pages, and 4 ‘manager’ threads that fight each other to see who gets to wield the shovel and add more dirt to the pile that the page cleaner is trying to transport away. Changing the test to use innodb_encryption_threads=1 seems to have fixed the problem.

            With the previous setting, the test timed out on win32-debug on two successive runs; with the lower setting innodb_encryption_threads=1 it passed (at least once), consuming 13, 14, and 41 seconds on win32-debug and 14, 22, and 27 seconds on win64-debug. On a previous run with innodb_encryption_threads=4, the execution time was more than 500 seconds on win64-debug, and for 2 of the 3 innodb_page_size values, the execution time exceeded 900 seconds on win32-debug.

            Thanks to wlad for making the observation that the encryption threads were conflicting with each other. In MDEV-22258 we did experiment with different settings, and back then (still with the homebrew mutexes) there seemed to be some benefit to having multiple encryption (page-dirtying) threads.

            This highlights a benefit of the homebrew mutexes that we removed: spinning may yield a little better throughput when there is a lot of contention. I agree with the opinion that svoj has stated earlier: it is better to fix the underlying contention than to implement workarounds. I am confident that with MDEV-14425 and MDEV-20612 we will regain some scalability when the number of concurrent connections exceeds the number of processor cores. We already reduced buf_pool.mutex contention in MDEV-15053 and MDEV-23399 et al., and fil_system.mutex contention in MDEV-23855.

            marko Marko Mäkelä added a comment:

            The problem with the constantly sleeping and waking encryption threads was partially addressed in MDEV-24426. On GNU/Linux, with the native mutexes and condition variables, the CPU usage was low, but with the homebrew mutexes and events all threads seemed to be spinning constantly. Maybe on Microsoft Windows the flood of sleeps and wakeups performs worse?
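
            For comparison, a plain condition-variable wait sleeps until it is signalled instead of polling. A generic sketch of that pattern (made-up names, not the actual encryption-thread code):

              #include <pthread.h>

              // Generic sketch: a worker sleeps on a condition variable until there
              // is work, instead of repeatedly waking up to poll a flag.
              struct work_queue
              {
                pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
                pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
                int pending = 0;                     // the "work available" predicate

                void wait_for_work()
                {
                  pthread_mutex_lock(&mutex);
                  while (pending == 0)               // guards against spurious wakeups
                    pthread_cond_wait(&cond, &mutex);// releases mutex and sleeps; no spinning
                  pending--;
                  pthread_mutex_unlock(&mutex);
                }

                void submit_work()
                {
                  pthread_mutex_lock(&mutex);
                  pending++;
                  pthread_cond_signal(&cond);        // wake exactly one sleeping worker
                  pthread_mutex_unlock(&mutex);
                }
              };

            pthread_cond_wait() atomically releases the mutex and blocks, so an idle worker consumes no CPU until submit_work() signals it.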

            People

              marko Marko Mäkelä
              danblack Daniel Black