MariaDB Server / MDEV-25062

InnoDB read-write workload hits contention arising from rollback segment mutex

Details

    Description

      • Each read-write transaction uses a rollback segment to store its undo records (needed for rollback).
      • Currently, InnoDB has at most 128 rollback segments, and they are shared by all threads.
      • This means that if a user runs a 1024-thread workload, 8 threads will share each rollback segment (1024 / 128 = 8), assuming every thread is running a transaction at a given point in time.
      • On a NUMA machine the contention is compounded by the NUMA scalability bottleneck: the same rseg may need to be accessed by multiple threads located (possibly) on different NUMA nodes.
      • All this makes the rseg mutex one of the hottest mutexes.
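The sharing factor in the list above can be checked with a small sketch. It assumes transactions are handed rollback segments in simple round-robin order; that policy and the name `assign_round_robin` are illustrative assumptions, not taken from the InnoDB source.

```python
from collections import Counter

N_RSEGS = 128  # maximum number of rollback segments, as stated in the description


def assign_round_robin(n_transactions, n_rsegs=N_RSEGS):
    """Return the rollback segment index for each transaction.

    Round-robin assignment is an assumption made for illustration only.
    """
    return [i % n_rsegs for i in range(n_transactions)]


# One concurrently active transaction per thread, 1024 threads.
load = Counter(assign_round_robin(1024))
threads_per_rseg = set(load.values())
print(len(load), "rollback segments in use, threads per segment:", threads_per_rseg)
```

Under these assumptions every segment ends up with the same number of concurrent users, so each rseg mutex is contended by that many threads at once.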

      Testing was carried out using sysbench-update-index with 1024 threads on a machine with 4 NUMA nodes (2 sockets), on the 10.6 branch (#80ac9ec1).

      EVENT_NAME                                WAIT_MS          COUNT_STAR
      wait/synch/mutex/innodb/redo_rseg_mutex   102697560.5683   58758773
      wait/synch/mutex/innodb/log_sys_mutex     49862366.5623    73277659
      wait/synch/mutex/innodb/dict_sys_mutex    21348825.2544    71961431
      wait/synch/mutex/innodb/redo_rseg_mutex   0.0905           2058

      <70 secs of update-index workload with 1024 threads>

      wait/synch/mutex/innodb/redo_rseg_mutex 21298612.2935 17804286

      <70 secs of update-index workload with 1024 threads>

      wait/synch/mutex/innodb/redo_rseg_mutex 61677736.5123 37997346

      <70 secs of update-index workload with 1024 threads>

      wait/synch/mutex/innodb/redo_rseg_mutex 102884379.5462 58871725
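To see how much waiting each 70-second run adds, the readings above can be turned into per-interval deltas. This sketch assumes that WAIT_MS and COUNT_STAR are cumulative counters and that the 0.0905 / 2058 row is the baseline taken before the first run; both are my reading of the listing, not something the report states.

```python
# (WAIT_MS, COUNT_STAR) readings for redo_rseg_mutex, copied from the listing.
# Assumption: the counters are cumulative and the first row is the baseline.
samples = [
    (0.0905, 2058),
    (21298612.2935, 17804286),
    (61677736.5123, 37997346),
    (102884379.5462, 58871725),
]


def interval_deltas(readings):
    """Return (wait_ms, waits, mean_wait_ms) for each pair of consecutive readings."""
    deltas = []
    for (wait0, count0), (wait1, count1) in zip(readings, readings[1:]):
        wait_ms = wait1 - wait0
        waits = count1 - count0
        deltas.append((wait_ms, waits, wait_ms / waits))
    return deltas


deltas = interval_deltas(samples)
for run, (wait_ms, waits, mean_ms) in enumerate(deltas, start=1):
    print(f"run {run}: {wait_ms:.1f} ms waited over {waits} waits "
          f"(mean {mean_ms:.3f} ms per wait)")
```

Note that the wait time is summed over all 1024 threads, so it can far exceed the 70 seconds of wall-clock time per run.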

          Activity

            The observed performance bottleneck can be addressed in two ways. The short-term solution is to retain the current file format and allow a more efficient assignment of the 128 rollback segments to transactions.

            The long-term solution would be a file format change. When it comes to that, I noticed my old comment in MDEV-11657 (which is basically a scratchpad of loose ideas):

            In DB_ROLL_PTR, the rollback segment ID could identify the undo tablespace. Theoretically, given that each DB_TRX_ID has only one persistent rollback segment, we would not even need that; MVCC could look up the undo tablespace based on the DB_TRX_ID. This would require extending main memory data structures so that some data of committed transactions would be stored until the transactions are purged.

            We could repurpose the 7 bits in DB_ROLL_PTR to be flags for future use (always write them as zero from now on), and retire the TRX_SYS page which was demoted into a mere directory of undo tablespace header pages in MDEV-15158.
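For reference, here is a rough sketch of where those 7 bits sit. It assumes the current 56-bit DB_ROLL_PTR layout as I understand it (1 insert-flag bit, 7 bits of rollback segment ID, 32 bits of page number, 16 bits of byte offset); the Python function names are made up for illustration and are not InnoDB APIs.

```python
# Assumed DB_ROLL_PTR layout, most significant bit first:
# 1 bit insert flag | 7 bits rseg ID | 32 bits page number | 16 bits offset
INSERT_FLAG_POS = 55
RSEG_ID_POS = 48
PAGE_NO_POS = 16


def build_roll_ptr(is_insert, rseg_id, page_no, offset):
    """Pack the four fields into a 56-bit integer (hypothetical helper)."""
    if not 0 <= rseg_id < 128:
        raise ValueError("rseg_id must fit in 7 bits")
    if not 0 <= page_no < 2**32:
        raise ValueError("page_no must fit in 32 bits")
    if not 0 <= offset < 2**16:
        raise ValueError("offset must fit in 16 bits")
    return ((int(bool(is_insert)) << INSERT_FLAG_POS)
            | (rseg_id << RSEG_ID_POS)
            | (page_no << PAGE_NO_POS)
            | offset)


def decode_roll_ptr(roll_ptr):
    """Unpack a 56-bit roll pointer into its fields (hypothetical helper)."""
    return {
        "is_insert": bool((roll_ptr >> INSERT_FLAG_POS) & 1),
        "rseg_id": (roll_ptr >> RSEG_ID_POS) & 0x7F,
        "page_no": (roll_ptr >> PAGE_NO_POS) & 0xFFFFFFFF,
        "offset": roll_ptr & 0xFFFF,
    }


roll_ptr = build_roll_ptr(False, rseg_id=127, page_no=5, offset=100)
fields = decode_roll_ptr(roll_ptr)
print(hex(roll_ptr), fields)
```

The 7-bit field is what caps the number of rollback segments at 128. Writing it as zero, as suggested above, would free those bits for flags, provided the rollback segment can be found from DB_TRX_ID instead.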

            We could allow any number of undo tablespaces (much larger than 128). On startup, we would recover the undo log header pages from each undo tablespace that is found (based on a file name), as well as recover the rollback segment of each active transaction. Each undo tablespace could contain multiple rollback segments, as defined by the new undo tablespace format.

            If we went this route, we would probably refuse server startup if the undo logs are not empty, so that we will not have to support two undo log formats in the same executable.

            (Comment added by Marko Mäkelä.)

            I understand that there is an observable performance regression in 10.6 compared to 10.4. It could possibly be related to MDEV-21452, which removed the spinloop on the rollback segment mutex.

            I just finished a prototype that not only replaces the normal mutex with srw_mutex (so that it will use a spinloop on Linux and OpenBSD again, and be SRWLOCK on Windows, and pthread_mutex_t on anything else) but also removes some completely needless acquisition of the mutex. Furthermore, we will use relaxed atomic memory operations around the reference-counting, so that the mutex will not be needed at transaction start.

            (Comment added by Marko Mäkelä.)

            People

              marko Marko Mäkelä
              krunalbauskar Krunal Bauskar