MariaDB Server
MDEV-23855

InnoDB log checkpointing causes regression for write-heavy OLTP

Details

    Description

      Workload: sysbench OLTP write-only, i.e. as it is used by the regression benchmark suite in t_writes_innodb_multi.

      Setup: 16G buffer pool. 4G redo log. 4G or 8G data set. innodb_flush_neighbors = 0, innodb_io_capacity = 1000 (or 5000, 10000)
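
      The effective values can be verified at runtime; the following query is only an illustration of the setup above (the buffer pool and redo log sizes are startup parameters):

      SELECT @@innodb_buffer_pool_size DIV (1024*1024*1024) AS buffer_pool_gb,
             @@innodb_log_file_size    DIV (1024*1024*1024) AS redo_log_gb,
             @@innodb_flush_neighbors  AS flush_neighbors,
             @@innodb_io_capacity      AS io_capacity;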

      Observation: after starting high, performance drops after about one minute. If one waits long enough, oscillations in throughput become visible. This seems to be related to Innodb_checkpoint_age reaching Innodb_checkpoint_max_age. There seems to be no LRU flushing at all, only flush_list flushing.
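
      The checkpoint age and its limit mentioned above can be watched while the benchmark runs, for example:

      -- Innodb_checkpoint_age approaching Innodb_checkpoint_max_age indicates
      -- that synchronous ("furious") checkpoint flushing is about to start.
      SHOW GLOBAL STATUS LIKE 'Innodb_checkpoint%';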

      Attachments

        1. Innodb_io_capacity.png (66 kB)
        2. Innodb-max-dirty-pages-pct-lwm.png (60 kB)
        3. oltp_ts_128.png (30 kB)
        4. oltp_ts_256.png (39 kB)
        5. oltp_ts_64.png (27 kB)
        6. oltp_ts_64.png (41 kB)
        7. oltp_ts_8.png (26 kB)
        8. Screen Shot 2020-10-20 at 5.59.20 PM.png (690 kB)


          Activity

            marko Marko Mäkelä added a comment -

            The change of the default innodb_max_dirty_pages_pct_lwm from 0 to 75 made in MDEV-23399 was reverted, because users may expect the shutdown of an idle server to be fast. We might change that in a future major release, but not in the middle of a GA release series.

            The remaining observed contention around fil_system.mutex is related to fil_system.unflushed_spaces when innodb_flush_method=O_DIRECT_NO_FSYNC is not being used, and to FILE_MODIFY records that we will write out during log checkpoints until MDEV-14425 changes the log format.

            In a future release, we might want to try to replace fil_system.unflushed_spaces with something lock-free, or simply change the default innodb_flush_method unless there are good reasons against it.

            Log checkpoints will flush pages of InnoDB temporary tables as ‘collateral damage’ until MDEV-12227 has been addressed.


            marko Marko Mäkelä added a comment -

            What the source code refers to as "flush_list flushing" and "LRU flushing" would better be called checkpoint flushing (writing out data pages for a log checkpoint) and eviction flushing (writing out modified pages so that they can be replaced).

            The ultimate purpose of the checkpoint flushing is to avoid a situation where the circular log file becomes unrecoverable due to the tail of the log overwriting the head. The checkpoint would logically remove the head of the log, making the recovery start from a newer checkpoint LSN. This ticket (MDEV-23855) significantly reduced the latency related to checkpoint flushing, especially in the case when the log capacity is reached and the condition in log_free_check() would hold.

            The eviction flushing is necessary when we are running out of buffer pool, and there exist unwritten changes for the least recently used pages, possibly because the buffer pool is small and the log file is large (possibly much larger than the buffer pool; our recovery can consist of multiple batches without problems, ever since MDEV-21351 was implemented). Performance problems related to that were addressed in MDEV-23399 (which also removed the special case of "single-page flushing").

            The further improvement in MDEV-12227 will be that pages belonging to temporary tables may only be written as part of eviction flushing, and never for checkpoint flushing.


            marko Marko Mäkelä added a comment -

            Unfortunately, innodb_flush_sync=OFF is broken (MDEV-24109) in MariaDB 10.5.7 as a result of these changes. With that fix, the checkpoint flushing is always performed by the page cleaner thread, in batches of innodb_io_capacity_max pages, and setting innodb_flush_sync=OFF will limit the rate to innodb_io_capacity_max pages per second.
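
            Both parameters mentioned here are dynamic, so the rate limiting can be exercised at runtime; an illustrative sketch with arbitrary example values:

            -- Checkpoint flushing proceeds in batches of innodb_io_capacity_max pages.
            SET GLOBAL innodb_io_capacity_max = 2000;
            -- With the MDEV-24109 fix, OFF caps checkpoint flushing at
            -- innodb_io_capacity_max pages per second.
            SET GLOBAL innodb_flush_sync = OFF;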

            axel Axel Schwenke added a comment -

            In order to make the most of the changes done in this task (MDEV-23855), you must also configure InnoDB correctly. There are two important server variables to look at. Both give the DBA a chance to balance I/O needs against SQL throughput.

            There are two timeseries plots attached. Innodb-max-dirty-pages-pct-lwm.png shows how the server variable innodb_max_dirty_pages_pct_lwm affects flushing. The plot begins with innodb_max_dirty_pages_pct_lwm=0 and increases it by 10 every 1800 seconds. Higher values reduce flushing activity: the number of dirty pages in the buffer pool increases, more changes to a page can be combined into a single write, and the I/O per transaction goes down.

            On the downside this also increases the InnoDB checkpoint age. When the soft limit at ~80% of the redo log capacity is hit, InnoDB starts furious flushing, resulting in oscillations. The plot shows this after 10800 seconds, when innodb_max_dirty_pages_pct_lwm is changed from 50 to 60. Keep in mind that the best value depends on the workload, data set size, redo log capacity and other factors.

            The other variable is innodb_io_capacity, covered by the Innodb_io_capacity.png timeseries plot. In this plot we start with innodb_io_capacity=500 and increase it every 2000 seconds, up to 12000. The variable sets the target flushing rate (pages per second). If it is set too low, InnoDB will again reach the checkpoint age limit and oscillate between furious and relaxed flushing. Only when the flushing rate is high enough to keep pace with the rate at which pages are made dirty does InnoDB reach a steady state. If you set innodb_io_capacity higher than needed, InnoDB will flush pages more often, leading to fewer writes being combined and more I/O per transaction.
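
            Both variables are dynamic, so the stepping shown in the plots can be reproduced at runtime; a minimal sketch with illustrative values:

            -- Raise the pre-flushing threshold and the target flushing rate,
            -- then watch dirty pages and checkpoint age before the next step.
            SET GLOBAL innodb_max_dirty_pages_pct_lwm = 50;
            SET GLOBAL innodb_io_capacity = 5000;
            SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';
            SHOW GLOBAL STATUS LIKE 'Innodb_checkpoint_age';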


            marko Marko Mäkelä added a comment -

            With MDEV-24537 fixed (in 10.5.9), the default value innodb_max_dirty_pages_pct_lwm=0 will retain its previously undocumented meaning: innodb_max_dirty_pages_pct (a similarly-named parameter without the _lwm suffix) will be consulted instead. To have other versions behave like MariaDB 10.5.7 and 10.5.8 with regard to this parameter, I believe that the following should work:

            SET GLOBAL innodb_max_dirty_pages_pct_lwm=0.0001;
            

            It is obvious that having the page cleaner thread perform eager ‘pre-flushing’ will reduce the age of the log checkpoint, and thus will reduce the likelihood that user threads will have to wait for a checkpoint flushing batch to successfully reduce the age of the checkpoint.

            As part of this task, the page cleaner was almost rewritten, and many CPU contention points were removed. The contention on fil_system.mutex was reduced, and the doublewrite buffer initiates asynchronous writes, instead of synchronous ones. Thanks to these changes, the maximum latency (even in the case that the ‘pre-flushing’ is disabled) should be lower. In MDEV-24537 I posted the result of a quick test on a hard disk.

            One more thing worth noting is that the parameter innodb_io_capacity is not only the target number of pages per second to write during the ‘background pre-flushing’ (when it is enabled by innodb_max_dirty_pages_pct_lwm and other parameters), but also the number of pages to write per batch during the furious checkpoint flushing (enabled by default by innodb_flush_sync=ON).

            My intuition says that a DBA who wants to minimize ‘write amplification’ should try to disable the pre-flushing altogether (so that repeatedly modified pages will be rewritten to data files less frequently). In that kind of a scenario, I would expect that setting innodb_io_capacity to a small value will reduce the wait times in buf_flush_wait_flushed().
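
            A minimal sketch of that tuning (values are illustrative only; note from the beginning of this comment that the 0 value behaves differently in 10.5.7 and 10.5.8):

            -- Disable eager pre-flushing; with MDEV-24537, 0 falls back to
            -- innodb_max_dirty_pages_pct.
            SET GLOBAL innodb_max_dirty_pages_pct_lwm = 0;
            -- A small value should also reduce the wait times in
            -- buf_flush_wait_flushed(), as described above.
            SET GLOBAL innodb_io_capacity = 200;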

            I think that we will need more benchmarking for this. One parameter that I do not think we have measured is ‘average number of page writes per transaction’. I think that we should try to minimize that while simultaneously trying to keep the throughput and latency as stable as possible. Possibly, the function buf_flush_sync_for_checkpoint() will need a separate parameter for the number of pages written per batch, instead of reusing innodb_io_capacity. (After all, that function is executing in I/O bound mode, potentially writing many more than innodb_io_capacity pages per second.)
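
            Such a metric could be approximated from existing counters; a rough sketch (the choice of counters is an assumption, not an established definition):

            -- Rough estimate of average page writes per committed transaction,
            -- based on global counters since server startup.
            SELECT
              (SELECT variable_value FROM information_schema.global_status
               WHERE variable_name = 'Innodb_buffer_pool_pages_flushed') /
              (SELECT variable_value FROM information_schema.global_status
               WHERE variable_name = 'Com_commit') AS page_writes_per_commit;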


            People

              marko Marko Mäkelä
              axel Axel Schwenke