MariaDB Server / MDEV-23399

10.5 performance regression with IO-bound tpcc

Details

    Description

      Triggered by this blog post from Percona.
      The problem could be reproduced with sysbench-tpcc and Percona's settings: a 25 GB buffer pool for a 100 GB data set (1000 warehouses), datadir located on an SSD, and a tpcc workload with 32 benchmark threads on hardware with 16 cores/32 hyperthreads.
      Throughput starts high and then decreases over a varying time period (500..1200 seconds) to reach ~200 tps. The performance schema shows a lot of time spent on buf_pool_mutex. CPU usage of the mariadbd process is rather low, around 300%.
      MySQL 8.0 does not show that problem. MariaDB 10.5.4 performs better than a pre-10.5.5 snapshot.


          Activity

            wlad, please review the squashed commit.

            axel and krunalbauskar, please test the performance. I think that we must deal with MDEV-23855 separately.

            mleich, please run the wide battery of stress tests. In previous tests more than a week ago, some corruption or crashes on crash recovery occurred. I believe that the problem may have been fixed since then.

            marko Marko Mäkelä added a comment

            Hello,

            I can see that this issue has moved to "Stalled". I was wondering whether it would be possible to get an update on the current state of the fix and the target minor version where it will arrive.

            Thanks in advance.

            Bernardo Perez added a comment

            This scenario (write-heavy workload that does not fit in the buffer pool) was addressed by rewriting most of the page cleaner thread and page flushing, by simplifying related data structures and reducing mutex operations. LRU flushing will now only be initiated by user threads, and the page cleaner thread will perform solely checkpoint-related flushing. There is no single-page flushing anymore, and the page cleaner will not wait for log writes or page latches.
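            The division of labor described above can be illustrated with a toy Python sketch (not InnoDB code; the class and method names here are invented for the example). The key point is that the page cleaner works through the flush list in oldest_modification order, which is exactly what allows the checkpoint LSN to advance:

            ```python
            from collections import deque

            class Page:
                def __init__(self, page_id, oldest_modification):
                    self.page_id = page_id
                    # LSN of the first change that has not yet reached the data file
                    self.oldest_modification = oldest_modification

            class BufferPool:
                def __init__(self):
                    self.flush_list = deque()  # dirty pages, oldest modification first

                def add_dirty(self, page):
                    self.flush_list.append(page)
                    # keep the list ordered by oldest_modification, oldest first
                    self.flush_list = deque(sorted(self.flush_list,
                                                   key=lambda p: p.oldest_modification))

                def checkpoint_flush(self, target_lsn):
                    """Page-cleaner work: write out every page whose oldest
                    unflushed change precedes the desired checkpoint LSN."""
                    written = []
                    while (self.flush_list
                           and self.flush_list[0].oldest_modification < target_lsn):
                        written.append(self.flush_list.popleft().page_id)
                    return written

            pool = BufferPool()
            for pid, lsn in [(1, 100), (2, 300), (3, 50)]:
                pool.add_dirty(Page(pid, lsn))

            # advancing the checkpoint to LSN 200 requires writing pages 3 and 1
            print(pool.checkpoint_flush(200))  # [3, 1]
            ```

            In the real server, LRU flushing (making free pages for user threads) is triggered separately by the user threads themselves, so the page cleaner loop above only ever serves the checkpoint.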

            Performance will be improved further in MDEV-23855 for write-heavy cases where all data does fit in the buffer pool. Among other things, that will remove contention on fil_system.mutex between the page cleaner and threads executing write completion callbacks. The work is mostly done.

            marko Marko Mäkelä added a comment
            sayap Yap Sok Ann added a comment -

            ... Furthermore, if the FIL_PAGE_LSN of a page is ahead of log_sys.get_flushed_lsn(), that is, what has been persistently written to the redo log, we would trigger a log flush and then resume the page flushing. This would unnecessarily limit the performance of the page cleaner thread and trigger the infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms. The settings might not be optimal" that were suppressed in commit d1ab89037a518fcffbc50c24e4bd94e4ec33aed0 unless log_warnings>2.

            Our revised algorithm will make log_sys.get_flushed_lsn() advance at the start of buf_flush_lists(), and then execute a 'best effort' to write out all pages. The flush batches will skip pages that were modified since the log was written, or are currently exclusively locked.
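            The skip rule in the quoted design can be sketched as follows (a simplified Python stand-in; the tuple layout and names are invented for illustration). A page is only eligible for the batch if its latest change is already durable in the redo log and nobody holds it exclusively:

            ```python
            FLUSHED_LSN = 500  # durably written redo log position at batch start

            pages = [
                # (page_id, FIL_PAGE_LSN, exclusively_latched)
                (1, 450, False),  # eligible: its changes are already durable
                (2, 600, False),  # skipped: modified after the log was written
                (3, 400, True),   # skipped: exclusively latched by another thread
            ]

            def flush_batch(pages, flushed_lsn):
                """Best-effort batch: write only pages whose latest change is
                already durable in the redo log and that are not latched."""
                return [pid for pid, page_lsn, latched in pages
                        if page_lsn <= flushed_lsn and not latched]

            print(flush_batch(pages, FLUSHED_LSN))  # [1]
            ```

            Skipped pages are simply picked up by a later batch, once the log has advanced or the latch has been released; no batch ever blocks on a log write or a page latch.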

            This seems like a very nice design, but I have some concern about how it was done previously, and how it is still being done in the latest MySQL/Percona:

            1. Call log_write_up_to() with the newest LSN of the modified page
            2. Write out the modified page

            As the block mutex is not held, does it mean that in between step 1 and step 2, some mtr can always further modify the page with a newer LSN?

            If that's the case, a crash after step 2 would mean that the data files are now ahead of the redo log. What would be the consequences of that?

            Sorry if this is a noob question. I am rather interested in InnoDB page flushing performance, and after trying to understand the code a little (still stuck with PXC 5.6 here), I am really curious what the point of step 1 is if it can't guarantee anything.
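            The race being asked about can be shown as a toy timeline (hypothetical Python, with simplified stand-ins for log_write_up_to() and FIL_PAGE_LSN). If nothing prevents the page from being modified between step 1 and step 2, the page image written in step 2 can carry an LSN ahead of the durably written log:

            ```python
            log_flushed_lsn = 0

            def log_write_up_to(lsn):
                """Stand-in for flushing the redo log up to a given LSN."""
                global log_flushed_lsn
                log_flushed_lsn = max(log_flushed_lsn, lsn)

            page_lsn = 100             # FIL_PAGE_LSN when step 1 is decided
            log_write_up_to(page_lsn)  # step 1: redo log durable up to 100

            page_lsn = 150             # a concurrent mtr modifies the page again

            # step 2 would now write a page whose LSN is ahead of the durable
            # log, violating the write-ahead-log rule:
            assert page_lsn > log_flushed_lsn
            ```

            The answer below explains why this does not happen in practice: the page latch is held for the duration of the write, so no mini-transaction can slip in between the LSN check and the write.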


            sayap, sorry, I did not notice your comment. Generally, https://mariadb.zulipchat.com/ would be a better platform for such discussions.

            In MDEV-25948 we actually backtracked a little and removed the log_flush_task that would potentially reduce the amount of calls to log_flush_up_to(). There were several improvements to page flushing performance in MariaDB 10.5.12 and 10.6.4, and our testing in MDEV-25451 is indicating rather stable throughput.

            The block mutex was removed already in MDEV-15053. I suppose that you mean the page latch? I think that we always hold the page latch when writing out a modified page. Before we write it, we will ensure that the FIL_PAGE_LSN is not ahead of the durable position of the write-ahead log. Page writes are generally optional (MDEV-24626 removed the last exception). Only for log checkpoints, we must advance the MIN(oldest_modification) by page writes.
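            The ordering described above can be sketched like this (a hedged Python illustration, not the actual buf0flu.cc code; the dict fields and return values are invented). The latch is taken first, and the write-ahead-log check happens while it is held, so the checked LSN cannot change under us:

            ```python
            def write_out(page, durable_log_lsn):
                """Sketch: take the page latch, then enforce the WAL rule
                before issuing the data-file write."""
                page["latched"] = True           # latch held for the whole write
                try:
                    if page["fil_page_lsn"] > durable_log_lsn:
                        return "wait-for-log"    # flush the redo log, then retry
                    return "written"
                finally:
                    page["latched"] = False

            page = {"fil_page_lsn": 120, "latched": False}
            print(write_out(page, 100))  # wait-for-log: the log must be flushed first
            print(write_out(page, 200))  # written
            ```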

            marko Marko Mäkelä added a comment

            People

              marko Marko Mäkelä
              axel Axel Schwenke