MariaDB Server · MDEV-35155

Performance degradation and instability observed on 10.6.19

Details

    Description

      I plan to migrate our MariaDB instances from `10.2.10` to `10.6.19` and have run some performance benchmarks. I observed that performance on `10.6.19` is not stable compared to `10.2.10`, especially for in-memory workloads.

      Here is my test setup.
      Test tool: sysbench 1.0.X
      OS: CentOS 7.9 x86_64
      MariaDB versions: 10.2.10 and 10.6.19
      Dataset: 10 tables with 5M rows each, each table ~1.2GB, total size ~12GB
      Almost all config options are the same, except that I removed some options which are deprecated/removed in 10.6, e.g. `innodb_buffer_pool_instances`, `innodb_page_cleaners`, `innodb-thread-concurrency`, `innodb_checksum_algorithm`, etc. (a sketch of the equivalent data-load step follows).
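      A minimal sketch of the data-load step for such a dataset, assuming the bundled sysbench 1.0 `oltp_read_write` script; host, credentials, and database name below are placeholders, not values from this report:

      ```sh
      # Hypothetical reconstruction of the dataset load (10 tables x 5M rows).
      # Connection parameters are assumptions for illustration only.
      sysbench oltp_read_write \
        --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=sbtest \
        --mysql-db=sbtest \
        --tables=10 --table-size=5000000 \
        prepare
      ```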

      Test 1:
      In-memory working set, with `innodb_buffer_pool_size`=188GB
      > NOTE:
      > TPS-X means running the sysbench `oltp_read_write.lua` test with X threads (a sketch of the run command follows this test's results).

      10.2.10 (TPS chart attached)

      10.6.19 (TPS chart attached)

      We can see periodic performance drops with version `10.6.19`. `10.6.19` stays stable only in the 4-thread case, while `10.2.10`'s performance is stable with 4, 8, 16, and 32 threads.
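      For reference, a hedged sketch of the run step behind the TPS-X curves: X maps to `--threads`, and `--report-interval=1` prints per-second TPS so the periodic dips become visible. The duration and connection details are assumptions:

      ```sh
      # Run oltp_read_write at each thread count and report TPS every second.
      for threads in 4 8 16 32; do
        sysbench oltp_read_write \
          --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=sbtest \
          --mysql-db=sbtest \
          --tables=10 --table-size=5000000 \
          --threads=$threads --time=600 --report-interval=1 \
          run
      done
      ```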

      Test 2:
      Disk-I/O-bound test with `innodb_buffer_pool_size=2G`

      10.2.10 (TPS chart attached)

      10.6.19 (TPS chart attached)

      You can see that `10.2.10` is also more stable than `10.6.19` here.
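      The only intended difference between Test 1 and Test 2 is the buffer pool size relative to the ~12GB dataset. A minimal sketch of the two server invocations, assuming the rest of the configuration lives in a shared `my.cnf` (paths are placeholders; the actual configuration file was not posted, and on 10.2.10 the binary is `mysqld`):

      ```sh
      # In-memory run: the ~12GB dataset fits entirely in the buffer pool.
      mariadbd --defaults-file=/etc/my.cnf --innodb-buffer-pool-size=188G

      # Disk-I/O-bound run: only ~2GB of the ~12GB dataset can be cached.
      mariadbd --defaults-file=/etc/my.cnf --innodb-buffer-pool-size=2G
      ```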

      Attachments

        1. image-2024-10-15-11-41-04-176.png (44 kB)
        2. image-2024-10-15-11-43-21-120.png (60 kB)
        3. image-2024-10-15-11-46-38-886.png (101 kB)
        4. image-2024-10-15-11-47-23-300.png (140 kB)
        5. screenshot-1.png (56 kB)
        6. screenshot-2.png (42 kB)
        7. image-2024-10-16-14-52-49-455.png (59 kB)
        8. image-2024-10-16-16-03-42-724.png (30 kB)
        9. screenshot-3.png (40 kB)
        10. screenshot-4.png (44 kB)
        11. screenshot-5.png (40 kB)
        12. screenshot-6.png (145 kB)
        13. screenshot-7.png (39 kB)
        14. 10.6.20_write_only.zip (1.08 MB)
        15. 10.6-March25..31.pdf (52 kB)
        16. timeseries-77bebe9eb08.png (23 kB)


          Activity

            marko Marko Mäkelä added a comment
            debarun recently worked on MDEV-36226, which is in the same generic area.
            axel Axel Schwenke added a comment (edited)

            This is a result from the continuously running regression tests. Something changed in InnoDB between March 28 and March 31 that both improved InnoDB write performance and fixed the I/O spike I mentioned in an earlier comment:

            10.6-March25..31.pdf

            (this is probably the effect of the fix for MDEV-36226)

            The exact commits were 31c06951c61 (Mar 28th) and 77bebe9eb08 (Mar 31st).


            marko Marko Mäkelä added a comment
            There are only two InnoDB changes between those commits: a change to a debug assertion (which only affects debug builds) and MDEV-36226. So it would seem very plausible that this problem was fixed by MDEV-36226.
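            A hedged way to reproduce this count from a MariaDB source checkout; the commit hashes are the ones quoted above, and the path filter is the standard InnoDB source directory rather than something stated in this ticket:

            ```sh
            # List the commits between the two builds that touch InnoDB code.
            git log --oneline 31c06951c61..77bebe9eb08 -- storage/innobase
            ```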

            debarun Debarun Banerjee added a comment
            Thanks axel, marko for the information. Right, MDEV-36226 improves free page availability in general and could improve overall read-write performance, especially in stall-like situations.

            I ran a 32-thread sysbench read-write test in sdp against 10.6.19 and the latest 10.6.22. My observation is that the TPS instability is improved in 10.6.22. Purge and undo page usage have also dropped a lot in the latest build, which could also be a factor.

            axel Can you please confirm whether the bug scenario/regression is fixed in the latest 10.6 (10.6.22)?
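            A quick, hedged way to observe the purge/undo backlog mentioned above while the read-write test runs (connection details are assumptions, not from this ticket):

            ```sh
            # "History list length" is the number of undo records not yet purged;
            # a steadily growing value means purge is falling behind.
            mariadb -e "SHOW ENGINE INNODB STATUS\G" | grep 'History list length'
            ```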
            axel Axel Schwenke added a comment

            > Axel Schwenke Can you please confirm if the bug scenario/regression is fixed in latest 10.6 (10.6.22)?

            The throughput stalls connected with page flushing have become less frequent, but they still exist. For example, I see a 15-second stall for a single-threaded (!) sysbench OLTP write-only workload with binlog:


            (full details here: http://g5.xentio.lan/benchmark-archive/MDEV-36226/250331.171739.regressiontest.mariadb-community.10.6/t_oltp_writes_innodb_binlog/plots/)

            I remember seeing more such stalls when I ran the regression test suite on bigger hardware. I have now started a comparison of 10.6.21 and current 10.6.

            I want to stress that the stall happens with single-threaded execution, and that there was 0 (zero) flushing activity prior to the stall. Flushing starts only when the checkpoint age is near the allowed maximum (not in the screenshot; checkpoint age was ~1.6GB for a 2GB redo log), and then in emergency (eager) mode. It looks as if adaptive flushing is still not working correctly.

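            A rough monitoring sketch for correlating such a stall with checkpoint age and dirty-page flushing (not the tooling behind the plots above; socket and credentials are assumptions). The checkpoint age is the difference between "Log sequence number" and "Last checkpoint at":

            ```sh
            # Sample the InnoDB log position, checkpoint position and dirty-page
            # count once per second while the write-only workload runs.
            while true; do
              mariadb -e "SHOW ENGINE INNODB STATUS\G" \
                | grep -E 'Log sequence number|Last checkpoint at'
              mariadb -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty'"
              sleep 1
            done
            ```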

            People

              debarun Debarun Banerjee
              lujinke Luke
              Votes: 1
              Watchers: 9

