MariaDB Server / MDEV-34462

SemiSync replication underperforming and stalling throughput

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 10.6.17
    • Fix Version/s: N/A
    • Component/s: Replication
    • Labels: None
    • Environment: Debian Bullseye

    Description

      We upgraded some MariaDB servers from 10.1.48 to 10.6.17 and noticed that, with semi-sync replication enabled, throughput stalled and context switching was heavily impacted, leading to poor performance across the board.

      By looking at `information_schema.processlist` we were able to see that most of the queries were stuck for several seconds in the `Waiting for semi-sync ACK from slave` state.
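The check described above can be approximated with a query along these lines. It is only a sketch: the state string is the one quoted in this report, and the column aliases are illustrative.

```sql
-- Count sessions currently waiting for a semi-sync ACK and show how long
-- the longest of them has been in that state (TIME is in seconds).
SELECT STATE,
       COUNT(*)  AS waiting_sessions,
       MAX(TIME) AS longest_wait_seconds
FROM information_schema.PROCESSLIST
WHERE STATE = 'Waiting for semi-sync ACK from slave'
GROUP BY STATE;

-- The semi-sync status counters on the primary give a complementary view.
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master%';
```

Running the first query repeatedly during a stall shows whether the number of waiting sessions keeps growing.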

      After disabling semisync replication on the primary server, everything came back to life.

      This is a high-throughput, high-QPS environment (80K QPS on average, 140K at peak, more than 2,500 sessions at any given time).

      We found a MySQL bug, reported several years ago, that looks related. Although the relevant patch can't be found at the equivalent position in the MariaDB code, I'm not confident that it hasn't been ported in some other way over the years.

      The only relevant change I've found in the upgrade docs between 10.1 and 10.6 is this one, but it doesn't look guilty by itself.

      Given that this caused serious performance degradation in our case after upgrading to 10.6, let us know if there is anything more we can do to help identify the root cause of the issue.

          Activity

            Fardelas Kostis Fardelas added a comment:

            For completeness, the current settings on our primary server are as follows:

            MariaDB [marvin_production]> show global variables like '%semi%';
            +---------------------------------------+--------------+
            | Variable_name                         | Value        |
            +---------------------------------------+--------------+
            | rpl_semi_sync_master_enabled          | OFF          |
            | rpl_semi_sync_master_timeout          | 2000         |
            | rpl_semi_sync_master_trace_level      | 32           |
            | rpl_semi_sync_master_wait_no_slave    | ON           |
            | rpl_semi_sync_master_wait_point       | AFTER_COMMIT |
            | rpl_semi_sync_slave_delay_master      | OFF          |
            | rpl_semi_sync_slave_enabled           | OFF          |
            | rpl_semi_sync_slave_kill_conn_timeout | 5            |
            | rpl_semi_sync_slave_trace_level       | 32           |
            +---------------------------------------+--------------+
            9 rows in set (0.001 sec)

            It's important to highlight that when we switched `rpl_semi_sync_master_enabled` to OFF, all problems went away. There were no networking changes or issues at the time of the problem, and the issue started manifesting immediately after switching over from the 10.1.48 server to the 10.6.17 server.
            knielsen Kristian Nielsen added a comment (edited):

            This could be a duplicate of MDEV-33551, which is fixed in 10.6.18.

            If so, a workaround (other than upgrading) may be to use AFTER_SYNC instead of AFTER_COMMIT, since AFTER_SYNC shouldn't suffer from the extreme thread contention of MDEV-33551. (These days, due to group commit, there isn't that much of a difference between AFTER_SYNC and AFTER_COMMIT.)

            Of course, disabling semi-sync will also work around the problem, if applicable.
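As a rough sketch, the two workarounds mentioned in this comment could be applied on the primary with statements like the ones below. This assumes a sufficiently privileged account. `SET GLOBAL` changes do not survive a server restart, so a permanent change would also need to go into the server configuration file.

```sql
-- Workaround 1: switch the semi-sync wait point from AFTER_COMMIT to AFTER_SYNC.
SET GLOBAL rpl_semi_sync_master_wait_point = 'AFTER_SYNC';

-- Confirm that the new value is in effect.
SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master_wait_point';

-- Workaround 2 (alternative): disable semi-sync on the primary altogether.
SET GLOBAL rpl_semi_sync_master_enabled = OFF;
```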


            Fardelas Kostis Fardelas added a comment:

            Thank you, Kristian. This closely resembles our case. We will try to reproduce it on our side and get back to you.

            knielsen Kristian Nielsen added a comment:

            Fardelas, did you find anything? Should I close this as a duplicate of MDEV-33551?

            Fardelas Kostis Fardelas added a comment:

            Hey knielsen, I wasn't able to reproduce the issue with 10.6.18, so feel free to close this as you mentioned. Thanks for your help.

            knielsen Kristian Nielsen added a comment:

            Closing as duplicate of MDEV-33551.

            People

              Assignee: knielsen Kristian Nielsen
              Reporter: Fardelas Kostis Fardelas
              Votes: 0
              Watchers: 5
