MariaDB Server / MDEV-34201

Excessive timeout from --binlog-commit-wait-count for FL_WAITED GTID in optimistic parallel replication


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.4(EOL)
    • Fix Version/s: None
    • Component/s: Replication
    • Labels: None

    Description

      The --binlog-commit-wait-count=N option delays commits until N commits can group-commit together.
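
      As a rough illustration of the rule above, the following toy model computes when the first commit group is released. It is a sketch, not server code: the function name group_release_time and the 100000 usec timeout are assumptions chosen for the example, and the real server has additional conditions that end the wait early.

```python
def group_release_time(queue_times_usec, wait_count, wait_usec):
    """Time (usec) at which the first commit group is released.

    Toy rule: the first queued commit waits until `wait_count` commits
    are queued, or until `wait_usec` has elapsed since it queued,
    whichever comes first.
    """
    times = sorted(queue_times_usec)
    deadline = times[0] + wait_usec
    if len(times) >= wait_count and times[wait_count - 1] <= deadline:
        # The N-th commit queued in time: the group goes immediately.
        return times[wait_count - 1]
    # Not enough commits arrived: the group waits out the full timeout.
    return deadline


# Three commits queue within the window: released when the third arrives.
print(group_release_time([0, 40, 90], wait_count=3, wait_usec=100_000))  # 90
# Only two commits queue: the group is held until the timeout expires.
print(group_release_time([0, 40], wait_count=3, wait_usec=100_000))  # 100000
```

      The second call is the case this issue is about: if later commits cannot queue, every group pays the full --binlog-commit-wait-usec delay.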

      In parallel replication, transactions must commit in order, so if some transaction T2 needs to wait for the commit of the previous transaction T1, this prevents any later transaction from group-committing with T1 and reaching the desired count of N commits in the group. To avoid excessive timeouts, an InnoDB row lock wait in T2 for a lock held by T1 aborts T1's group-commit wait and causes T1 to complete its commit immediately, without waiting for N commits in the group.

      However, another type of wait can cause the same problem. If T2 has the FL_WAITED flag set in its GTID (because T2 had a row lock wait on the original master), or if FL_ALLOW_PARALLEL is not set (because @@skip_parallel_replication was set on the master), then T2 on the slave will wait for T1 to commit first, which prevents T1 from getting N commits in its group. T1 then waits out the full timeout (--binlog-commit-wait-usec), which can reduce slave throughput in optimistic parallel replication.

      Aggressive parallel replication mode is not affected, as it does not do these waits.

      I think conservative mode is also mostly unaffected, since its transactions mostly wait for earlier transactions to start committing, which should not block group commit. This needs to be investigated. It may also be worth investigating whether other types of waits are affected similarly.

      It would be better if these kinds of waits triggered an immediate commit of the earlier transaction, just like InnoDB row lock waits do.
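
      The stall and the suggested remedy can be sketched with a small model of in-order commit groups. This is a simplified illustration under stated assumptions, not the server's algorithm: Trx, simulate and commit_early_on_wait are hypothetical names, execution time is ignored, and a transaction marked waits_for_prior_commit stands in for a GTID with FL_WAITED set or FL_ALLOW_PARALLEL unset.

```python
from dataclasses import dataclass


@dataclass
class Trx:
    name: str
    # True models FL_WAITED set or FL_ALLOW_PARALLEL unset: the transaction
    # waits for the previous transaction to commit before it can proceed.
    waits_for_prior_commit: bool = False


def simulate(trxs, wait_count, wait_usec, commit_early_on_wait=False):
    """Form in-order commit groups; return (names, release_usec, reason)."""
    groups = []
    now = 0
    i = 0
    while i < len(trxs):
        members = [trxs[i].name]
        j = i + 1
        # Later transactions join until the group is full, or until one of
        # them must wait for the previous commit, which closes the group.
        while (j < len(trxs) and len(members) < wait_count
               and not trxs[j].waits_for_prior_commit):
            members.append(trxs[j].name)
            j += 1
        if len(members) >= wait_count:
            reason = "full"
        elif j < len(trxs) and commit_early_on_wait:
            # Suggested behaviour: the wait releases the earlier group at once.
            reason = "early commit, " + trxs[j].name + " is waiting"
        else:
            blocker = trxs[j].name if j < len(trxs) else None
            reason = ("timeout, blocked by " + blocker if blocker
                      else "timeout, end of stream")
            now += wait_usec
        groups.append((members, now, reason))
        i = j
    return groups


trxs = [Trx("T1"), Trx("T2", waits_for_prior_commit=True),
        Trx("T3"), Trx("T4"), Trx("T5")]

for group in simulate(trxs, wait_count=2, wait_usec=100_000):
    print(group)
for group in simulate(trxs, wait_count=2, wait_usec=100_000,
                      commit_early_on_wait=True):
    print(group)
```

      In the first run T1 is held for the full 100000 usec because T2 cannot join its group; in the second run T1 is released at time 0, which is the behaviour this issue asks for.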

      The attached test case rpl_commit_wait.test shows the issue: in optimistic mode it causes commit wait timeouts. Changing the test to aggressive mode removes the excessive timeouts.

          People

            Assignee: Unassigned
            Reporter: Kristian Nielsen (knielsen)