Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-4506

MWL#184: Parallel replication of group-committed transactions

Details

    • Task
    • Status: Closed (View Workflow)
    • Critical
    • Resolution: Fixed
    • 10.0.5
    • None
    • None

    Description

      This task is the old idea from here:

      http://askmonty.org/worklog/Server-RawIdeaBin/?tid=184

      Current state:

      The basic things have been implemented in a working prototype. Initial
      benchmarks are promising:

      https://lists.launchpad.net/maria-developers/msg05869.html

      The main missing things are to fix up remaining part of the existing
      replication code to correctly handle things like stopping the slave, error
      handling, and relay-log.info updates.

      Design:

      The idea behind this task is to detect on the master that two or more
      transactions are independent and can be replicated in parallel on the
      slave.

      This uses binlog group commit. When two transactions are group-committed
      together, they are known to be non-conflicting (otherwise one of them would
      have had to wait for the other to commit first and release its row locks).
      The basic idea is the same as the MySQL 5.7 feature
      slave-parallel-type=LOGICAL_CLOCK:

      http://dev.mysql.com/doc/refman/5.7/en/replication-options-slave.html#sysvar_slave_parallel_type

      The GTID event attached to every event group in the binlog is extended with a
      64-bit number, the group commit id. The id is incremented after each group
      commit. Thus, on the slave, if two transactions have the same group commit id,
      we can execute them in parallel.

      A new flag FL_GROUP_COMMIT_ID is introduced for the GTID event. The group
      commit id is only present if this flag is set (no group commit id is assigned
      when the group commit contains only a single transaction). When the group
      commit id is present, the size of the GTID event is increased by 2 bytes
      (there were already 6 unused bytes present to preserve backwards compatibility
      with MySQL slaves).

      If a slave connects that does not understand the new feature (older MariaDB or
      any Oracle MySQL), backwards compatibility is ensured, since in this case the
      GTID event is replaced transparently with a BEGIN event, as normal.

      Additionally, if two transactions have GTIDs with different replication domain
      ids, this means that the user has marked that they are independent and can be
      executed in parallel on a slave. Domain ids are already implemented in the
      global transaction ID feature, MDEV-26. This task extends it by actually
      executing such transactions in parallel on the slave, if requested by the
      user.

      Implementation:

      The parallel replication is enabled by setting @@slave_parallel_threads to a
      value of 2 or greater. The variable can be changed dynamically (without
      restarting the server), however all slave threads must be stopped first with
      STOP SLAVE. When parallel replication is enabled, a pool of that many worker
      threads is set up. Each worker thread can potentially execute a binlog event
      in parallel with all other workers, meaning that the value of
      @@slave_parallel_threads gives the maximum potential parallelism of the slave.

      When parallel replication is enabled, the SQL thread no longer executes binlog
      events directly. Instead, each event is queued for one of the worker threads
      to execute. If a transaction is found to be independent of already executing
      transactions, a new worker thread is allocated from the pool, if available,
      and the transaction is queued for parallel execution. Otherwise, the
      transaction is queued for serial executing in an already allocated worker
      thread, and/or it is marked so that the worker thread will wait for all prior
      dependent transactions to complete before starting execution.

      When two transactions with the same replication domain id are replicated in
      parallel (because they share the same commit id), they are still forced to be
      committed in the same order on the slave as on the master. This is important
      for correctness. Without having same commit order, it would be possible for an
      application to see transaction A committed while B was not on one server,
      while on another server B was seen as committed but not A. Preserving commit
      order is also necessary to make promotion of a slave to a new master work;
      this requires that the binlog order is the same on all involved servers.

      When two consecutive transactions A and B are executed in parallel on the
      slave and must preserve commit order, transaction B is marked internally as
      having to wait for A to commit first. If B happens to reach the COMMIT step
      first, it waits for A to commit first. The binlog group commit code is
      extended to take this waiting into account. Thus, when A later gets ready to
      (group) commit, it checks if any transactions are waiting. When it finds B, it
      puts both A itself and B on the group commit list. This ensures that binlog
      group commit is still possible on the slave even though the commits are
      actually serialised, which is an important optimisation when sync_binlog=1.

      When two transactions are replicated in parallel because of different
      replication domain id, there is no requirement to preserve the commit order.
      The different domain ids is an explicit declaration by the user that the
      transactions are independent from the point of view of the application and may
      be re-ordered without violating correctness. Parallel replication of
      transactions in different replication domains is only enabled when using
      global transaction ID (master_use_gtid=slave_pos|current_pos). The current
      GTID position is maintained per domain id, so promotion of slave to master
      works regardless of binlog order.

      Parallel replication can be used both with and without global transaction ID
      enabled.

      Use cases:

      Using the binlog group commit technique, parallel replication can speed up
      replication on the slave automatically, without any other visible changes or
      limitations for applications. The impact of this will however depend on how
      many transactions end up in the same group commit.

      If the commit step on the master is slow (eg. --sync-binlog=1 and
      --innodb-flush-logs-at-trx-commit=1 with slow disk system) and number of
      transactions per second is high, then there will generally be many
      transactions in each group commit, and high opportunity for parallelism on the
      slaves. The average number of transactions per binlog group commit can be
      found by dividing binlog_commits with binlog_group_commits (from SHOW STATUS).

      But if there are fewer but larger transactions, and/or the commit step is fast
      (battery-backed up RAID controller), then there may be too little opportunity
      for parallelism on the slave. To get more transactions included in each group
      commit, the variables @@binlog_commit_wait_count and @@binlog_commit_wait_usec
      can be set. When set, the commit step will wait at most
      @@binlog_commit_wait_usec microseconds for at least @@binlog_commit_wait_count
      transactions to group commit together. This may also decrease disk pressure on
      the master (by reducing number of syncs to disk during commit). However, if
      there is not enough parallelism in the load on the master, setting these
      variables too high may significantly decrease throughput on the master, as the
      database sits idle waiting for commits to arrive that never do.

      When using parallel replication together with global transaction ID, more
      opportunities for parallelism on the slave can be explicitly declared by the
      user/application by setting the @@gtid_domain_id. When the domain id is set
      different for two transactions, this is an explicit statement that those two
      transactions can not conflict, and that the application can tolerate them
      being committed in any order on a slave. This can then be used on slaves for
      parallel replication, if enabled with @@slave_parallel_threads.

      One use case is for long-running queries like a big ALTER TABLE or something
      like that:

          SET @@gtid_domain_id=1;
          ALTER TABLE my_big_table ...

      Normally, such a long-running ALTER TABLE would block subsequent events from
      replicating, causing significant delay in replication. By assigning that
      statement to a different replication domain, other events (in default domain
      0) can then freely replicate in parallel. It is then the responsibility of the
      user to ensure that the ALTER statement has time to complete on all slaves
      before any queries are done on the master that depends on the new table.

      Another use case is to allow two independent applications to replicate
      independently of each other. For example, they might use each their own schema
      and never touch the schema of the other.

      In such cases, the user can set @@gtid_domain_id=1 in one application and
      @@gtid_domain_id=2 in the other, and the transactions from the two
      applications will then be able to replicate in parallel on slaves. This is
      similar to the multi-threaded slave feature of MySQL 5.6; however note that it
      is not limited to using different schema. Any transactions can be marked for
      independent parallel replication, even against the same table, as long as the
      user/application ensures that the transactions are in fact independent.

      Changing @@gtid_domain_id currently requires SUPER privilege, which limits
      this technique to some degree. The reason for the SUPER requirement is that
      unlimited change of @@gtid_domain_id can easily break replication.

      Attachments

        Issue Links

          Activity

            knielsen Kristian Nielsen created issue -
            knielsen Kristian Nielsen made changes -
            Field Original Value New Value
            Assignee Kristian Nielsen [ knielsen ]
            knielsen Kristian Nielsen made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            knielsen Kristian Nielsen made changes -
            Priority Trivial [ 5 ] Major [ 3 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.0.5 [ 13201 ]
            serg Sergei Golubchik made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            knielsen Kristian Nielsen made changes -
            Description This task is the old idea from here:

                http://askmonty.org/worklog/Server-RawIdeaBin/?tid=184

            When two (or more) transactions are group committed on the master, we know
            that they do not conflict, because any such conflict would have forced one to
            wait for the other. So if we mark in the binlog that a group of transactions
            were group committed together, we can on the slave apply those events in
            parallel.

            I think the main challence with this task is to come up with a good framework
            for scheduling different event groups to different threads. We need to come up
            with something simple but general, that can also work for the other types of
            parallel replication we need (parallel apply of events in different
            replication domains, and TaoBao-style parallel replication of row events with
            non-conflicting unique key values).

            With global transaction ID now in 10.0, I think the GTID event should be used
            when logging events to identify which events belong to the same group. There
            is actually redundant space in the GTID event, it should be used to add some
            ID that is the same for events in the same group commit, but different for
            different group commits. (This is better than some kind of begin/end events
            around the group. Because these will cause problems if they are filtered out,
            interleaved with events from other replication domains, etc).

            This task should probably use my patch for in-order commit:
            https://lists.launchpad.net/maria-developers/msg05057.html .
            This patch adds a facility to group commit that allows a later transaction to
            delay commit until after commit of the prior transaction, while perserving
            group commit. This is needed to preserve same commit order on the slave as on
            the master, which is required by global transaction ID.

            To get good parallelisation, we need to implement some facility on the master
            for optionally waiting for more transactions to group commit together. I
            imagine something like wait at most M milliseconds for at least N transactions
            to group commit together.
            This task is the old idea from here:

            http://askmonty.org/worklog/Server-RawIdeaBin/?tid=184


            Current state:

            The basic things have been implemented in a working prototype. Initial
            benchmarks are promising:

                https://lists.launchpad.net/maria-developers/msg05869.html

            The main missing things are to fix up remaining part of the existing
            replication code to correctly handle things like stopping the slave, error
            handling, and relay-log.info updates.


            Design:

            The idea behind this task is to detect on the master that two or more
            transactions are independent and can be replicated in parallel on the
            slave.

            This uses binlog group commit. When two transactions are group-committed
            together, they are known to be non-conflicting (otherwise one of them would
            have had to wait for the other to commit first and release its row locks).
            The basic idea is the same as the MySQL 5.7 feature
            slave-parallel-type=LOGICAL_CLOCK:

                http://dev.mysql.com/doc/refman/5.7/en/replication-options-slave.html#sysvar_slave_parallel_type

            The GTID event attached to every event group in the binlog is extended with a
            64-bit number, the group commit id. The id is incremented after each group
            commit. Thus, on the slave, if two transactions have the same group commit id,
            we can execute them in parallel.

            A new flag FL_GROUP_COMMIT_ID is introduced for the GTID event. The group
            commit id is only present if this flag is set (no group commit id is assigned
            when the group commit contains only a single transaction). When the group
            commit id is present, the size of the GTID event is increased by 2 bytes
            (there were already 6 unused bytes present to preserve backwards compatibility
            with MySQL slaves).

            If a slave connects that does not understand the new feature (older MariaDB or
            any Oracle MySQL), backwards compatibility is ensured, since in this case the
            GTID event is replaced transparently with a BEGIN event, as normal.

            Additionally, if two transactions have GTIDs with different replication domain
            ids, this means that the user has marked that they are independent and can be
            executed in parallel on a slave. Domain ids are already implemented in the
            global transaction ID feature, MDEV-26. This task extends it by actually
            executing such transactions in parallel on the slave, if requested by the
            user.


            Implementation:

            The parallel replication is enabled by setting @@slave_parallel_threads to a
            value of 2 or greater. The variable can be changed dynamically (without
            restarting the server), however all slave threads must be stopped first with
            STOP SLAVE. When parallel replication is enabled, a pool of that many worker
            threads is set up. Each worker thread can potentially execute a binlog event
            in parallel with all other workers, meaning that the value of
            @@slave_parallel_threads gives the maximum potential parallelism of the slave.

            When parallel replication is enabled, the SQL thread no longer executes binlog
            events directly. Instead, each event is queued for one of the worker threads
            to execute. If a transaction is found to be independent of already executing
            transactions, a new worker thread is allocated from the pool, if available,
            and the transaction is queued for parallel execution. Otherwise, the
            transaction is queued for serial executing in an already allocated worker
            thread, and/or it is marked so that the worker thread will wait for all prior
            dependent transactions to complete before starting execution.

            When two transactions with the same replication domain id are replicated in
            parallel (because they share the same commit id), they are still forced to be
            committed in the same order on the slave as on the master. This is important
            for correctness. Without having same commit order, it would be possible for an
            application to see transaction A committed while B was not on one server,
            while on another server B was seen as committed but not A. Preserving commit
            order is also necessary to make promotion of a slave to a new master work;
            this requires that the binlog order is the same on all involved servers.

            When two consecutive transactions A and B are executed in parallel on the
            slave and must preserve commit order, transaction B is marked internally as
            having to wait for A to commit first. If B happens to reach the COMMIT step
            first, it waits for A to commit first. The binlog group commit code is
            extended to take this waiting into account. Thus, when A later gets ready to
            (group) commit, it checks if any transactions are waiting. When it finds B, it
            puts both A itself and B on the group commit list. This ensures that binlog
            group commit is still possible on the slave even though the commits are
            actually serialised, which is an important optimisation when sync_binlog=1.

            When two transactions are replicated in parallel because of different
            replication domain id, there is no requirement to preserve the commit order.
            The different domain ids is an explicit declaration by the user that the
            transactions are independent from the point of view of the application and may
            be re-ordered without violating correctness. Parallel replication of
            transactions in different replication domains is only enabled when using
            global transaction ID (master_use_gtid=slave_pos|current_pos). The current
            GTID position is maintained per domain id, so promotion of slave to master
            works regardless of binlog order.

            Parallel replication can be used both with and without global transaction ID
            enabled.


            Use cases:

            Using the binlog group commit technique, parallel replication can speed up
            replication on the slave automatically, without any other visible changes or
            limitations for applications. The impact of this will however depend on how
            many transactions end up in the same group commit.

            If the commit step on the master is slow (eg. --sync-binlog=1 and
            --innodb-flush-logs-at-trx-commit=1 with slow disk system) and number of
            transactions per second is high, then there will generally be many
            transactions in each group commit, and high opportunity for parallelism on the
            slaves. The average number of transactions per binlog group commit can be
            found by dividing binlog_commits with binlog_group_commits (from SHOW STATUS).

            But if there are fewer but larger transactions, and/or the commit step is fast
            (battery-backed up RAID controller), then there may be too little opportunity
            for parallelism on the slave. To get more transactions included in each group
            commit, the variables @@binlog_commit_wait_count and @@binlog_commit_wait_usec
            can be set. When set, the commit step will wait at most
            @@binlog_commit_wait_usec microseconds for at least @@binlog_commit_wait_count
            transactions to group commit together. This may also decrease disk pressure on
            the master (by reducing number of syncs to disk during commit). However, if
            there is not enough parallelism in the load on the master, setting these
            variables too high may significantly decrease throughput on the master, as the
            database sits idle waiting for commits to arrive that never do.

            When using parallel replication together with global transaction ID, more
            opportunities for parallelism on the slave can be explicitly declared by the
            user/application by setting the @@gtid_domain_id. When the domain id is set
            different for two transactions, this is an explicit statement that those two
            transactions can not conflict, and that the application can tolerate them
            being committed in any order on a slave. This can then be used on slaves for
            parallel replication, if enabled with @@slave_parallel_threads.

            One use case is for long-running queries like a big ALTER TABLE or something
            like that:

            {noformat}
                SET @@gtid_domain_id=1;
                ALTER TABLE my_big_table ...
            {noformat}

            Normally, such a long-running ALTER TABLE would block subsequent events from
            replicating, causing significant delay in replication. By assigning that
            statement to a different replication domain, other events (in default domain
            0) can then freely replicate in parallel. It is then the responsibility of the
            user to ensure that the ALTER statement has time to complete on all slaves
            before any queries are done on the master that depends on the new table.

            Another use case is to allow two independent applications to replicate
            independently of each other. For example, they might use each their own schema
            and never touch the schema of the other.

            In such cases, the user can set @@gtid_domain_id=1 in one application and
            @@gtid_domain_id=2 in the other, and the transactions from the two
            applications will then be able to replicate in parallel on slaves. This is
            similar to the multi-threaded slave feature of MySQL 5.6; however note that it
            is not limited to using different schema. Any transactions can be marked for
            independent parallel replication, even against the same table, as long as the
            user/application ensures that the transactions are in fact independent.

            Changing @@gtid_domain_id currently requires SUPER privilege, which limits
            this technique to some degree. The reason for the SUPER requirement is that
            unlimited change of @@gtid_domain_id can easily break replication.
            jeremycole Jeremy Cole added a comment -

            Kristian, can you comment on whether we should expect any problems combining parallel replication with GTID strict mode?

            jeremycole Jeremy Cole added a comment - Kristian, can you comment on whether we should expect any problems combining parallel replication with GTID strict mode?

            > Kristian, can you comment on whether we should expect any problems combining
            > parallel replication with GTID strict mode?

            On the contrary, using GTID, and GTID strict mode, is the recommended way to
            work with parallel replication.

            knielsen Kristian Nielsen added a comment - > Kristian, can you comment on whether we should expect any problems combining > parallel replication with GTID strict mode? On the contrary, using GTID, and GTID strict mode, is the recommended way to work with parallel replication.
            elenst Elena Stepanova made changes -
            elenst Elena Stepanova made changes -
            elenst Elena Stepanova made changes -
            elenst Elena Stepanova made changes -
            elenst Elena Stepanova made changes -
            elenst Elena Stepanova made changes -
            elenst Elena Stepanova made changes -
            elenst Elena Stepanova made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 10.0.6 [ 13202 ]
            Fix Version/s 10.0.5 [ 13201 ]

            The code is pushed into 10.0.5 (though it's far from complete)

            knielsen Kristian Nielsen added a comment - The code is pushed into 10.0.5 (though it's far from complete)
            knielsen Kristian Nielsen made changes -
            Fix Version/s 10.0.5 [ 13201 ]
            Fix Version/s 10.0.6 [ 13202 ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]
            knielsen Kristian Nielsen made changes -
            knielsen Kristian Nielsen made changes -
            knielsen Kristian Nielsen made changes -
            knielsen Kristian Nielsen made changes -
            serg Sergei Golubchik made changes -
            Workflow defaullt [ 27369 ] MariaDB v2 [ 42974 ]
            ratzpo Rasmus Johansson (Inactive) made changes -
            Workflow MariaDB v2 [ 42974 ] MariaDB v3 [ 61590 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 61590 ] MariaDB v4 [ 132123 ]

            People

              knielsen Kristian Nielsen
              knielsen Kristian Nielsen
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.