MariaDB Server / MDEV-30421

SAMU-64 Allow administrators to enable or disable parallel replication on a per-table basis

Details

    • Type: New Feature
    • Status: Needs Feedback
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: 12.1
    • Component/s: Replication
    • Labels: None

    Description


      A per-domain dedicated thread processes ordered transactions. The thread is reserved from the total number of domain threads (controlled by slave_parallel_threads and slave_domain_parallel_threads). Whether an event goes to the ordered thread depends on the FL_ALLOW_PARALLEL flag as well as several other conditions. FL_ALLOW_PARALLEL is passed from the master and is set on the event according to the master's configuration directives. To allow the dedicated thread on a slave, it must be enabled explicitly with a configuration directive:

        set global slave_ordered_thread= 1;
      

      Originally this was controlled by the skip_parallel_replication session variable, which can be changed per statement. This patch adds several more directives to control it at the per-schema and per-table level:

        parallel_do_db
        parallel_do_table
        parallel_ignore_db
        parallel_ignore_table
        parallel_wild_do_table
        parallel_wild_ignore_table
      

      Each directive is a comma-separated list of fully qualified table names. Spaces after a comma are ignored (but not before it).

      "Table" directives take precedence over "db" directives. "Do" directives take precedence over "ignore" directives. "Wild" directives are checked if "do" and "ignore" directives did not match.

      If none of the above directives is present, everything is considered parallel. If any of them is present and the table does not match anything in the lists, it is considered ordered.

      Examples:

        set @@global.parallel_do_db= "db_parallel";
        set @@global.parallel_ignore_db= "db_serial";
        set global parallel_do_table= "db_serial.t3,  db_serial.t1";
        set global parallel_wild_ignore_table= "db_parallel.non_parallel_%";
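
      To make the precedence concrete, here is a minimal C++ sketch of the lookup. The ParallelFilter structure and its helpers are hypothetical illustrations of the stated rules, not the server's actual implementation:

        #include <string>
        #include <vector>

        // Hypothetical stand-in for the directive lists above.
        struct ParallelFilter {
          std::vector<std::string> do_db, ignore_db;
          std::vector<std::string> do_table, ignore_table;  // "db.table"
          std::vector<std::string> wild_do_table, wild_ignore_table;

          static bool in(const std::vector<std::string> &v,
                         const std::string &s) {
            for (const auto &e : v) if (e == s) return true;
            return false;
          }
          // Tiny wildcard matcher: handles only a trailing '%', enough for
          // the example patterns above (real patterns also support '_').
          static bool wild_in(const std::vector<std::string> &v,
                              const std::string &s) {
            for (const auto &p : v) {
              auto pos = p.find('%');
              if (pos == std::string::npos ? p == s
                                           : s.compare(0, pos, p, 0, pos) == 0)
                return true;
            }
            return false;
          }
          bool empty() const {
            return do_db.empty() && ignore_db.empty() && do_table.empty() &&
                   ignore_table.empty() && wild_do_table.empty() &&
                   wild_ignore_table.empty();
          }
          // true => parallel, false => ordered
          bool is_parallel(const std::string &db,
                           const std::string &table) const {
            if (empty()) return true;               // no directives: all parallel
            const std::string fq = db + "." + table;
            if (in(do_table, fq)) return true;      // "table" beats "db",
            if (in(ignore_table, fq)) return false; // "do" beats "ignore"
            if (in(do_db, db)) return true;
            if (in(ignore_db, db)) return false;
            if (wild_in(wild_do_table, fq)) return true;   // "wild" checked last
            if (wild_in(wild_ignore_table, fq)) return false;
            return false;                 // directives set but no match: ordered
          }
        };

      Under this reading of the rules, with the example settings above, db_serial.t1 comes out parallel (a do_table match) and db_serial.t2 comes out ordered (an ignore_db match). How a wild pattern interacts with a matching "db" directive, as in the db_parallel example, is a detail the description leaves open.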
      

      The normal behavior of an ordered transaction is to wait, before starting, for all prior transactions to commit: they go into different commit groups. But since all ordered transactions (within one domain) go to a single thread, we can lift that restriction with this directive on the slave:

        set global slave_ordered_dont_wait= 1;
      

      When set, events without the explicit FL_WAITED flag that go to the ordered thread nonetheless accept optimistic speculation: they get into the same commit group as parallel events, i.e. an ordered event is executed in parallel with parallel events.
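
      A rough sketch of that waiting decision, reusing the flag names from the description; the numeric flag values, the plain-boolean globals, and the helper function are illustrative assumptions:

        #include <cstdint>

        // Flags as described above; the values are illustrative only.
        constexpr uint16_t FL_ALLOW_PARALLEL = 1 << 0;  // set by master config
        constexpr uint16_t FL_WAITED         = 1 << 1;  // explicit wait request

        bool slave_ordered_thread    = true;  // set global slave_ordered_thread= 1
        bool slave_ordered_dont_wait = true;  // set global slave_ordered_dont_wait= 1

        // Whether an event routed to the ordered thread must wait for all
        // prior transactions to commit (a separate commit group), or may
        // speculate optimistically in the same commit group as parallel events.
        bool ordered_event_must_wait(uint16_t flags) {
          if (flags & FL_WAITED)
            return true;                    // explicit flag always waits
          return !slave_ordered_dont_wait;  // otherwise the global decides
        }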

      Activity

            Rob Schwyzer (Inactive) added a comment -

            Elkin, following up on my last post: the added complexity of expanding this via GTID domain ID could be a little rough.

            The default assumption is that different GTID domains exist for cases where multiple datasets with no interdependencies of any kind live on the same MariaDB Server. This differs from what we're discussing in this MDEV.

            For the original implementation, ignoring the above was fine because the end user does not use GTID domains, so it was safe for us to "abuse" GTID domain ID functionality to handle this: we know, intrinsically, that we still need to synchronize between that GTID domain and the main one.

            However, for more generalized use cases where end users already have multiple GTID domains in use, there is the question of how to map things. For example, say I have gtid_domain_id=1, gtid_domain_id=2, and gtid_domain_id=3 on my MariaDB Server 10.6. These are all unique replication streams which do not depend on each other at all and can run completely in parallel with each other.

            Now say that I am ready to upgrade to a MariaDB version that supports something like this MDEV-30421, and I have a table in gtid_domain_id=2 which runs afoul of slave_parallel_mode=optimistic, so I want to separate that table into a serial replication setup.

            I think the real problem to solve is how we'd handle the above. Just creating gtid_domain_id=99, configuring that table to use single-threaded replication, and having that synchronize with gtid_domain_id=2 might be a little difficult.

            It may be worth exploring other nomenclature pathways for this, whether that means adding to the GTID (e.g. domain_id-stream_id-server_id-position) or using server_id instead (maybe updating the documentation on that to make it a little more abstract), as that might map a little closer to what we're doing logically here. From a technical standpoint that seems to have a lot of obstacles as well, not least that it would raise major questions about how to configure primaries and replicas such that when a primary fails, a replica can be promoted without needing its configuration changed extensively.

            Is this more of what you were getting at, Elkin?

            Andrei Elkin added a comment -

            rob.schwyzer@mariadb.com, on the points that you raise:
            1. Indeed, something like [gtid-domain-99] is imaginable and achievable, including the domain dependency.
            2. You're right that if some form of synchronization between the main domain and the "99" domain is necessary, then SET @@session.skip_parallel_replication=1 already provides correctness, that is, for the general case where the DBA is unaware of synchronization properties (such as two transactions from different domains touching common data, or having to commit in a certain order).
            3. I did not consider the mapped domain's dependency on the main one. I feel it's a productive area of research. Your consideration of server_id hints at the notion of a sub-domain, one that also inherits the new property of arbitrary size. Transactions assigned to a sub-domain must serialize their commit with the rest of the domain.

            I'd like to spend a little time proving to myself that it's not a trap.

            Andrei Elkin added a comment -

            To expand on point 3, we don't have to introduce the sub-domain formally. In fact, since MDEV-11675 we have had functionality to distribute quasi-parallelizable load while respecting certain dependencies. Later, MDEV-33668 extended that to XA transactions: for example, two prepared transactions T_1(X) --> T_k(X) having the same xid X are scheduled to the same worker.
            Let's exploit that to effectively construct multiple "sub-domains", each served by a single worker, while the commits of the domain's transactions overall respect the binlog order of the domain. E.g. a setting like

            set @@global.replicate_db_to_subdomain= "db_A->sd1";
            set @@global.replicate_db_to_subdomain= "db_B->sd2";
            

            implements 2 sub-domains. The sub-domain id can be understood as a worker index, which suggests encoding it as an integer into Gtid_log_event. GTIDs marked this way would be handled by the parallel slave distributor (aka SQL/Driver) thread similarly to ALTER fragments or XA parts.
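
            A toy C++ illustration of the proposed mapping; only the replicate_db_to_subdomain setting and the worker-index idea come from the comment, while the parsed table and the dispatch function below are hypothetical:

            #include <iostream>
            #include <map>
            #include <string>

            // Parsed form of the proposed "db->sdN" assignments.
            std::map<std::string, int> subdomain_of = {
              {"db_A", 1},  // replicate_db_to_subdomain= "db_A->sd1"
              {"db_B", 2},  // replicate_db_to_subdomain= "db_B->sd2"
            };

            // The sub-domain id acts as a worker index: all transactions on
            // db_A are pinned to worker 1, db_B to worker 2, and everything
            // else is scheduled as usual (-1 meaning "any worker").
            int worker_for(const std::string &db) {
              auto it = subdomain_of.find(db);
              return it == subdomain_of.end() ? -1 : it->second;
            }

            int main() {
              std::cout << worker_for("db_A") << "\n";  // 1: pinned worker
              std::cout << worker_for("db_C") << "\n";  // -1: normal scheduling
            }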


            Rob Schwyzer (Inactive) added a comment -

            That sounds promising!
            Andrei Elkin added a comment - edited

            Let's sum up the discussion points.
            The ticket aims at finding methods to improve parallel slave performance in cases where the master's binlog, i.e. the slave's workload, consists of dependent/conflicting transactions.

            When conflicts can be identified and localized within sets of databases or tables, methods like the one in the description or the later proposed "sub-domain" might make sense. As both feature single-threaded applying of conflicting transactions, these methods can at best be qualified as hopeful. Please read on for another approach that really scales up.

            But first, a few remarks on the already discussed candidates.
            Considering the one in the description, I would still prefer to convert it so that the master-side setting defines GTID domains. That is, the proposed
              set global parallel_do_table= "db_serial.t3, db_serial.t1";
            would be converted into something like
              set global binlog_do_tables_to_domain= 99, "db_serial.t3, db_serial.t1";
            where the first number in the assigned value, 99, is the "problematic" gtid domain to binlog with.
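
            In other words (a hypothetical sketch; binlog_do_tables_to_domain does not exist today, and the real decision point would sit in the master's binlogging code):

            #include <cstdint>
            #include <set>
            #include <string>

            // Parsed form of the hypothetical
            //   set global binlog_do_tables_to_domain= 99, "db_serial.t3, db_serial.t1";
            const uint32_t problematic_domain = 99;
            const std::set<std::string> problematic_tables = {"db_serial.t3",
                                                              "db_serial.t1"};
            const uint32_t default_domain = 0;  // the server's gtid_domain_id

            // Pick the gtid domain to binlog a transaction with: if it touches
            // any listed table, the whole transaction goes to the
            // "problematic" domain.
            uint32_t domain_for(const std::set<std::string> &tables_updated) {
              for (const auto &t : tables_updated)
                if (problematic_tables.count(t))
                  return problematic_domain;
              return default_domain;
            }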

            This measure naturally partitions the binlog into domains, so the slave side might like, or simply need, an option to control the size of the worker pool for a specific domain, that is, to have a single worker handle the conflicting stream of transactions. That can be arranged technically, perhaps borrowing a pattern from multi-source replication.

            The sub-domain method, on the other hand, requires neither master-side changes nor the per-domain pool-size tweak.

            As noted, it is fair to suspect that neither of the two methods may achieve any better performance in a practical deployment. Indeed, their benefit shrinks as the percentage of conflicting transactions in the slave workload grows. Obviously we can blame their single-threadedness, but it is also about the granularity of the conflicting objects, which here are entire transactions.

            Back when these two methods were actively discussed, a more general approach also crossed the minds of a few people, incl. serg, and at a recent replication team meeting we came across it again. It is this: why don't we compute, or log, dependency information into replication (ROW-format) events? With that in hand, a transaction would be executed so as to respect its dependencies. The dependency can be defined at a granularity as fine as the record level. E.g. T2 depends on T1 (think of 1 and 2 as gtid:s)

            T2(r_k,r_m) -> T1(r_m)

            via the common record r_m that is modified by both. Here _k identifies a record in the table.
            Assume T1 is the very first to execute on the slave. At scheduling time, T1's record id r_m^1 would be registered in some conflict set of such ids (^1 and ^2 denote which transaction operates on the record). While T1 is running, T2 comes in for scheduling. Its r_k and r_m^2 would also be checked against the conflict set: r_k would be accepted right away, but r_m^2 would only be queued for acceptance. T2 would then start running, first on r_k, and afterward switch to r_m^2. At that point, depending on T1's pace, the record would either be found available or not yet; in the latter case T2 waits.
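
            A simplified C++ model of that conflict set, with all names and structures invented for illustration; the real mechanism would live in the parallel slave's scheduler:

            #include <cstdint>
            #include <map>
            #include <set>
            #include <string>

            using Gtid = uint64_t;
            using RecordId = std::string;  // e.g. PK value encoded in the binlog

            // Conflict set: record id -> gtid of the transaction owning it.
            std::map<RecordId, Gtid> conflict_set;

            // Register a transaction's record at scheduling time. Returns true
            // if the record is free (accepted right away), false if the
            // transaction must queue behind the current owner and wait before
            // touching this record.
            bool try_acquire(Gtid trx, const RecordId &rec) {
              auto [it, inserted] = conflict_set.emplace(rec, trx);
              return inserted || it->second == trx;
            }

            // On commit the transaction releases its records, waking waiters.
            void release(Gtid trx, const std::set<RecordId> &recs) {
              for (const auto &r : recs) {
                auto it = conflict_set.find(r);
                if (it != conflict_set.end() && it->second == trx)
                  conflict_set.erase(it);
              }
            }

            int main() {
              try_acquire(1, "r_m");            // T1 registers r_m: accepted
              try_acquire(2, "r_k");            // T2: r_k is free, accepted
              bool ok = try_acquire(2, "r_m");  // T2: r_m owned by T1 => wait
              // ... T2 runs on r_k, then waits for r_m until T1 commits:
              release(1, {"r_m"});
              ok = try_acquire(2, "r_m");       // now accepted
              (void)ok;
            }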

            In my view this third method would be a great candidate, as it apparently scales.
            Even with a fairly contentious workload, the record-level granularity of conflict handling still promises some scale-up. Let me leave out the low-level technical aspects for the nonce, mentioning though that encoding record ids into the binlog (e.g. on the master) and decoding them is fairly standard practice, at least when tables are PK-equipped.

            As a final point, regardless of whether this general method gets endorsed, and before committing to any of them, I believe (knielsen points to that too) that we need to spend time benchmarking (close to) practical or modeled workloads, which would unequivocally reveal whether the optimistic parallel slave's retries indeed bear unacceptable execution cost. For that analysis we now also have a useful tool (under review): MDEV-35217, parallel replication stats.


            People

              Assignee: Andrei Elkin
              Reporter: Aleksey Midenkov
              Votes: 1
              Watchers: 9

