MariaDB Server / MDEV-30421

SAMU-64 Allow administrators to enable or disable parallel replication on a per-table basis

Details

    • Type: New Feature
    • Status: Needs Feedback
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: 12.1
    • Component/s: Replication
    • Labels: None

    Description


      A per-domain dedicated thread processes ordered transactions. The thread is reserved from the total number of domain threads (controlled by slave_parallel_threads and slave_domain_parallel_threads). Whether an event goes to the ordered thread depends on the FL_ALLOW_PARALLEL flag as well as several other conditions. FL_ALLOW_PARALLEL is passed from the master and is set on the event according to the master's configuration directives. To allow the dedicated thread on a slave, it must be enabled explicitly with a configuration directive:

        set global slave_ordered_thread= 1;
      

      Originally this was controlled by the skip_parallel_replication session variable, which can be changed per statement. This patch adds several more directives to control it at the per-schema and per-table level:

        parallel_do_db
        parallel_do_table
        parallel_ignore_db
        parallel_ignore_table
        parallel_wild_do_table
        parallel_wild_ignore_table
      

      Each directive is a comma-separated list of fully qualified table names. Spaces after a comma are ignored (but not before it).

      "Table" directives take precedence over "db" directives. "Do" directives take precedence over "ignore" directives. "Wild" directives are checked if "do" and "ignore" directives did not match.

      If none of the above directives is present, everything is considered parallel. If any of them is present and the table does not match anything in the lists, it is considered ordered.

      Examples:

        set @@global.parallel_do_db= "db_parallel";
        set @@global.parallel_ignore_db= "db_serial";
        set global parallel_do_table= "db_serial.t3,  db_serial.t1";
        set global parallel_wild_ignore_table= "db_parallel.non_parallel_%";
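
      To make the precedence concrete, here is a minimal C++ sketch of the lookup. The ParallelFilter structure and its helpers are hypothetical illustrations of the stated rules, not the server's actual implementation:

        #include <string>
        #include <vector>

        // Hypothetical stand-in for the directive lists above.
        struct ParallelFilter {
          std::vector<std::string> do_db, ignore_db;
          std::vector<std::string> do_table, ignore_table;  // "db.table"
          std::vector<std::string> wild_do_table, wild_ignore_table;

          static bool in(const std::vector<std::string> &v,
                         const std::string &s) {
            for (const auto &e : v) if (e == s) return true;
            return false;
          }
          // Tiny wildcard matcher: handles only a trailing '%', enough for
          // the example patterns above (real patterns also support '_').
          static bool wild_in(const std::vector<std::string> &v,
                              const std::string &s) {
            for (const auto &p : v) {
              auto pos = p.find('%');
              if (pos == std::string::npos ? p == s
                                           : s.compare(0, pos, p, 0, pos) == 0)
                return true;
            }
            return false;
          }
          bool empty() const {
            return do_db.empty() && ignore_db.empty() && do_table.empty() &&
                   ignore_table.empty() && wild_do_table.empty() &&
                   wild_ignore_table.empty();
          }
          // true => parallel, false => ordered
          bool is_parallel(const std::string &db,
                           const std::string &table) const {
            if (empty()) return true;               // no directives: all parallel
            const std::string fq = db + "." + table;
            if (in(do_table, fq)) return true;      // "table" beats "db",
            if (in(ignore_table, fq)) return false; // "do" beats "ignore"
            if (in(do_db, db)) return true;
            if (in(ignore_db, db)) return false;
            if (wild_in(wild_do_table, fq)) return true;   // "wild" checked last
            if (wild_in(wild_ignore_table, fq)) return false;
            return false;                 // directives set but no match: ordered
          }
        };

      Under this reading of the rules, with the example settings above, db_serial.t1 comes out parallel (a do_table match) and db_serial.t2 comes out ordered (an ignore_db match). How a wild pattern interacts with a matching "db" directive, as in the db_parallel example, is a detail the description leaves open.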
      

      The normal behavior of an ordered transaction is to wait, before starting, for all prior transactions to commit: they go into different commit groups. But since all ordered transactions (within one domain) go to a single thread, we can lift that restriction with this directive on the slave:

        set global slave_ordered_dont_wait= 1;
      

      When set, events without the explicit FL_WAITED flag that go to the ordered thread nonetheless accept optimistic speculation: they get into the same commit group as parallel events, i.e. an ordered event is executed in parallel with parallel events.
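
      A rough sketch of that waiting decision, reusing the flag names from the description; the numeric flag values, the plain-boolean globals, and the helper function are illustrative assumptions:

        #include <cstdint>

        // Flags as described above; the values are illustrative only.
        constexpr uint16_t FL_ALLOW_PARALLEL = 1 << 0;  // set by master config
        constexpr uint16_t FL_WAITED         = 1 << 1;  // explicit wait request

        bool slave_ordered_thread    = true;  // set global slave_ordered_thread= 1
        bool slave_ordered_dont_wait = true;  // set global slave_ordered_dont_wait= 1

        // Whether an event routed to the ordered thread must wait for all
        // prior transactions to commit (a separate commit group), or may
        // speculate optimistically in the same commit group as parallel events.
        bool ordered_event_must_wait(uint16_t flags) {
          if (flags & FL_WAITED)
            return true;                    // explicit flag always waits
          return !slave_ordered_dont_wait;  // otherwise the global decides
        }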

      Activity

            Rob Schwyzer (Inactive) added a comment -

            Elkin, following up on my last post: the added complexity of expanding this via GTID domain ID could be a little rough.

            The default assumption is that different GTID domains exist for cases where multiple datasets with no interdependencies of any kind live on the same MariaDB Server. This differs from what we're discussing in this MDEV.

            For the original implementation, ignoring the above was fine because the end user does not use GTID domains, so it was safe for us to "abuse" GTID domain ID functionality to handle this: we know, intrinsically, that we still need to synchronize between that GTID domain and the main one.

            However, for more generalized use cases where end users already have multiple GTID domains in use, there is the question of how to map things. For example, say I have gtid_domain_id=1, gtid_domain_id=2, and gtid_domain_id=3 on my MariaDB Server 10.6. These are all unique replication streams which do not depend on each other at all and can run completely in parallel with each other.

            Now say that I am ready to upgrade to a MariaDB version that supports something like this MDEV-30421, and I have a table in gtid_domain_id=2 which runs afoul of slave_parallel_mode=optimistic, so I want to separate that table into a serial replication setup.

            I think the real problem to solve is how we'd handle the above. Just creating gtid_domain_id=99, configuring that table to use single-threaded replication, and having that synchronize with gtid_domain_id=2 might be a little difficult.

            It may be worth exploring other nomenclature pathways for this, whether that means adding to the GTID (e.g. domain_id-stream_id-server_id-position) or using server_id instead (maybe updating the documentation on that to make it a little more abstract), as that might map a little closer to what we're doing logically here. From a technical standpoint that seems to have a lot of obstacles as well, not least that it would raise major questions about how to configure primaries and replicas such that when a primary fails, a replica can be promoted without needing its configuration changed extensively.

            Is this more of what you were getting at, Elkin?

            Andrei Elkin added a comment -

            rob.schwyzer@mariadb.com, on the points that you raise:
            1. Indeed, something like [gtid-domain-99] is imaginable and achievable, including the domain dependency.
            2. You're right that if some form of synchronization between the main domain and the "99" domain is necessary, then SET @@session.skip_parallel_replication=1 already provides correctness, that is, for the general case where the DBA is unaware of synchronization properties (such as two transactions from different domains touching common data, or having to commit in a certain order).
            3. I did not consider the mapped domain's dependency on the main one. I feel it's a productive area of research. Your consideration of server_id hints at the notion of a sub-domain, one that also inherits the new property of arbitrary size. Transactions assigned to a sub-domain must serialize their commit with the rest of the domain.

            I'd like to spend a little time proving to myself that it's not a trap.

            Andrei Elkin added a comment -

            To expand on point 3, we don't have to introduce the sub-domain formally. In fact, since MDEV-11675 we have had functionality to distribute quasi-parallelizable load while respecting certain dependencies. Later, MDEV-33668 extended that to XA transactions: for example, two prepared transactions T_1(X) --> T_k(X) having the same xid X are scheduled to the same worker.
            Let's exploit that to effectively construct multiple "sub-domains", each served by a single worker, while the commits of the domain's transactions overall respect the binlog order of the domain. E.g. a setting like

            set @@global.replicate_db_to_subdomain= "db_A->sd1";
            set @@global.replicate_db_to_subdomain= "db_B->sd2";
            

            implements 2 sub-domains. The sub-domain id can be understood as a worker index, which suggests encoding it as an integer into Gtid_log_event. GTIDs marked this way would be handled by the parallel slave distributor (aka SQL/Driver) thread similarly to ALTER fragments or XA parts.
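
            A toy C++ illustration of the proposed mapping; only the replicate_db_to_subdomain setting and the worker-index idea come from the comment, while the parsed table and the dispatch function below are hypothetical:

            #include <iostream>
            #include <map>
            #include <string>

            // Parsed form of the proposed "db->sdN" assignments.
            std::map<std::string, int> subdomain_of = {
              {"db_A", 1},  // replicate_db_to_subdomain= "db_A->sd1"
              {"db_B", 2},  // replicate_db_to_subdomain= "db_B->sd2"
            };

            // The sub-domain id acts as a worker index: all transactions on
            // db_A are pinned to worker 1, db_B to worker 2, and everything
            // else is scheduled as usual (-1 meaning "any worker").
            int worker_for(const std::string &db) {
              auto it = subdomain_of.find(db);
              return it == subdomain_of.end() ? -1 : it->second;
            }

            int main() {
              std::cout << worker_for("db_A") << "\n";  // 1: pinned worker
              std::cout << worker_for("db_C") << "\n";  // -1: normal scheduling
            }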


            Rob Schwyzer (Inactive) added a comment -

            That sounds promising!
            Andrei Elkin added a comment - edited

            Let's sum up the discussion points.
            The ticket aims at finding methods to improve parallel slave performance in cases where the master's binlog, i.e. the slave's workload, consists of dependent/conflicting transactions.

            When conflicts can be identified and localized within sets of databases or tables, methods like the one in the description or the later proposed "sub-domain" might make sense. As both feature single-threaded applying of conflicting transactions, these methods can at best be qualified as hopeful. Please read on for another approach that really scales up.

            But first, a few remarks on the already discussed candidates.
            Considering the one in the description, I would still prefer to convert it so that the master-side setting defines GTID domains. That is, the proposed
              set global parallel_do_table= "db_serial.t3, db_serial.t1";
            would be converted into something like
              set global binlog_do_tables_to_domain= 99, "db_serial.t3, db_serial.t1";
            where the first number in the assigned value, 99, is the "problematic" gtid domain to binlog with.
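
            In other words (a hypothetical sketch; binlog_do_tables_to_domain does not exist today, and the real decision point would sit in the master's binlogging code):

            #include <cstdint>
            #include <set>
            #include <string>

            // Parsed form of the hypothetical
            //   set global binlog_do_tables_to_domain= 99, "db_serial.t3, db_serial.t1";
            const uint32_t problematic_domain = 99;
            const std::set<std::string> problematic_tables = {"db_serial.t3",
                                                              "db_serial.t1"};
            const uint32_t default_domain = 0;  // the server's gtid_domain_id

            // Pick the gtid domain to binlog a transaction with: if it touches
            // any listed table, the whole transaction goes to the
            // "problematic" domain.
            uint32_t domain_for(const std::set<std::string> &tables_updated) {
              for (const auto &t : tables_updated)
                if (problematic_tables.count(t))
                  return problematic_domain;
              return default_domain;
            }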

            This measure naturally partitions the binlog into domains, so the slave side might like, or simply need, an option to control the size of the worker pool for a specific domain, that is, to have a single worker handle the conflicting stream of transactions. That can be arranged technically, perhaps borrowing a pattern from multi-source replication.

            The sub-domain method, on the other hand, requires neither master-side changes nor the per-domain pool-size tweak.

            As noted, it is fair to suspect that neither of the two methods may achieve any better performance in a practical deployment. Indeed, their benefit shrinks as the percentage of conflicting transactions in the slave workload grows. Obviously we can blame their single-threadedness, but it is also about the granularity of the conflicting objects, which here are entire transactions.

            Back when these two methods were actively discussed, a more general approach also crossed the minds of a few people, incl. serg, and at a recent replication team meeting we came across it again. It is this: why don't we compute, or log, dependency information into replication (ROW-format) events? With that in hand, a transaction would be executed so as to respect its dependencies. The dependency can be defined at a granularity as fine as the record level. E.g. T2 depends on T1 (think of 1 and 2 as gtid:s)

            T2(r_k,r_m) -> T1(r_m)

            via the common record r_m that is modified by both. Here _k identifies a record in the table.
            Assume T1 is the very first to execute on the slave. At scheduling time, T1's record id r_m^1 would be registered in some conflict set of such ids (^1 and ^2 denote which transaction operates on the record). While T1 is running, T2 comes in for scheduling. Its r_k and r_m^2 would also be checked against the conflict set: r_k would be accepted right away, but r_m^2 would only be queued for acceptance. T2 would then start running, first on r_k, and afterward switch to r_m^2. At that point, depending on T1's pace, the record would either be found available or not yet; in the latter case T2 waits.
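
            A simplified C++ model of that conflict set, with all names and structures invented for illustration; the real mechanism would live in the parallel slave's scheduler:

            #include <cstdint>
            #include <map>
            #include <set>
            #include <string>

            using Gtid = uint64_t;
            using RecordId = std::string;  // e.g. PK value encoded in the binlog

            // Conflict set: record id -> gtid of the transaction owning it.
            std::map<RecordId, Gtid> conflict_set;

            // Register a transaction's record at scheduling time. Returns true
            // if the record is free (accepted right away), false if the
            // transaction must queue behind the current owner and wait before
            // touching this record.
            bool try_acquire(Gtid trx, const RecordId &rec) {
              auto [it, inserted] = conflict_set.emplace(rec, trx);
              return inserted || it->second == trx;
            }

            // On commit the transaction releases its records, waking waiters.
            void release(Gtid trx, const std::set<RecordId> &recs) {
              for (const auto &r : recs) {
                auto it = conflict_set.find(r);
                if (it != conflict_set.end() && it->second == trx)
                  conflict_set.erase(it);
              }
            }

            int main() {
              try_acquire(1, "r_m");            // T1 registers r_m: accepted
              try_acquire(2, "r_k");            // T2: r_k is free, accepted
              bool ok = try_acquire(2, "r_m");  // T2: r_m owned by T1 => wait
              // ... T2 runs on r_k, then waits for r_m until T1 commits:
              release(1, {"r_m"});
              ok = try_acquire(2, "r_m");       // now accepted
              (void)ok;
            }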

            In my view this third method would be a great candidate, as it apparently scales.
            Even with a fairly contentious workload, the record-level granularity of conflict handling still promises some scale-up. Let me leave out the low-level technical aspects for the nonce, mentioning though that encoding record ids into the binlog (e.g. on the master) and decoding them is fairly standard practice, at least when tables are PK-equipped.

            As a final point, regardless of whether this general method gets endorsed, and before committing to any of them, I believe (knielsen points to that too) that we need to spend time benchmarking (close to) practical or modeled workloads, which would unequivocally reveal whether the optimistic parallel slave's retries indeed bear unacceptable execution cost. For that analysis we now also have a useful tool (under review): MDEV-35217, parallel replication stats.


            People

              Assignee: Andrei Elkin
              Reporter: Aleksey Midenkov
              Votes: 1
              Watchers: 9

