MariaDB Server / MDEV-34485

Ignored GTID domain IDs still appear in gtid_slave_pos

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.6.18
    • Fix Version/s: None
    • Component/s: Replication
    • Labels: None

    Description

      Hello,

      I have an asynchronous replica replicating from a Galera cluster through a load balancer, so the replica may be connected to either cluster member.

      I have configured wsrep_mode on, set wsrep_gtid_domain_id to 4248433653, and set the two nodes' (local) gtid_domain_id to 293176233 and 1908279884 respectively. On the replica, I have configured Replicate_Ignore_Domain_Ids: 293176233, 1908279884.

      When setting up replication, I removed those domains from the slave position:
      SET GLOBAL gtid_slave_pos="4248433653-293176233-12529601";
      CHANGE MASTER TO master_use_gtid=slave_pos, IGNORE_DOMAIN_IDS=(293176233, 1908279884);
      START SLAVE;

      Initially, replication comes up and I am able to switch it over between the two cluster nodes.

      But after a rolling schema upgrade on the cluster, the local GTID appears in gtid_slave_pos on the replica. Then, when failing replication over to the other node, replication does not start.
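
      To illustrate the symptom (the sequence numbers and the server-ID part of the node-local GTID below are placeholders):

      -- before the rolling schema upgrade: only the cluster-wide domain
      SELECT @@GLOBAL.gtid_slave_pos;
      -- 4248433653-293176233-12529601

      -- after the rolling schema upgrade: the node-local domain appears as well,
      -- even though it is listed in Replicate_Ignore_Domain_Ids
      SELECT @@GLOBAL.gtid_slave_pos;
      -- 4248433653-293176233-12529700,293176233-293176233-42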

      Simply removing that domain ID from the replication position again is not a work-around.

      I expect that GTID not to appear in gtid_slave_pos, since the domain is ignored.

      Is this intended behaviour? If it is, how can I prevent these domains from appearing in gtid_slave_pos? Otherwise it is quite difficult to replicate asynchronously from a Galera cluster.

          Activity

            michaeldg Michaël de groot added a comment:

            I don't think a primary allows a position in a domain that is not yet known to it. I did not try this work-around, though, as the replica is configured by automation, and I prefer that the slave reconfiguration remains purely MariaDB/load balancer.

            I did make another work-around; on the primary side, this removes the GTID domains that are not desired:

            https://gitlab.com/de-groot-consultancy-ansible-roles/dba-toolkit/-/blob/main/files/galera-remove-local-domain.sh and
            https://gitlab.com/de-groot-consultancy-ansible-roles/dba-toolkit/-/blob/main/files/remove-mariadb-gtid-domain.sh

            The work-around will purge binary logs as needed, so it is not without downsides.
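
            A hedged sketch of what such a primary-side cleanup presumably boils down to (the linked scripts are the real implementation; this only illustrates the DELETE_DOMAIN_ID approach mentioned later in this thread):

            -- rotate and purge old binary logs first: DELETE_DOMAIN_ID refuses to run
            -- while the domain still appears in any remaining binlog file
            FLUSH BINARY LOGS;
            PURGE BINARY LOGS BEFORE NOW();
            -- then drop the node-local domains from the binlog GTID state
            FLUSH BINARY LOGS DELETE_DOMAIN_ID = (293176233, 1908279884);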


            knielsen Kristian Nielsen added a comment:

            > I don't think a primary allows a position in a domain that is not yet known

            This is normally true, but it is allowed when the --gtid-ignore-duplicates option is set on the slave. The idea is that the slave wants to ignore duplicate GTIDs that it already received through a different path, even if this master didn't even see those GTIDs yet. Or in the case of the proposed workaround, the slave wants to ignore all past and future GTIDs in that domain.
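
            A minimal sketch of that on the replica (whether this matches the originally proposed work-around exactly is an assumption):

            STOP SLAVE;
            SET GLOBAL gtid_ignore_duplicates = ON;
            -- gtid_slave_pos may now reference node-local domains that the currently
            -- selected master has never seen; with gtid_ignore_duplicates the
            -- requested start position is still accepted
            CHANGE MASTER TO master_use_gtid = slave_pos;
            START SLAVE;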

            But good if you found a work-around that works for you. If you are able to solve the problem with DELETE_DOMAIN_ID, it sounds like the problem is old transactions/DDL that maybe should never have been binlogged in the first place (SQL_LOG_BIN=0 ?), but perhaps more familiarity with Galera is needed to understand what happened there.
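
            For completeness, a hedged example of the SQL_LOG_BIN=0 idea (the table and DDL are made up): statements run this way are never written to the binlog, so they never create a GTID in the node-local domain.

            SET SESSION sql_log_bin = 0;   -- skip binlogging for this session
            ALTER TABLE t1 ADD COLUMN c2 INT;
            SET SESSION sql_log_bin = 1;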


            michaeldg Michaël de groot added a comment:

            Ah, that explains it, and could be another work-around. At the moment of setting up the replica I already have those domain IDs fetched, in order to add them to the ignore list. Thank you, I will try it.

            When changing the schema in a Galera cluster, the way to do this without downtime is called a rolling schema upgrade. It is comparable to (with asynchronous replication) changing the replica first and then switching over. The schema changes executed on each cluster member normally generate a transaction in that member's local GTID domain. If failover through the load balancer were not needed, these transactions would simply replicate to the asynchronous replica (when they are executed on the cluster member that happens to be the primary for that replica). I guess that ideally, two different features should be implemented on the primary side:
            1. Ignore the domain in positioning and don't send any transactions for that domain.
            2. Take the current position of that domain and send updates from that position onwards.
            Your work-around covers situation/feature 1.
            With the 2nd feature we would not have to execute the schema migration on the asynchronous replicas. I will add that to the feature request.

            FYI, I took the approach of filtering these domains out because the asynchronous replica is also used for point-in-time recovery, so in case of big issues on the cluster a fast switchover is essential.


            knielsen Kristian Nielsen added a comment:

            Aha, so the "same" transaction/DDL is run manually on all the nodes, and this way they get different GTIDs even though they are logically the same transaction.

            Well, you could actually explicitly set the GTID when doing the DDL on a node, to ensure it gets the same GTID as on the other nodes. That would be the "GTID" way, and ensure consistent binlog and GTID positions across all servers. I'm just mentioning this for completeness, in the end whatever works for your setup is the right solution for you.
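
            A hedged sketch of that, assuming SUPER privilege and that it does not conflict with the wsrep GTID handling (the values and the DDL are placeholders; the same values would be used on every node):

            SET SESSION gtid_domain_id = 4248433653;
            SET SESSION server_id      = 293176233;
            SET SESSION gtid_seq_no    = 12529700;   -- pins the GTID of the next binlogged event
            ALTER TABLE t1 ADD COLUMN c2 INT;        -- the "same" DDL run on every node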

            For (2), I think that's possible by simply setting the current position of that domain explicitly in gtid_slave_pos before disabling the domain filter and restarting the slave.
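
            A hedged sketch of that on the replica (the second GTID is a placeholder for the domain's current position as seen on the primary):

            STOP SLAVE;
            SET GLOBAL gtid_slave_pos = '4248433653-293176233-12529700,293176233-293176233-57';
            CHANGE MASTER TO master_use_gtid = slave_pos,
                             IGNORE_DOMAIN_IDS = ();   -- clear the domain filter
            START SLAVE;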

            Thanks for the additional information and clarification.

            - Kristian.


            michaeldg Michaël de groot added a comment:

            I can confirm that your work-around works.

            For situation #2 that is not easily done, because when the replication stream fails over, the automation (or the human) that could look up that position is not involved. But for (1) I confirm it works.


            People

              Assignee: Unassigned
              Reporter: Michaël de groot (michaeldg)
              Votes: 0
              Watchers: 3

