MariaDB Server / MDEV-34485

Ignored GTID domain IDs still appear in gtid_slave_pos

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.6.18
    • Fix Version/s: None
    • Component/s: Replication
    • Labels: None

    Description

      Hello,

      I have an asynchronous replica replicating from a Galera cluster through a load balancer, so the replica may be connected to either cluster member.

      I have configured wsrep_mode on, set wsrep_gtid_domain_id to 4248433653, and set the two nodes' (local) gtid_domain_id to 293176233 and 1908279884 respectively. On the replica, I have configured Replicate_Ignore_Domain_Ids: 293176233, 1908279884.

      When setting up replication, I removed those domains from the slave position:
      SET GLOBAL gtid_slave_pos="4248433653-293176233-12529601";
      CHANGE MASTER TO master_use_gtid=slave_pos, IGNORE_DOMAIN_IDS=(293176233, 1908279884);
      START SLAVE;

      Initially, replication comes up and I am able to switch it over between the two cluster nodes.

      But after a rolling schema upgrade on the cluster, the local GTID appears in gtid_slave_pos on the replica. Then, when failing replication over to the other node, replication does not start.
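
      To illustrate the symptom (the sequence numbers and the server-ID part of the node-local GTID below are placeholders):

      -- before the rolling schema upgrade: only the cluster-wide domain
      SELECT @@GLOBAL.gtid_slave_pos;
      -- 4248433653-293176233-12529601

      -- after the rolling schema upgrade: the node-local domain appears as well,
      -- even though it is listed in Replicate_Ignore_Domain_Ids
      SELECT @@GLOBAL.gtid_slave_pos;
      -- 4248433653-293176233-12529700,293176233-293176233-42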

      Simply removing that domain ID from the replication position again is not a work-around.

      I expect that GTID not to appear in gtid_slave_pos, since the domain is ignored.

      Is this intended behaviour? If it is, how can I prevent these domains from appearing in gtid_slave_pos? Otherwise it is quite difficult to replicate asynchronously from a Galera cluster.

          Activity

            michaeldg Michaël de groot added a comment:

            I don't think a primary allows a position in a domain that is not yet known to it. I did not try this work-around, though, as the replica is configured by automation, and I prefer that the slave reconfiguration remains purely MariaDB/load balancer.

            I did make another work-around; on the primary side, this removes the GTID domains that are not desired:

            https://gitlab.com/de-groot-consultancy-ansible-roles/dba-toolkit/-/blob/main/files/galera-remove-local-domain.sh and
            https://gitlab.com/de-groot-consultancy-ansible-roles/dba-toolkit/-/blob/main/files/remove-mariadb-gtid-domain.sh

            The work-around will purge binary logs as needed, so it is not without downsides.
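
            A hedged sketch of what such a primary-side cleanup presumably boils down to (the linked scripts are the real implementation; this only illustrates the DELETE_DOMAIN_ID approach mentioned later in this thread):

            -- rotate and purge old binary logs first: DELETE_DOMAIN_ID refuses to run
            -- while the domain still appears in any remaining binlog file
            FLUSH BINARY LOGS;
            PURGE BINARY LOGS BEFORE NOW();
            -- then drop the node-local domains from the binlog GTID state
            FLUSH BINARY LOGS DELETE_DOMAIN_ID = (293176233, 1908279884);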


            knielsen Kristian Nielsen added a comment:

            > I don't think a primary allows a position in a domain that is not yet known

            This is normally true, but it is allowed when the --gtid-ignore-duplicates option is set on the slave. The idea is that the slave wants to ignore duplicate GTIDs that it already received through a different path, even if this master didn't even see those GTIDs yet. Or in the case of the proposed workaround, the slave wants to ignore all past and future GTIDs in that domain.
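
            A minimal sketch of that on the replica (whether this matches the originally proposed work-around exactly is an assumption):

            STOP SLAVE;
            SET GLOBAL gtid_ignore_duplicates = ON;
            -- gtid_slave_pos may now reference node-local domains that the currently
            -- selected master has never seen; with gtid_ignore_duplicates the
            -- requested start position is still accepted
            CHANGE MASTER TO master_use_gtid = slave_pos;
            START SLAVE;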

            But good if you found a work-around that works for you. If you are able to solve the problem with DELETE_DOMAIN_ID, it sounds like the problem is old transactions/DDL that maybe should never have been binlogged in the first place (SQL_LOG_BIN=0 ?), but perhaps more familiarity with Galera is needed to understand what happened there.
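
            For completeness, a hedged example of the SQL_LOG_BIN=0 idea (the table and DDL are made up): statements run this way are never written to the binlog, so they never create a GTID in the node-local domain.

            SET SESSION sql_log_bin = 0;   -- skip binlogging for this session
            ALTER TABLE t1 ADD COLUMN c2 INT;
            SET SESSION sql_log_bin = 1;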


            michaeldg Michaël de groot added a comment:

            Ah, that explains it, and could be another work-around. At the moment of setting up the replica I already have those domain IDs fetched, in order to add them to the ignore list. Thank you, I will try it.

            When changing the schema in a Galera cluster, the way to do this without downtime is called a rolling schema upgrade. It is comparable to (with asynchronous replication) changing the replica first and then switching over. The schema changes executed on each cluster member normally generate a transaction in that member's local GTID domain. If failover through the load balancer were not needed, these transactions would simply replicate to the asynchronous replica (when they are executed on the cluster member that happens to be the primary for that replica). I guess that ideally, two different features should be implemented on the primary side:
            1. Ignore the domain in positioning and don't send any transactions for that domain.
            2. Take the current position of that domain and send updates from that position onwards.
            Your work-around covers situation/feature 1.
            With the 2nd feature we would not have to execute the schema migration on the asynchronous replicas. I will add that to the feature request.

            FYI, I took the approach of filtering these domains out because the asynchronous replica is also used for point-in-time recovery, so in case of big issues on the cluster a fast switchover is essential.


            knielsen Kristian Nielsen added a comment:

            Aha, so the "same" transaction/DDL is run manually on all the nodes, and this way they get different GTIDs even though they are logically the same transaction.

            Well, you could actually explicitly set the GTID when doing the DDL on a node, to ensure it gets the same GTID as on the other nodes. That would be the "GTID" way, and ensure consistent binlog and GTID positions across all servers. I'm just mentioning this for completeness, in the end whatever works for your setup is the right solution for you.
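
            A hedged sketch of that, assuming SUPER privilege and that it does not conflict with the wsrep GTID handling (the values and the DDL are placeholders; the same values would be used on every node):

            SET SESSION gtid_domain_id = 4248433653;
            SET SESSION server_id      = 293176233;
            SET SESSION gtid_seq_no    = 12529700;   -- pins the GTID of the next binlogged event
            ALTER TABLE t1 ADD COLUMN c2 INT;        -- the "same" DDL run on every node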

            For (2), I think that's possible by simply setting the current position of that domain explicitly in gtid_slave_pos before disabling the domain filter and restarting the slave.
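
            A hedged sketch of that on the replica (the second GTID is a placeholder for the domain's current position as seen on the primary):

            STOP SLAVE;
            SET GLOBAL gtid_slave_pos = '4248433653-293176233-12529700,293176233-293176233-57';
            CHANGE MASTER TO master_use_gtid = slave_pos,
                             IGNORE_DOMAIN_IDS = ();   -- clear the domain filter
            START SLAVE;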

            Thanks for the additional information and clarification.

            - Kristian.


            michaeldg Michaël de groot added a comment:

            I can confirm that your work-around works.

            For situation #2 that is not easily done, because when the replication stream fails over, the automation (or the human) that could look up that position is not involved. But for (1) I confirm it works.


            People

              Assignee: Unassigned
              Reporter: Michaël de groot (michaeldg)
              Votes: 0
              Watchers: 3

