MariaDB Server / MDEV-9108

"GTID not in master's binlog" error with {ignore|do}_domain_ids

Details

    Description

      Consider a 3-master setup in which each server has 2 replication channels, one to each of the other 2 servers. The replication channels were set up as follows:

      SETTING: Server_id: 1 IP: 10.0.3.223

      STOP ALL SLAVES;
      CHANGE MASTER "S1_R2" TO
      master_host = "10.0.3.136",
      master_user = "replicator",
      master_use_gtid = slave_pos,
      master_password = "password",
      do_domain_ids = (2);
      CHANGE MASTER "S1_R3" TO
      master_host = "10.0.3.171",
      master_user = "replicator",
      master_use_gtid = slave_pos,
      master_password = "password",
      do_domain_ids = (3);
      START ALL SLAVES;

      SETTING: Server_id: 2 IP: 10.0.3.136

      STOP ALL SLAVES;
      CHANGE MASTER "S2_R1" TO
      master_host = "10.0.3.223",
      master_user = "replicator",
      master_use_gtid = slave_pos,
      master_password = "password",
      do_domain_ids = (1);
      CHANGE MASTER "S2_R3" TO
      master_host = "10.0.3.171",
      master_user = "replicator",
      master_use_gtid = slave_pos,
      master_password = "password",
      do_domain_ids = (3);
      START ALL SLAVES;

      SETTING: Server_id: 3 IP: 10.0.3.171

      STOP ALL SLAVES;
      CHANGE MASTER "S3_R1" TO
      master_host = "10.0.3.223",
      master_user = "replicator",
      master_use_gtid = slave_pos,
      master_password = "password",
      do_domain_ids = (1);
      CHANGE MASTER "S3_R2" TO
      master_host = "10.0.3.136",
      master_user = "replicator",
      master_use_gtid = slave_pos,
      master_password = "password",
      do_domain_ids = (2);
      START ALL SLAVES;

      After all replication channels have initially been started:
      1. Stop server 1.
      2. Issue an INSERT, UPDATE, or DELETE on server 2.
      3. Stop server 2.
      4. Start server 1. Replication channel S1_R3 comes up immediately, because server 3 never stopped.
      5. Start server 2. Replication channel S2_R3 comes up immediately, because server 3 never stopped. But replication channel S2_R1 does not come up and reports an error like “Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 2-2-10, which is not in the master's binlog'”, meaning that server 1 does not have the most recent transaction from domain ID 2.

      Observe that replication channel S2_R1 is reporting an error about a domain ID (2) that it has been explicitly told not to track at all! S2_R1 is supposed to track only domain ID 1.

      The solution seems to be for MariaDB, when a replication channel starts, to send the GTID slave position only for the domain IDs that the channel should track, as defined by {ignore|do}_domain_ids.
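
The proposed behaviour can be sketched as a small filter over the GTID position string. This is only an illustration of the idea, not MariaDB's implementation: the function names are hypothetical, and the sequence numbers for domains 1 and 3 in the usage example are made up (only 2-2-10 comes from the error message above). The sketch assumes at most one of the two lists is given per channel.

```python
def parse_gtid_pos(pos):
    """Parse 'domain-server-seq,...' into {domain_id: (server_id, seq_no)}."""
    result = {}
    for gtid in filter(None, (part.strip() for part in pos.split(","))):
        domain, server, seq = (int(x) for x in gtid.split("-"))
        result[domain] = (server, seq)
    return result


def filter_gtid_pos(pos, do_domain_ids=None, ignore_domain_ids=None):
    """Return the part of a GTID position that a channel would send.

    Hypothetical helper: keeps only the domains the channel tracks.
    Assumption of this sketch: at most one of the two lists is set.
    """
    if do_domain_ids and ignore_domain_ids:
        raise ValueError("set either do_domain_ids or ignore_domain_ids, not both")

    def tracked(domain):
        if do_domain_ids:
            return domain in do_domain_ids
        if ignore_domain_ids:
            return domain not in ignore_domain_ids
        return True  # no filter configured: send every domain

    kept = {d: v for d, v in parse_gtid_pos(pos).items() if tracked(d)}
    return ",".join(f"{d}-{s}-{n}" for d, (s, n) in sorted(kept.items()))


# Channel S2_R1 on server 2 tracks only domain 1, so under this proposal
# the GTID 2-2-10 would never be presented to server 1.
print(filter_gtid_pos("1-1-5,2-2-10,3-3-7", do_domain_ids={1}))
```

With such a filter, server 1 would only be asked to resume domain 1 for channel S2_R1, so the missing transaction 2-2-10 would not block the connection. Whether omitting domains is safe in every scenario is exactly the open question in this report.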

          Activity

            rsevero Rodrigo Severo added a comment - I believe there was some expectation that fixing MDEV-9033 would fix this issue, but unfortunately it did not.
            knielsen Kristian Nielsen added a comment (edited)

            Generally, a slave is not allowed to connect to a master on a GTID which is
            missing in the master's binlog. This is to prevent silent corruption.

            There are a couple of exceptions to this rule. One is that if the master has
            no GTIDs in a domain, then that domain is ignored. I think another is that
            the rule is relaxed in case of --gtid-ignore-duplicates=1, for reasons like
            described in this report.

            I think the request here is for another similar exception in case of
            --do-domain-ids. This could be reasonable, but it is not implemented
            currently.

            An implementation might be as the reporter suggests. When the slave sends
            its replication position to the master, omit those domains that are
            configured to be ignored. However, some careful thought is needed to
            consider all possible scenarios and ensure that this does not lead to
            incorrect results.

            elenst Elena Stepanova added a comment - knielsen, do you want it to be converted into a feature request?
            Elkin Andrei Elkin added a comment -

            The ignore_domain_ids option could help
            the 12012 post-gtid-enabled slave connect successfully,
            without requiring the masters to forget/purge their old domain events.

            michaeldg Michaël de groot added a comment - I created a work-around for this: https://gitlab.com/de-groot-consultancy-ansible-roles/dba-toolkit/-/blob/main/files/galera-remove-local-domain.sh and https://gitlab.com/de-groot-consultancy-ansible-roles/dba-toolkit/-/blob/main/files/remove-mariadb-gtid-domain.sh. The work-around removes undesired GTID domains from the primary.

            People

              knielsen Kristian Nielsen
              rsevero Rodrigo Severo