MariaDB Server - MDEV-9108

"GTID not in master's binlog" error with {ignore|do}_domain_ids

Details

    Description

      Let's consider a 3-master setup where each server has 2 replication channels, one to each of the other 2 servers. These replication channels were set up with:

      SETTING: Server_id: 1 | IP: 10.0.3.223

      STOP ALL SLAVES;
      CHANGE MASTER "S1_R2" TO
        master_host = "10.0.3.136",
        master_user = "replicator",
        master_use_gtid = slave_pos,
        master_password = "password",
        do_domain_ids = (2);
      CHANGE MASTER "S1_R3" TO
        master_host = "10.0.3.171",
        master_user = "replicator",
        master_use_gtid = slave_pos,
        master_password = "password",
        do_domain_ids = (3);
      START ALL SLAVES;

      SETTING: Server_id: 2 | IP: 10.0.3.136

      STOP ALL SLAVES;
      CHANGE MASTER "S2_R1" TO
        master_host = "10.0.3.223",
        master_user = "replicator",
        master_use_gtid = slave_pos,
        master_password = "password",
        do_domain_ids = (1);
      CHANGE MASTER "S2_R3" TO
        master_host = "10.0.3.171",
        master_user = "replicator",
        master_use_gtid = slave_pos,
        master_password = "password",
        do_domain_ids = (3);
      START ALL SLAVES;

      SETTING: Server_id: 3 | IP: 10.0.3.171

      STOP ALL SLAVES;
      CHANGE MASTER "S3_R1" TO
        master_host = "10.0.3.223",
        master_user = "replicator",
        master_use_gtid = slave_pos,
        master_password = "password",
        do_domain_ids = (1);
      CHANGE MASTER "S3_R2" TO
        master_host = "10.0.3.136",
        master_user = "replicator",
        master_use_gtid = slave_pos,
        master_password = "password",
        do_domain_ids = (2);
      START ALL SLAVES;

      After initially starting all replication channels:
      1. stop server 1
      2. issue an INSERT/UPDATE/DELETE on server 2
      3. stop server 2
      4. start server 1. At this point replication channel S1_R3 comes up and runs immediately, as server 3 never stopped.
      5. start server 2. At this point replication channel S2_R3 comes up and runs immediately, as server 3 never stopped. BUT replication channel S2_R1 does not come up and reports an error like "Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 2-2-10, which is not in the master's binlog'", i.e. server 1 does not have the most up-to-date transaction from domain id 2.

      Observe that replication channel S2_R1 reports an error about a domain ID (2) that it has been explicitly told not to track at all! S2_R1 is supposed to track only domain ID 1.
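
      The channel's domain filter and the error can be checked on server 2 with something like the following (a sketch only; the values in the comments are what the setup above implies, not captured output):

      -- on server 2, inspect the failing connection
      SHOW SLAVE 'S2_R1' STATUS\G
      -- Replicate_Do_Domain_Ids: 1       (only domain 1 should be applied)
      -- Last_IO_Error: Got fatal error 1236 ... GTID 2-2-10 ... not in the master's binlog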

      The solution for this issue seems to be for MariaDB, on replication channel start, to send the GTID slave position only for the domain IDs that the channel should track, as defined by {ignore|do}_domain_ids.
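
      Until something like that exists, a rough manual approximation is to trim the position by hand before restarting a channel (a sketch only: the GTID values below are made up, SET GLOBAL gtid_slave_pos requires all slave threads to be stopped, and it changes the starting position for every connection that uses slave_pos, so only drop domains that no channel on this server needs to resume from):

      STOP ALL SLAVES;
      SELECT @@gtid_slave_pos;                 -- e.g. '1-1-42,2-2-10'
      SET GLOBAL gtid_slave_pos = '1-1-42';    -- keep only the domain-1 part
      START ALL SLAVES;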

          Activity


            elenst Elena Stepanova added a comment -

            Assuming for the time being that my theory in the comment to MDEV-9107 is correct, and events from the domain ID should actually be tracked (just not applied) – does S2's binlog actually have GTID 2-2-10?


            rsevero Rodrigo Severo added a comment -

            S2's binlog certainly has GTID 2-2-10, as it points to an event that was issued directly on S2.

            S3 has GTID 2-2-10 in its binlog as well, as it got it from S2 through replication channel S3_R2.

            But S1 doesn't have GTID 2-2-10 in its binlog, as it never got it from S2 (observe that GTID 2-2-10 was issued on S2 after S1 was stopped). This fact isn't a problem at all.

            The problem is that, because GTIDs for ALL domain ids tracked on the slave are sent to the master on replication channel (re)start, when S2 tries to re-establish replication channel S2_R1 with S1, it will fail repeatedly until S1 finally gets GTID 2-2-10.

            The point here is that in this particular situation we are dealing "only" with a transient, annoying and silly failure, as S1 will eventually get GTID 2-2-10 through replication channel S1_R2. I say it's silly because why should the temporary unavailability of a GTID from a domain id that is filtered out of the channel block the channel from coming up? Observe that the problematic GTID is 2-2-10, while channel S2_R1 should deal only with domain id 1. I would really expect that whatever the GTID position of domain id 2 is, it would not affect the availability of a replication channel that is set to deal only with domain id 1.
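
            A quick way to check which domain-2 GTIDs S1 actually has in its binlog is, for example (the comments just restate the scenario above, not captured output):

            SELECT @@gtid_binlog_pos;      -- on S1: no domain-2 entry as recent as 2-2-10
            SELECT @@gtid_binlog_state;    -- full per-domain binlog state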

            So far I have described a transient failure, but now let me describe an even simpler and more common scenario where this replication start failure is permanent and catastrophic.

            Let's consider 2 masters:

            SETTING: Server_id: 1 | IP: 10.0.3.223
            STOP ALL SLAVES;
            CHANGE MASTER "S1_R2" TO
              master_host = "10.0.3.136",
              master_user = "replicator",
              master_use_gtid = slave_pos,
              master_password = "password",
              do_domain_ids = (2);
            START ALL SLAVES;

            SETTING: Server_id: 2 | IP: 10.0.3.136
            STOP ALL SLAVES;
            CHANGE MASTER "S2_R1" TO
              master_host = "10.0.3.223",
              master_user = "replicator",
              master_use_gtid = slave_pos,
              master_password = "password",
              do_domain_ids = (1);
            START ALL SLAVES;

            This could be used with 2 different databases, where each database is changed on one server and the other server acts as its slave.
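
            For this to work, each server presumably writes its own changes under its own domain id (an assumption about the setup, not part of the configuration above), for example on server 1:

            SET GLOBAL gtid_domain_id = 1;   -- new connections on server 1 binlog their events under domain 1

            and the equivalent with domain id 2 on server 2.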

            Now suppose a network problem between the servers makes the replication channels on both servers stop. Only the network connection between the servers is down; both remain accessible to their own applications.

            In this case the GTID positions of both servers keep advancing while the connection between them is broken.

            When the network connection between the servers is restored, both will try to re-establish their replication channels with the other server, but both will fail: during channel restart each asks for GTIDs of its own domain id, which the other server doesn't have and never will.

            This happens because, during replication channel start, GTIDs for all domains tracked on the slave are sent to the master. If the slave only sent GTIDs for the domain ids the replication channel should actually deal with, this problem wouldn't happen at all, and the replication channels in all these setups would come back immediately without any hiccups.

            And again, please observe that in the first scenario we are talking about an annoying delay in restarting the replication channel, while in the second scenario we are talking about a fatal failure that indefinitely blocks the replication channels from being restarted at all.


            elenst Elena Stepanova added a comment -

            rsevero,

            Did you actually try the scenario you have described in the last comment? It's a good one, because as you said it is (almost) deterministic, so you'll get a reliable result. If you didn't try it yet, please do. If you tried and got the failure you are describing, please send cnf files from both servers, binary logs, error logs and output of show all slaves status \G where you observe the errors. Maybe you have encountered another bug there.

            Meanwhile, please consider – if it works as you described with do_domain_ids, it should surely be the same without any do_domain_ids clauses, right? In that case, servers would have no excuse whatsoever for not tracking both domains, so the starting GTIDs are certainly stored, and following the same logic, they would definitely fail upon reconnect?
            But at the same time, if any M<=>M setup failed so badly upon any disconnect, we would have complaints all over, and it is not happening.

            I think the reason is here:

            In this case the GTID of both servers will be updated during the inter network connection breakup.

            When the network connection between both servers are restored, both will try to re-establish their replication channels with the other server but both will fail as during channel restart both will ask for GTIDs for their own domain ids that the other server haven't got and never will.

            What will be updated during network problems is @@gtid_current_pos, @@gtid_binlog_pos, and Gtid_Slave_Pos in the slave status. But these are not the values that the slave uses while reconnecting to the master. Neither @@gtid_slave_pos, nor mysql.gtid_slave_pos table, nor Gtid_IO_Pos will be updated. So, when the network connection is restored, the slave will connect to the master requesting the exact same position it had upon replicating the last event.
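
            For reference, these are the values in question (just the standard MariaDB status queries; nothing here is output from this report):

            SELECT @@gtid_current_pos, @@gtid_binlog_pos, @@gtid_slave_pos;
            SELECT * FROM mysql.gtid_slave_pos;   -- per-domain positions the slave actually resumes from
            SHOW ALL SLAVES STATUS\G              -- compare Gtid_IO_Pos and Gtid_Slave_Pos per connection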

            Note: While looking into it, I realized there is a good reason for confusion – we were trying so hard to give meaningful names for GTID fields and variables, but the result turned out horrible: we have @@gtid_slave_pos variable, and we have Gtid_Slave_Pos field in the SHOW SLAVE STATUS output, and these values have nearly nothing in common...

            Now, back to your original scenario – I cannot reproduce the problem. I have no doubts that you actually observed it; it's possible that there is a race condition that I'm failing to hit, but it's also possible that it is a weird consequence of MDEV-9033. If you are getting the problem fairly frequently, it would be interesting to see if you can reproduce it without do_domain_ids (but with gtid-ignore-duplicates of course). As said above, if the problem is as you describe, do_domain_ids should make no difference; on the other hand, removing them will rule out the MDEV-9033 effect.
            If it is not reproducible or you cannot experiment, I suggest keeping it open and waiting until nirbhay_c confirms/declines that tracking the position of all domains is part of the do_domain_ids design, and until MDEV-9033 is fixed. After that, we can revisit it.


            rsevero Rodrigo Severo added a comment -

            When trying to implement the 6-server setup across 2 cities that I described in MDEV-9107, I experienced both problems I described here:

            • the transient one between servers in the same city, and
            • the catastrophic one with the 2 servers, one in each city, that were supposed to exchange the updates made in their cities.

            I will try to implement the setup I envision using gtid-ignore-duplicates as you suggested. Thanks for pointing me to this option. It might work, and it will be great if it does.

            But the problem I mention here is real. If you are not receiving more reports about it, I bet that is just because few people are exercising GTID-based replication that much; it being a new concept, people facing problems probably just leave it alone, thinking "Am I doing something wrong?" Why do I say so? Because MDEV-9033 is a terrible bug and you are not receiving tons of reports about it either.

            If people were really trying to use GTID-based replication to its full potential, I'm sure you would be hearing reports like mine a lot, starting with MDEV-9033 of course.

            And the ones that are actually trying might be giving up prematurely because of issues like the ones we are discussing.

            Thanks again for your help. I will return with new info after my tests with gtid-ignore-duplicates.


            rsevero Rodrigo Severo added a comment -

            One detail that is probably important:

            every time I faced the issues described here, I had actually stopped replication for one reason or another. It was on START SLAVE that I got the problems mentioned here.

            The network connection problem was just me trying to imagine a more common situation where these issues might happen, but the network-failure situation and the STOP SLAVE/START SLAVE situation don't share the "slave sending start GTIDs to the master" step. Only the STOP SLAVE/START SLAVE one has this step, and this is the step where these issues happen.

            Sorry for the confusion.

            elenst Elena Stepanova added a comment - - edited

            When trying to implement the 6-server setup across 2 cities that I described in MDEV-9107, I experienced both problems I described here:
            ... the catastrophic one with the 2 servers, one in each city, that were supposed to exchange the updates made in their cities.

            Please provide the error logs, binary logs and cnf files from both servers.

            If it's reproducible for you with STOP SLAVE, even better – please do reproduce with STOP SLAVE and attach the logs/configs and the SHOW ALL SLAVES STATUS output.

            I'm talking about the two-server setup that you described before.
            If you experience the problem only on your 6-server setup which also uses do_domain_ids, that's another story – MDEV-9033 can have all kinds of side-effects, it does not make sense to dig until it's fixed.

            If people were really trying to use GTID based replication to it's full potential, I'm sure you would be hearing reports like mine a lot, starting with MDEV-9033 of course

            This is very true of course, but you've missed my point. The problem with the two-server setup as you described it does not require trying GTID to its full potential – it should affect pretty much every user who ever tries a simple 2-master topology and ever stops at least one slave.
            Once again, I'm not claiming that you did not encounter it, but there must be something more than you described. And I'm not just making theories, I did try it of course. That's why I would like you to actually try to do what you described (rather than describe what you previously did), and if it indeed happens for you according to your description, provide the logs and configuration because clearly there is something that's missing from the description.

            For one, I asked before but you never answered how you currently avoid MDEV-9033. Your setup should trigger it all the time, and your replication should not proceed further than one or two events on one of the slaves; yet, you describe rather advanced scenarios with restarting servers and all that, so apparently there is something in your configuration that you did not mention that lets you work around MDEV-9033 (but possibly causes some other problems that we don't know about).


            rsevero Rodrigo Severo added a comment -

            To see the catastrophic failure, do the following:

            SETTING: Server_id: 1 | Domain_id: 1 | IP: 10.0.0.1
            STOP ALL SLAVES;
            CHANGE MASTER "S1_R2" TO
            master_host = "10.0.0.2",
            master_user = "replicator",
            master_use_gtid = slave_pos,
            master_password = "password",
            do_domain_ids = (2);
            CHANGE MASTER "S1_R3" TO
            master_host = "10.0.0.3",
            master_user = "replicator",
            master_use_gtid = slave_pos,
            master_password = "password",
            do_domain_ids = (3,4);
            START ALL SLAVES;

            SETTING: Server_id: 2 | Domain_id: 2 | IP: 10.0.0.2
            STOP ALL SLAVES;
            CHANGE MASTER "S2_R1" TO
            master_host = "10.0.0.1",
            master_user = "replicator",
            master_use_gtid = slave_pos,
            master_password = "password",
            do_domain_ids = (1,3,4);
            START ALL SLAVES;

            SETTING: Server_id: 3 | Domain_id: 3 | IP: 10.0.0.3
            STOP ALL SLAVES;
            CHANGE MASTER "S3_R1" TO
            master_host = "10.0.0.1",
            master_user = "replicator",
            master_use_gtid = slave_pos,
            master_password = "password",
            do_domain_ids = (1,2);
            CHANGE MASTER "S3_R4" TO
            master_host = "10.0.0.4",
            master_user = "replicator",
            master_use_gtid = slave_pos,
            master_password = "password",
            do_domain_ids = (4);
            START ALL SLAVES;

            SETTING: Server_id: 4 | Domain_id: 4 | IP: 10.0.0.4
            STOP ALL SLAVES;
            CHANGE MASTER "S4_R3" TO
            master_host = "10.0.0.3",
            master_user = "replicator",
            master_use_gtid = slave_pos,
            master_password = "password",
            do_domain_ids = (1,2,3);
            START ALL SLAVES;

            All of them should have gtid_ignore_duplicates set to OFF.

            After starting all replication channels and confirming that they are all up and running:
            1. on server S1 do: CREATE TABLE t1 (i INT); INSERT INTO t1 VALUES (1); CREATE TABLE t2 (i INT);
            2. wait to be sure these commands have been executed by all 4 servers
            3. on server S1 do: STOP SLAVE 'S1_R3';
            4. on server S3 do: STOP SLAVE 'S3_R1';
            5. on server S2 do: INSERT INTO t1 VALUES (2); INSERT INTO t1 VALUES (3);
            6. on server S4 do: INSERT INTO t2 VALUES (4); INSERT INTO t2 VALUES (5);
            7. try to restart the replication channels: START SLAVE 'S1_R3'; on server S1 and START SLAVE 'S3_R1'; on server S3

            Watch how both of them will fail with a message of the form:

            “Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 2-2-10, which is not in the master's binlog'”

            I have this setup running where servers 1 and 2 are in one city and servers 3 and 4 are in another.
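
            After the failed restart, the mismatch can be seen with something like the following (a sketch; connection names and domain ids are the ones from the setup above, and the comments describe what the scenario implies, not captured output):

            -- on S1:
            SHOW SLAVE 'S1_R3' STATUS\G    -- Last_IO_Error shows the 1236 error
            SELECT @@gtid_binlog_pos;      -- no domain-4 GTIDs newer than the ones from before the STOP
            -- on S3:
            SHOW SLAVE 'S3_R1' STATUS\G
            SELECT @@gtid_binlog_pos;      -- no domain-2 GTIDs newer than the ones from before the STOP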


            elenst Elena Stepanova added a comment -

            Okay, good, so there are 4 servers after all, not just 2.
            In this case I'm inclined to think that the problem is indeed a side-effect of MDEV-9033, but I will try it shortly, hopefully I'll find a clear proof.


            rsevero Rodrigo Severo added a comment -

            I'm quite sure this isn't a side effect of MDEV-9033 as, AFAIU, MDEV-9033 is about MariaDB replication creating, out of the blue, a new GTID for an event received from a master, an event which obviously already had its own GTID. Because this spurious GTID is created, the loop is created.

            This issue is about a slave asking for a GTID of a domain id that shouldn't be handled by a replication channel, and the master refusing to start said replication channel because the requested GTID of this not-to-be-handled domain id is too recent for the master's binlog.


            elenst Elena Stepanova added a comment -

            rsevero,
            There is no need to theorize. Please do try to reproduce what you are describing with only two servers replicating from each other, no other replication channels whatsoever. If you succeed at doing so, please let me know.


            rsevero Rodrigo Severo added a comment -

            With only 2 servers I don't see the problem. I can only reproduce it with the 4-server setup I detailed above.


            rsevero Rodrigo Severo added a comment -

            The problem with the 4-server setup still exists in MariaDB 10.1.10.


            rsevero Rodrigo Severo added a comment -

            I believe there was some expectation that fixing MDEV-9033 would also fix this issue, but unfortunately that didn't happen.

            knielsen Kristian Nielsen added a comment - - edited

            Generally, a slave is not allowed to connect to a master on a GTID which is
            missing in the master's binlog. This is to prevent silent corruption.

            There are a couple of exceptions to this rule. One is that if the master has
            no GTIDs in a domain, then that domain is ignored. I think another is that
            the rule is relaxed in case of --gtid-ignore-duplicates=1, for reasons like
            described in this report.

            I think the request here is for another similar exception in case of
            --do-domain-ids. This could be reasonable, but it is not implemented
            currently.

            An implementation might be as the reporter suggests. When the slave sends
            its replication position to the master, omit those domains that are
            configured to be ignored. However, some careful thought is needed to
            consider all possible scenarios and ensure that this does not lead to
            incorrect results.
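
            For completeness, the relaxation mentioned above is controlled by this setting (shown here only as a sketch; whether it is safe for a given topology still needs the careful thought described above, and changing it may require slave threads to be stopped first):

            SELECT @@gtid_ignore_duplicates;        -- was 0/OFF in the reproduction above
            SET GLOBAL gtid_ignore_duplicates = 1;  -- should be set consistently on every server in the ring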


            elenst Elena Stepanova added a comment -

            knielsen, do you want it to be converted into a feature request?

            Elkin Andrei Elkin added a comment -

            The ignore_domain_ids option could be helpful to let the 12012 post-gtid-enabled slave connect successfully, without requiring the masters to forget/purge their old domain events.

            michaeldg Michaël de groot added a comment -

            I created a work-around for this: https://gitlab.com/de-groot-consultancy-ansible-roles/dba-toolkit/-/blob/main/files/galera-remove-local-domain.sh and https://gitlab.com/de-groot-consultancy-ansible-roles/dba-toolkit/-/blob/main/files/remove-mariadb-gtid-domain.sh The work-around will remove undesired GTID domains from the primary.

            People

              Assignee: knielsen Kristian Nielsen
              Reporter: rsevero Rodrigo Severo