MariaDB Server - MDEV-9108

"GTID not in master's binlog" error with {ignore|do}_domain_ids

Details

    Description

      Let's consider a 3-master setup where each server has 2 replication channels, one to each of the other 2 servers. These replication channels were set up with:

      SETTING: Server_id: 1 | IP: 10.0.3.223

      STOP ALL SLAVES;
      CHANGE MASTER "S1_R2" TO
        master_host = "10.0.3.136",
        master_user = "replicator",
        master_use_gtid = slave_pos,
        master_password = "password",
        do_domain_ids = (2);
      CHANGE MASTER "S1_R3" TO
        master_host = "10.0.3.171",
        master_user = "replicator",
        master_use_gtid = slave_pos,
        master_password = "password",
        do_domain_ids = (3);
      START ALL SLAVES;

      SETTING: Server_id: 2 | IP: 10.0.3.136

      STOP ALL SLAVES;
      CHANGE MASTER "S2_R1" TO
        master_host = "10.0.3.223",
        master_user = "replicator",
        master_use_gtid = slave_pos,
        master_password = "password",
        do_domain_ids = (1);
      CHANGE MASTER "S2_R3" TO
        master_host = "10.0.3.171",
        master_user = "replicator",
        master_use_gtid = slave_pos,
        master_password = "password",
        do_domain_ids = (3);
      START ALL SLAVES;

      SETTING: Server_id: 3 | IP: 10.0.3.171

      STOP ALL SLAVES;
      CHANGE MASTER "S3_R1" TO
        master_host = "10.0.3.223",
        master_user = "replicator",
        master_use_gtid = slave_pos,
        master_password = "password",
        do_domain_ids = (1);
      CHANGE MASTER "S3_R2" TO
        master_host = "10.0.3.136",
        master_user = "replicator",
        master_use_gtid = slave_pos,
        master_password = "password",
        do_domain_ids = (2);
      START ALL SLAVES;

      After initially starting all replication channels:
      1. stop server 1
      2. issue an INSERT/UPDATE/DELETE on server 2
      3. stop server 2
      4. start server 1. At this point replication channel S1_R3 comes up and runs immediately, as server 3 never stopped.
      5. start server 2. At this point replication channel S2_R3 comes up and runs immediately, as server 3 never stopped. BUT replication channel S2_R1 does not come up and reports an error like "Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 2-2-10, which is not in the master's binlog'", i.e. server 1 does not have the most up-to-date transaction from domain id 2.

      Observe that replication channel S2_R1 reports an error about a domain ID (2) that it has been explicitly told not to track at all! S2_R1 is supposed to track only domain ID 1.
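
      The channel's domain filter and the error can be checked on server 2 with something like the following (a sketch only; the values in the comments are what the setup above implies, not captured output):

      -- on server 2, inspect the failing connection
      SHOW SLAVE 'S2_R1' STATUS\G
      -- Replicate_Do_Domain_Ids: 1       (only domain 1 should be applied)
      -- Last_IO_Error: Got fatal error 1236 ... GTID 2-2-10 ... not in the master's binlog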

      The solution for this issue seems to be for MariaDB, on replication channel start, to send the GTID slave position only for the domain IDs that the channel should track, as defined by {ignore|do}_domain_ids.
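
      Until something like that exists, a rough manual approximation is to trim the position by hand before restarting a channel (a sketch only: the GTID values below are made up, SET GLOBAL gtid_slave_pos requires all slave threads to be stopped, and it changes the starting position for every connection that uses slave_pos, so only drop domains that no channel on this server needs to resume from):

      STOP ALL SLAVES;
      SELECT @@gtid_slave_pos;                 -- e.g. '1-1-42,2-2-10'
      SET GLOBAL gtid_slave_pos = '1-1-42';    -- keep only the domain-1 part
      START ALL SLAVES;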

          Activity


            elenst Elena Stepanova added a comment -

            Assuming for the time being that my theory in the comment to MDEV-9107 is correct, and events from the domain ID should actually be tracked (just not applied) – does S2's binlog actually have GTID 2-2-10?


            rsevero Rodrigo Severo added a comment -

            S2's binlog certainly has GTID 2-2-10, as it points to an event that was issued directly on S2.

            S3 has GTID 2-2-10 in its binlog as well, as it got it from S2 through replication channel S3_R2.

            But S1 doesn't have GTID 2-2-10 in its binlog, as it never got it from S2 (observe that GTID 2-2-10 was issued on S2 after S1 was stopped). This fact isn't a problem at all.

            The problem is that, because GTIDs for ALL domain ids tracked on the slave are sent to the master on replication channel (re)start, when S2 tries to re-establish replication channel S2_R1 with S1, it will fail repeatedly until S1 finally gets GTID 2-2-10.

            The point here is that in this particular situation we are dealing "only" with a transient, annoying and silly failure, as S1 will eventually get GTID 2-2-10 through replication channel S1_R2. I say it's silly because why should the temporary unavailability of a GTID from a domain id that is filtered out of the channel block the channel from coming up? Observe that the problematic GTID is 2-2-10, while channel S2_R1 should deal only with domain id 1. I would really expect that whatever the GTID position of domain id 2 is, it would not affect the availability of a replication channel that is set to deal only with domain id 1.
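
            A quick way to check which domain-2 GTIDs S1 actually has in its binlog is, for example (the comments just restate the scenario above, not captured output):

            SELECT @@gtid_binlog_pos;      -- on S1: no domain-2 entry as recent as 2-2-10
            SELECT @@gtid_binlog_state;    -- full per-domain binlog state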

            So far I have described a transient failure, but now let me describe an even simpler and more common scenario where this replication start failure is permanent and catastrophic.

            Let's consider 2 masters:

            SETTING: Server_id: 1 | IP: 10.0.3.223
            STOP ALL SLAVES;
            CHANGE MASTER "S1_R2" TO
              master_host = "10.0.3.136",
              master_user = "replicator",
              master_use_gtid = slave_pos,
              master_password = "password",
              do_domain_ids = (2);
            START ALL SLAVES;

            SETTING: Server_id: 2 | IP: 10.0.3.136
            STOP ALL SLAVES;
            CHANGE MASTER "S2_R1" TO
              master_host = "10.0.3.223",
              master_user = "replicator",
              master_use_gtid = slave_pos,
              master_password = "password",
              do_domain_ids = (1);
            START ALL SLAVES;

            This could be used with 2 different databases, where each database is changed on one server and the other server acts as its slave.
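
            For this to work, each server presumably writes its own changes under its own domain id (an assumption about the setup, not part of the configuration above), for example on server 1:

            SET GLOBAL gtid_domain_id = 1;   -- new connections on server 1 binlog their events under domain 1

            and the equivalent with domain id 2 on server 2.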

            Now suppose a network problem between the servers makes the replication channels on both servers stop. Only the network connection between the servers is down; both remain accessible to their own applications.

            In this case the GTID positions of both servers keep advancing while the connection between them is broken.

            When the network connection between the servers is restored, both will try to re-establish their replication channels with the other server, but both will fail: during channel restart each asks for GTIDs of its own domain id, which the other server doesn't have and never will.

            This happens because, during replication channel start, GTIDs for all domains tracked on the slave are sent to the master. If the slave only sent GTIDs for the domain ids the replication channel should actually deal with, this problem wouldn't happen at all, and the replication channels in all these setups would come back immediately without any hiccups.

            And again, please observe that in the first scenario we are talking about an annoying delay in restarting the replication channel, while in the second scenario we are talking about a fatal failure that indefinitely blocks the replication channels from being restarted at all.


            elenst Elena Stepanova added a comment -

            rsevero,

            Did you actually try the scenario you have described in the last comment? It's a good one, because as you said it is (almost) deterministic, so you'll get a reliable result. If you didn't try it yet, please do. If you tried and got the failure you are describing, please send cnf files from both servers, binary logs, error logs and output of show all slaves status \G where you observe the errors. Maybe you have encountered another bug there.

            Meanwhile, please consider – if it works as you described with do_domain_ids, it should surely be the same without any do_domain_ids clauses, right? In that case, servers would have no excuse whatsoever for not tracking both domains, so the starting GTIDs are certainly stored, and following the same logic, they would definitely fail upon reconnect?
            But at the same time, if any M<=>M setup failed so badly upon any disconnect, we would have complaints all over, and it is not happening.

            I think the reason is here:

            In this case the GTID of both servers will be updated during the inter network connection breakup.

            When the network connection between both servers are restored, both will try to re-establish their replication channels with the other server but both will fail as during channel restart both will ask for GTIDs for their own domain ids that the other server haven't got and never will.

            What will be updated during network problems is @@gtid_current_pos, @@gtid_binlog_pos, and Gtid_Slave_Pos in the slave status. But these are not the values that the slave uses while reconnecting to the master. Neither @@gtid_slave_pos, nor mysql.gtid_slave_pos table, nor Gtid_IO_Pos will be updated. So, when the network connection is restored, the slave will connect to the master requesting the exact same position it had upon replicating the last event.
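
            For reference, these are the values in question (just the standard MariaDB status queries; nothing here is output from this report):

            SELECT @@gtid_current_pos, @@gtid_binlog_pos, @@gtid_slave_pos;
            SELECT * FROM mysql.gtid_slave_pos;   -- per-domain positions the slave actually resumes from
            SHOW ALL SLAVES STATUS\G              -- compare Gtid_IO_Pos and Gtid_Slave_Pos per connection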

            Note: While looking into it, I realized there is a good reason for confusion – we were trying so hard to give meaningful names for GTID fields and variables, but the result turned out horrible: we have @@gtid_slave_pos variable, and we have Gtid_Slave_Pos field in the SHOW SLAVE STATUS output, and these values have nearly nothing in common...

            Now, back to your original scenario – I cannot reproduce the problem. I have no doubts that you actually observed it; it's possible that there is a race condition that I'm failing to hit, but it's also possible that it is a weird consequence of MDEV-9033. If you are getting the problem fairly frequently, it would be interesting to see if you can reproduce it without do_domain_ids (but with gtid-ignore-duplicates of course). As said above, if the problem is as you describe, do_domain_ids should make no difference; on the other hand, removing them will rule out the MDEV-9033 effect.
            If it is not reproducible or you cannot experiment, I suggest keeping it open and waiting until nirbhay_c confirms/declines that tracking the position of all domains is part of the do_domain_ids design, and until MDEV-9033 is fixed. After that, we can revisit it.


            rsevero Rodrigo Severo added a comment -

            When trying to implement the 6-server setup across 2 cities that I described in MDEV-9107, I experienced both problems I described here:

            • the transient one between servers in the same city, and
            • the catastrophic one with the 2 servers, one in each city, that were supposed to exchange the updates made in their cities.

            I will try to implement the setup I envision using gtid-ignore-duplicates as you suggested. Thanks for pointing me to this option. It might work, and it will be great if it does.

            But the problem I mention here is real. If you are not receiving more reports about it, I bet that is just because few people are exercising GTID-based replication that much; it being a new concept, people facing problems probably just leave it alone, thinking "Am I doing something wrong?" Why do I say so? Because MDEV-9033 is a terrible bug and you are not receiving tons of reports about it either.

            If people were really trying to use GTID-based replication to its full potential, I'm sure you would be hearing reports like mine a lot, starting with MDEV-9033 of course.

            And the ones that are actually trying might be giving up prematurely because of issues like the ones we are discussing.

            Thanks again for your help. I will return with new info after my tests with gtid-ignore-duplicates.


            rsevero Rodrigo Severo added a comment -

            One detail that is probably important:

            every time I faced the issues described here, I had actually stopped replication for one reason or another. It was on START SLAVE that I got the problems mentioned here.

            The network connection problem was just me trying to imagine a more common situation where these issues might happen, but the network-failure situation and the STOP SLAVE/START SLAVE situation don't share the "slave sending start GTIDs to the master" step. Only the STOP SLAVE/START SLAVE one has this step, and this is the step where these issues happen.

            Sorry for the confusion.

            elenst Elena Stepanova added a comment - - edited

            When trying to implement the 6-server setup across 2 cities that I described in MDEV-9107, I experienced both problems I described here:
            ... the catastrophic one with the 2 servers, one in each city, that were supposed to exchange the updates made in their cities.

            Please provide the error logs, binary logs and cnf files from both servers.

            If it's reproducible for you with STOP SLAVE, even better – please do reproduce with STOP SLAVE and attach the logs/configs and the SHOW ALL SLAVES STATUS output.

            I'm talking about the two-server setup that you described before.
            If you experience the problem only on your 6-server setup which also uses do_domain_ids, that's another story – MDEV-9033 can have all kinds of side-effects, it does not make sense to dig until it's fixed.

            If people were really trying to use GTID based replication to it's full potential, I'm sure you would be hearing reports like mine a lot, starting with MDEV-9033 of course

            This is very true of course, but you've missed my point. The problem with the two-server setup as you described it does not require trying GTID to its full potential – it should affect pretty much every user who ever tries a simple 2-master topology and ever stops at least one slave.
            Once again, I'm not claiming that you did not encounter it, but there must be something more than you described. And I'm not just making theories, I did try it of course. That's why I would like you to actually try to do what you described (rather than describe what you previously did), and if it indeed happens for you according to your description, provide the logs and configuration because clearly there is something that's missing from the description.

            For one, I asked before but you never answered how you currently avoid MDEV-9033. Your setup should trigger it all the time, and your replication should not proceed further than one or two events on one of the slaves; yet, you describe rather advanced scenarios with restarting servers and all that, so apparently there is something in your configuration that you did not mention that lets you work around MDEV-9033 (but possibly causes some other problems that we don't know about).


            rsevero Rodrigo Severo added a comment -

            To see the catastrophic failure, do the following:

            SETTING: Server_id: 1 | Domain_id: 1 | IP: 10.0.0.1
            STOP ALL SLAVES;
            CHANGE MASTER "S1_R2" TO
            master_host = "10.0.0.2",
            master_user = "replicator",
            master_use_gtid = slave_pos,
            master_password = "password",
            do_domain_ids = (2);
            CHANGE MASTER "S1_R3" TO
            master_host = "10.0.0.3",
            master_user = "replicator",
            master_use_gtid = slave_pos,
            master_password = "password",
            do_domain_ids = (3,4);
            START ALL SLAVES;

            SETTING: Server_id: 2 | Domain_id: 2 | IP: 10.0.0.2
            STOP ALL SLAVES;
            CHANGE MASTER "S2_R1" TO
            master_host = "10.0.0.1",
            master_user = "replicator",
            master_use_gtid = slave_pos,
            master_password = "password",
            do_domain_ids = (1,3,4);
            START ALL SLAVES;

            SETTING: Server_id: 3 | Domain_id: 3 | IP: 10.0.0.3
            STOP ALL SLAVES;
            CHANGE MASTER "S3_R1" TO
            master_host = "10.0.0.1",
            master_user = "replicator",
            master_use_gtid = slave_pos,
            master_password = "password",
            do_domain_ids = (1,2);
            CHANGE MASTER "S3_R4" TO
            master_host = "10.0.0.4",
            master_user = "replicator",
            master_use_gtid = slave_pos,
            master_password = "password",
            do_domain_ids = (4);
            START ALL SLAVES;

            SETTING: Server_id: 4 | Domain_id: 4 | IP: 10.0.0.4
            STOP ALL SLAVES;
            CHANGE MASTER "S4_R3" TO
            master_host = "10.0.0.3",
            master_user = "replicator",
            master_use_gtid = slave_pos,
            master_password = "password",
            do_domain_ids = (1,2,3);
            START ALL SLAVES;

            All of them should have gtid_ignore_duplicates set to OFF.

            After starting all replication channels and confirming that they are all up and running:
            1. on server S1 do: CREATE TABLE t1 (i INT); INSERT INTO t1 VALUES (1); CREATE TABLE t2 (i INT);
            2. wait to be sure these commands have been executed by all 4 servers
            3. on server S1 do: STOP SLAVE 'S1_R3';
            4. on server S3 do: STOP SLAVE 'S3_R1';
            5. on server S2 do: INSERT INTO t1 VALUES (2); INSERT INTO t1 VALUES (3);
            6. on server S4 do: INSERT INTO t2 VALUES (4); INSERT INTO t2 VALUES (5);
            7. try to restart the replication channels: START SLAVE 'S1_R3'; on server S1 and START SLAVE 'S3_R1'; on server S3

            Watch how both of them will fail with a message of the form:

            “Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 2-2-10, which is not in the master's binlog'”

            I have this setup running where servers 1 and 2 are in one city and servers 3 and 4 are in another.
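
            After the failed restart, the mismatch can be seen with something like the following (a sketch; connection names and domain ids are the ones from the setup above, and the comments describe what the scenario implies, not captured output):

            -- on S1:
            SHOW SLAVE 'S1_R3' STATUS\G    -- Last_IO_Error shows the 1236 error
            SELECT @@gtid_binlog_pos;      -- no domain-4 GTIDs newer than the ones from before the STOP
            -- on S3:
            SHOW SLAVE 'S3_R1' STATUS\G
            SELECT @@gtid_binlog_pos;      -- no domain-2 GTIDs newer than the ones from before the STOP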


            elenst Elena Stepanova added a comment -

            Okay, good, so there are 4 servers after all, not just 2.
            In this case I'm inclined to think that the problem is indeed a side-effect of MDEV-9033, but I will try it shortly, hopefully I'll find a clear proof.


            rsevero Rodrigo Severo added a comment -

            I'm quite sure this isn't a side effect of MDEV-9033 as, AFAIU, MDEV-9033 is about MariaDB replication creating, out of the blue, a new GTID for an event received from a master, an event which obviously already had its own GTID. Because this spurious GTID is created, the loop is created.

            This issue is about a slave asking for a GTID of a domain id that shouldn't be handled by a replication channel, and the master refusing to start said replication channel because the requested GTID of this not-to-be-handled domain id is too recent for the master's binlog.


            elenst Elena Stepanova added a comment -

            rsevero,
            There is no need to theorize. Please do try to reproduce what you are describing with only two servers replicating from each other, no other replication channels whatsoever. If you succeed at doing so, please let me know.


            rsevero Rodrigo Severo added a comment -

            With only 2 servers I don't see the problem. I can only reproduce it with the 4-server setup I detailed above.


            rsevero Rodrigo Severo added a comment -

            The problem with the 4-server setup still exists in MariaDB 10.1.10.


            rsevero Rodrigo Severo added a comment -

            I believe there was some expectation that fixing MDEV-9033 would also fix this issue, but unfortunately that didn't happen.

            knielsen Kristian Nielsen added a comment - - edited

            Generally, a slave is not allowed to connect to a master on a GTID which is
            missing in the master's binlog. This is to prevent silent corruption.

            There are a couple of exceptions to this rule. One is that if the master has
            no GTIDs in a domain, then that domain is ignored. I think another is that
            the rule is relaxed in case of --gtid-ignore-duplicates=1, for reasons like
            described in this report.

            I think the request here is for another similar exception in case of
            --do-domain-ids. This could be reasonable, but it is not implemented
            currently.

            An implementation might be as the reporter suggests. When the slave sends
            its replication position to the master, omit those domains that are
            configured to be ignored. However, some careful thought is needed to
            consider all possible scenarios and ensure that this does not lead to
            incorrect results.
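
            For completeness, the relaxation mentioned above is controlled by this setting (shown here only as a sketch; whether it is safe for a given topology still needs the careful thought described above, and changing it may require slave threads to be stopped first):

            SELECT @@gtid_ignore_duplicates;        -- was 0/OFF in the reproduction above
            SET GLOBAL gtid_ignore_duplicates = 1;  -- should be set consistently on every server in the ring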


            elenst Elena Stepanova added a comment -

            knielsen, do you want it to be converted into a feature request?

            Elkin Andrei Elkin added a comment -

            The ignore_domain_ids option could be helpful to let the 12012 post-gtid-enabled slave connect successfully, without requiring the masters to forget/purge their old domain events.

            michaeldg Michaël de groot added a comment -

            I created a work-around for this: https://gitlab.com/de-groot-consultancy-ansible-roles/dba-toolkit/-/blob/main/files/galera-remove-local-domain.sh and https://gitlab.com/de-groot-consultancy-ansible-roles/dba-toolkit/-/blob/main/files/remove-mariadb-gtid-domain.sh The work-around will remove undesired GTID domains from the primary.

            People

              Assignee: knielsen Kristian Nielsen
              Reporter: rsevero Rodrigo Severo