[MDEV-11969] Can't remove GTIDs for a stale GTID Domain ID Created: 2017-02-01  Updated: 2017-12-12  Resolved: 2017-12-12

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.1.19, 10.1.20
Fix Version/s: 10.1.30

Type: Bug Priority: Major
Reporter: Jan Kunzmann (Inactive) Assignee: Andrei Elkin
Resolution: Duplicate Votes: 1
Labels: None

Issue Links:
Duplicate
is duplicated by MDEV-12012 gtid_domain_id doesn't work with mult... Closed
Relates
relates to MDEV-9108 "GTID not in master's binlog" error w... Open

 Description   

I've simplified my situation for this bug description:

We have two independent MariaDB clusters, each a regular master-slave setup (let's call the masters A and B). There's also a MariaDB server acting as a data warehouse, using multi-source replication from both masters. All replication channels were created in the pre-GTID era using binlog_file and binlog_pos. Of course, both masters had already generated GTIDs for the default domain id 0.

When we migrated to GTID-based replication, we configured master A with domain id 1 and B with domain id 2. All slaves in group A now have two GTIDs in gtid_slave_pos: one for domain 1 with an increasing sequence counter, and one for the former default domain 0 with a static sequence counter. Master A also keeps track of this domain 0 GTID via gtid_binlog_pos (and gtid_binlog_state).

For master B and its slaves the same applies for domain 2 and 0, respectively. So far this is not a problem.
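
The state described above can be inspected with the standard GTID system variables. This is a minimal sketch; which server each query runs on is noted in the comments, and no actual values from this setup are shown:

```sql
-- On a slave in group A: the last GTID applied by replication, one entry per domain
-- (expected here: one entry for domain 1 and one stale entry for domain 0).
SELECT @@global.gtid_slave_pos;

-- On master A: the last GTID written to the binlog per domain, and the
-- full binlog state (per domain and server id).
SELECT @@global.gtid_binlog_pos, @@global.gtid_binlog_state;

-- On any server: the domain id used for transactions originating there.
SELECT @@global.gtid_domain_id;
```

Each GTID has the form domain-server_id-sequence, so the stale entry is the one whose first component is 0.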

However, it's not possible to introduce GTID-based replication on the warehouse. The last statement written in the pre-GTID era for the default domain id 0 originated from master B and has a lower sequence number than the GTID for domain 0 on master A.

Therefore, when executing

CHANGE MASTER "A" TO master_use_gtid = slave_pos, do_domain_id = (1), ignore_domain_id = ();

on the warehouse to allow its replication to use GTID, A attempts to scan the binlog not only for domain 1, but also for domain 0 (despite do_domain_id). Because the sequence number for domain 0 is lower than the one in A's gtid_binlog_pos, A refuses the connection with

Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-XXXX-YYYY, which is not in the master's binlog'

There's no way to discard knowledge of domain 0 on masters A and B except by setting gtid_binlog_state, which would require a RESET MASTER and therefore isn't feasible in live operation.

I assume that this issue is similar to MDEV-9108, which (as far as I understand) basically asks for do_domain_id to also tell the master to ignore all other domains when scanning the binlogs for the starting position. That would solve my issue.

Still, I believe it would be easier to allow altering gtid_binlog_pos on the master (not directly or via gtid_binlog_state, but through a function call) so that it forgets the GTIDs of a specific domain id without issuing RESET MASTER.



 Comments   
Comment by Elena Stepanova [ 2017-02-01 ]

plinux, I'll leave it to you to choose and set the 'Fix version'.

Comment by Michael Gmelin [ 2017-03-04 ]

We're facing this issue in various similar replication setups as well. After changing a server's gtid_domain_id, it's not possible to get rid of the last GTID of the previous domain in gtid_binlog_state on the master without using "reset master", and the slaves get stuck with fatal error 1236. do_domain_id doesn't help, as the slaves always check gtid_binlog_state and try to look up all GTIDs in the master's binary log.

Comment by Andrei Elkin [ 2017-09-06 ]

Lixun, hello.

Let me take this ticket over from you, since I am implementing exactly the requested measure in MDEV-12012.
Your feedback is welcome!

Andrei

Comment by Andrei Elkin [ 2017-12-12 ]

MDEV-12012 fixed this issue in 10.1.30.
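
For readers landing here: as far as I know, the measure delivered by MDEV-12012 is the DELETE_DOMAIN_ID option of FLUSH BINARY LOGS. The following is only a sketch under assumptions: domain 0 is stale and receives no new events on this master, and the binlog file name is hypothetical. The statement is expected to refuse to delete a domain that still has events in the existing binlog files, so those files have to be purged first.

```sql
-- Rotate to a fresh binlog file so the older files can be purged.
FLUSH BINARY LOGS;

-- Purge the files that still contain domain-0 events.
-- 'mysql-bin.000123' is a hypothetical name; pick the real one from SHOW BINARY LOGS,
-- and make sure no slave still needs the purged files.
SHOW BINARY LOGS;
PURGE BINARY LOGS TO 'mysql-bin.000123';

-- Remove the stale domain 0 from gtid_binlog_state without RESET MASTER.
FLUSH BINARY LOGS DELETE_DOMAIN_ID=(0);

-- Verify that domain 0 is gone.
SELECT @@global.gtid_binlog_state;
```

This only changes the master's side. A slave whose gtid_slave_pos still lists domain 0 may need that entry removed separately.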
